osiegmar / FastCSV

CSV library for Java that is fast, RFC-compliant and dependency-free.
https://fastcsv.org/
MIT License
542 stars 93 forks

Performance regression with 2.1.0 #63

Closed shollander closed 2 years ago

shollander commented 2 years ago

Thank you for your work on this great product! Its performance has significantly improved the performance of our application.

We had initially been using version 1.0.4, which greatly improved the performance of our CSV parsing (over commons-csv which we had been previously using). We recently tried upgrading to version 2.1.0. I like the new API, however, we noticed that there was a significant performance degradation over 1.0.4. We have a little bit of a unique data format that we deal with, which involves an embedded CSV list within a CSV column. This is how our data looks:

NAME,NUMBER,WIDGETS_LIST
john doe,123456,"""thequickbrownfoxjumpedoverthelazydog"""
john smith,7890123,"""thequickbrownfoxjumpedoverthelazydog1"",""thequickbrownfoxjumpedoverthelazydog2"""

The WIDGETS_LIST column is a variable length list that is formatted as an embedded csv string. Each item in the list is usually around 200 characters long.
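The triple quotes in the sample data come from standard RFC 4180 escaping applied twice: each quote inside a field is doubled, and the whole field is wrapped in quotes. A minimal, library-free sketch of that encoding (class and method names here are illustrative, not from FastCSV):

```java
public class EmbeddedCsvDemo {

    // Quote a field per RFC 4180: double embedded quotes, wrap in quotes.
    static String quoteField(String value) {
        return "\"" + value.replace("\"", "\"\"") + "\"";
    }

    public static void main(String[] args) {
        // The inner CSV list, whose items are themselves quoted...
        String embedded = "\"widget1\",\"widget2\"";
        // ...becomes a single outer field with doubled quotes.
        System.out.println(quoteField(embedded));
        // → """widget1"",""widget2"""
    }
}
```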

With fastcsv 1.0.4 we would parse the data with code like this:

class Parser {

  List<Client> parseCsv(Path file) throws IOException {
    List<Client> clients = new ArrayList<>();
    CsvReader csvReader = new CsvReader();
    try(var parser = csvReader.parse(file, StandardCharsets.UTF_8)) {
      CsvRow row;
      while( (row = parser.nextRow()) != null) {
        String name = row.getField(0);
        String number = row.getField(1);
        List<String> widgets = parseWidgets(row.getField(2));
        clients.add(new Client(name, number, widgets));
      }
    }
    return clients;
  }
  List<String> parseWidgets(String data) throws IOException {
    CsvReader csvReader = new CsvReader();
    CsvParser parser = csvReader.parse(new StringReader(data));
    CsvRow row = parser.nextRow();
    return row != null ? List.copyOf(row.getFields()) : List.of();
  }

}

With fastcsv 2.1.0 we parse with code like this:

class Parser {

  List<Client> parseCsv(Path file) throws IOException {
    try (var parser = CsvReader.builder().build(file)) {
      return parser.stream()
          .map(row -> {
            String name = row.getField(0);
            String number = row.getField(1);
            List<String> widgets = parseWidgets(row.getField(2));
            return new Client(name, number, widgets);
          })
          .toList();
    }
  }
  List<String> parseWidgets(String data) {
    return CsvReader.builder().build(data)
        .stream().flatMap(r -> r.getFields().stream())
        .toList();
  }

}

Very surprisingly, the fastcsv 2.1.0 code takes around twice as long to parse the CSV data as version 1.0.4. It seems to be related to the embedded CSV string, since for other data without the embedded CSV, 2.1.0 is actually faster than 1.0.4. However, I cannot figure out why the embedded CSV is causing such a significant slowdown. To get meaningful performance results we benchmarked with a CSV file containing about 1 million rows, and processed the same file 10 times per run.

Additional context

Java distribution and version (output of java -version):

openjdk version "17.0.2" 2022-01-18
OpenJDK Runtime Environment Temurin-17.0.2+8 (build 17.0.2+8)
OpenJDK 64-bit Server VM Temurin-17.0.2+8 (build 17.0.2+8, mixed mode, sharing)
osiegmar commented 2 years ago

Thanks for your feedback! After a quick code comparison between 1.0.4 and 2.1.0 regarding quoted fields I don't have an immediate assumption. I need to analyze this...

BTW: You should keep an instance of a configured CsvReader (return of CsvReader.builder()) in memory and not create new unnecessary instances on each invocation of your parseWidgets method.

shollander commented 2 years ago

Thank you for your suggestion to reuse the CsvReader. I assume you were actually referring to the CsvReaderBuilder and not the CsvReader, since a CsvReader is tied to the data being parsed and, as far as I can tell, cannot be reused.

I can see why it makes sense to reuse the builder. I did try incorporating that, but there was no performance difference.
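A sketch of what that builder reuse might look like against the 2.x API (the nested CsvReaderBuilder type and the exception handling are my assumptions, not code from the original post):

```java
import de.siegmar.fastcsv.reader.CsvReader;

import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.List;
import java.util.stream.Collectors;

class WidgetsParser {

    // One configured builder, created once and reused for every
    // embedded-CSV field, instead of calling CsvReader.builder()
    // on each invocation.
    private final CsvReader.CsvReaderBuilder builder = CsvReader.builder();

    List<String> parseWidgets(String data) {
        try (CsvReader reader = builder.build(data)) {
            return reader.stream()
                    .flatMap(row -> row.getFields().stream())
                    .collect(Collectors.toList());
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

As noted, this avoids re-creating the builder per row, but each build(data) call still allocates a fresh CsvReader (and its buffer), which is the cost discussed below.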

shollander commented 2 years ago

I have also been having a similar performance issue when writing out these CSV files. Writing out the CSV with fastcsv is almost 3X slower than when using commons-csv. Based on my testing it is clear that the problem is caused by the embedded CSV.

After some experimentation I have determined that the problem is the internal buffering that fastcsv forces clients to use. The buffer forces an 8K allocation of memory for each row processed, which causes a significant slow down. I proved this by patching the CsvWriter class to just use the provided Writer object and not wrap with a CachingWriter, and then my write performance improved significantly and surpassed commons-csv. I did not perform the same test with the CsvReader, since there the Buffer is more entangled in the RowReader code and would be harder for me to remove. However, I am pretty sure that the Buffer is causing the issue there as well.

I would suggest that fastcsv provide a mechanism for users to opt-out of buffering for both CsvWriter and CsvReader. Users should be able to provide their own buffering, or may not need it at all. In my case, the data for the embedded csv is already in memory as a string so buffering is not needed and actually harms performance.

osiegmar commented 2 years ago

Thanks again for your feedback.

I don't understand those two statements, yet:

Based on my testing it is clear that the problem is caused by the embedded CSV.

After some experimentation I have determined that the problem is the internal buffering that fastcsv forces clients to use.

...they seem to conflict as embedded CSV shouldn't be related to the internal buffering.

I have invested a lot of time in benchmark comparisons, and those tests also include embedded CSV.

Could you provide some sample data or even a JMH test that can be used to do more analysis?

Based on your initial code examples, I could imagine that you're building thousands of new CsvWriter instances in order to write embedded CSV data? That is clearly a use case I didn't think of when designing the architecture of FastCSV.

shollander commented 2 years ago

The problem is that for each line of the main CSV file, I need to create a CsvWriter or CsvReader to write/parse the embedded CSV. So for a file with 1 million lines, that means 1 million CsvWriters created, each with an 8K buffer allocation.

If the CsvWriter/CsvReader was reusable, that might also solve the problem. However, I think a cleaner solution would be to allow the user to avoid internal buffering.

I provided sample data and code for reading the data in my original post.

Here is some sample code for generating the data:

class TestGenerateCsv {

    @Test
    void testGenerate(@TempDir Path tempDir) throws IOException {
        int iterations = 10;
        for(int i = 0; i < iterations; i++) {
           Path file = Files.createTempFile(tempDir, null, ".csv");
           generateCsv(file);
        }
    }

    private void generateCsv(Path csvFile) throws IOException {
        int rows = 1_000_000;
        try(CsvWriter csvWriter = CsvWriter.builder().build(csvFile, StandardCharsets.UTF_8, StandardOpenOption.TRUNCATE_EXISTING)) {
            for(int i = 0; i < rows; i++) {
                String name = RandomStringUtils.randomAlphabetic(20);
                String number = RandomStringUtils.randomNumeric(10);
                String widgets = generateWidgets();
                csvWriter.writeRow(name, number, widgets);
            }
        }
    }
    private String generateWidgets() {
        int numberOfWidgets = 2;
        List<String> widgets = new ArrayList<>(numberOfWidgets);
        for(int i = 0; i < numberOfWidgets; i++) {
            String widget = RandomStringUtils.random(20);
            widgets.add(widget);
        }
        StringWriter writer = new StringWriter();
        CsvWriter csvWriter = CsvWriter.builder().quoteStrategy(QuoteStrategy.ALWAYS).build(writer);
        csvWriter.writeRow(widgets);
        return writer.toString()
                // need to strip line separator that CsvWriter forces on us
                .stripTrailing();
    }

}
osiegmar commented 2 years ago

However, I think a cleaner solution would be to allow the user to avoid internal buffering.

The internal buffering is a crucial part of FastCSV's performance.

FastCSV is optimized for runtime performance, not for instantiation performance. Unfortunately, your use case (CSV data embedded within a single CSV field) combines both aspects, and this is clearly something FastCSV was not designed for.

shollander commented 2 years ago

However, I think a cleaner solution would be to allow the user to avoid internal buffering.

The internal buffering is a crucial part of FastCSVs performance.

I understand that the internal buffering is integral to the design and performance of FastCSV. Perhaps an alternate solution would be to provide a configuration setting to allow the user to set the size of the buffer. This way the user can optimize the buffer size for the use case.

osiegmar commented 2 years ago

You may want to give the string-without-buffer branch a spin. This should improve Reader performance for String input.

shollander commented 2 years ago

I tested out the string-without-buffer branch. It definitely significantly improves performance for processing my data. My throughput more than doubled (!) compared to the master branch.

If you can do something similar for the CsvWriter that would really be awesome! (Perhaps providing a separate method for CsvWriterBuilder.build(StringWriter))

I am looking forward to using these new features in an upcoming release. Thank you!

osiegmar commented 2 years ago

You may now try to use CsvWriter.builder().bufferSize(0).build(stringWriter).
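For reference, a sketch of how the generateWidgets code from the earlier comment could pick up this option (bufferSize(0) as named above; everything else follows the snippet already posted in this thread):

```java
import de.siegmar.fastcsv.writer.CsvWriter;
import de.siegmar.fastcsv.writer.QuoteStrategy;

import java.io.StringWriter;
import java.util.List;

class WidgetsFormatter {

    // Serialize the widgets list to an embedded-CSV string entirely in
    // memory; bufferSize(0) skips the internal 8K buffer allocation that
    // dominated per-row cost in this thread's benchmarks.
    static String formatWidgets(List<String> widgets) {
        StringWriter writer = new StringWriter();
        CsvWriter.builder()
                .quoteStrategy(QuoteStrategy.ALWAYS)
                .bufferSize(0)
                .build(writer)
                .writeRow(widgets);
        // strip the trailing line separator CsvWriter appends
        return writer.toString().stripTrailing();
    }
}
```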

shollander commented 2 years ago

I benchmarked the latest code in the string-without-buffer branch for my writing use case. This definitely solves the problem. My write speed is now approximately 49% faster! Thanks!

osiegmar commented 2 years ago

Just released as part of 2.2.0

shollander commented 2 years ago

Awesome, thanks for getting this out in a timely manner!

osiegmar commented 2 years ago

You're welcome. Thank you for your valuable and quick feedback!