joelverhagen / NCsvPerf

A test bench for various .NET CSV parsing libraries
https://www.joelverhagen.com/blog/2020/12/fastest-net-csv-parsers
MIT License

Make the benchmark unit test pass in the presence of quoted fields. #52

Closed MarkPflug closed 5 months ago

MarkPflug commented 1 year ago

The NCsvPerf benchmark file, PackageAssets.csv, doesn't contain quotes, but all existing parsers can be configured to produce consistent results in the presence of a quoted field. This PR adds the required configuration for the few libraries that don't do this by default. Note that this PR doesn't change the CSV file, so the benchmark still doesn't test correctness. To test this, I modified PackageAssets.csv and changed >Akinzekeel.BlazorGrid< to >"Akinzekeel,BlazorGrid"< on the first row, adding quotes and putting a comma in the middle. I exclude string.Split from the unit test, since the naive implementation is expected to produce incorrect results.
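To illustrate why a naive split is expected to fail on the modified row, here is a minimal sketch in Python rather than C#, purely for illustration. Only the first field comes from the description above; the `col2` and `col3` columns are placeholders, and Python's standard `csv` module stands in for a quote-aware parser (it is not one of the benchmarked libraries).

```python
import csv
import io

# Hypothetical row: only the first field comes from the PR description;
# the remaining columns are placeholders.
line = '"Akinzekeel,BlazorGrid",col2,col3'

# Naive approach, analogous to string.Split(','): it also splits on the
# comma inside the quotes and leaves the quote characters in place.
naive_fields = line.split(",")

# Quote-aware parse using the standard library.
parsed_fields = next(csv.reader(io.StringIO(line)))

print(naive_fields)   # ['"Akinzekeel', 'BlazorGrid"', 'col2', 'col3']
print(parsed_fields)  # ['Akinzekeel,BlazorGrid', 'col2', 'col3']
```

The naive split yields four fields instead of three, which is why it cannot agree with the other parsers on this input.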

This PR is a bit reactionary to @nietras's submission of Sep (#51), as Sep would be the only library that produces inconsistent results: it doesn't trim or unescape quoted values, and I don't see any configuration to enable that. Personally, I see this as fundamental functionality that a CSV library should provide, as without it a user would have to implement their own mechanism to process values. It feels a bit misleading for a parser claiming to be the fastest to produce results that disagree with every other CSV library. I think most users would be surprised by this behavior, even though it is clearly stated in the docs.
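As a rough idea of the "own mechanism" a user would need when a parser hands back a field with its quotes intact, here is a minimal sketch in Python. The helper name `unquote_field` is hypothetical and is not part of Sep or any other library; it assumes the conventional rule that a doubled quote inside a quoted field stands for one literal quote.

```python
def unquote_field(raw: str, quote: str = '"') -> str:
    """Strip surrounding quotes and collapse doubled quotes (hypothetical helper)."""
    if len(raw) >= 2 and raw[0] == quote and raw[-1] == quote:
        # Drop the outer quotes, then turn each "" back into a single ".
        return raw[1:-1].replace(quote * 2, quote)
    # Unquoted fields are returned unchanged.
    return raw


print(unquote_field('"Akinzekeel,BlazorGrid"'))  # Akinzekeel,BlazorGrid
print(unquote_field('"a""b"'))                   # a"b
print(unquote_field('plain'))                    # plain
```

This step is small, but every caller would have to remember to apply it to every field, which is the burden described above.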

MarkPflug commented 1 year ago

I added a few unit test cases to compare how the parsers behave on certain edge cases. They cover the following scenarios:

- Quoted: "a"
- QuotedComma: "a,b"
- QuotedQuote: "a""b"
- QuotedNewLine: "a\r\nb"
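The four scenarios can be sketched as a small table-driven check. This is an illustration in Python using the standard `csv` module, not the unit tests added in this PR. The expected values for the first three cases are assumptions based on the usual RFC 4180 reading of quoted fields, and the names `CASES` and `parse_single_field` are hypothetical.

```python
import csv
import io

# Raw input for each scenario, paired with the value a conforming parser
# is expected to return for the single field.
CASES = {
    "Quoted": ('"a"', "a"),
    "QuotedComma": ('"a,b"', "a,b"),
    "QuotedQuote": ('"a""b"', 'a"b'),
    "QuotedNewLine": ('"a\r\nb"', "a\r\nb"),
}


def parse_single_field(raw: str) -> str:
    # newline="" keeps "\r\n" untranslated so the reader sees it verbatim.
    rows = list(csv.reader(io.StringIO(raw, newline="")))
    if len(rows) != 1 or len(rows[0]) != 1:
        raise ValueError(f"expected one row with one field, got {rows!r}")
    return rows[0][0]


results = {name: parse_single_field(raw) for name, (raw, _) in CASES.items()}
for name, (_, expected) in CASES.items():
    print(name, repr(results[name]), results[name] == expected)
```

Keeping the raw input and the expected value side by side makes it easy to see which scenario a given library disagrees on.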

QuotedNewLine is the only case that currently fails: some libraries throw an exception or produce differing results. Most libraries produce the correct, expected result: a\r\nb. The "DSV" library turns the "\r\n" into "\n" in the output: a\nb

The following parsers throw an exception when processing this file:

I don't know if this test will survive git, which has a tendency to modify line endings in files. This might partially satisfy issue #42.

Edit: The newline did not survive git, and was replaced with a \n, which is what I worried might happen.
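One way to notice this kind of normalization early is to check the fixture's raw bytes before running any parser, so a rewritten checkout fails with a clear reason instead of a confusing mismatch. The sketch below is in Python, and the function name `has_crlf_inside_quotes` is hypothetical; it is not part of this PR.

```python
def has_crlf_inside_quotes(data: bytes) -> bool:
    """Return True if a CRLF pair occurs inside a double-quoted field."""
    in_quotes = False
    for i, byte in enumerate(data):
        if byte == ord('"'):
            # An escaped quote ("") toggles twice, leaving the state unchanged.
            in_quotes = not in_quotes
        elif in_quotes and data[i:i + 2] == b"\r\n":
            return True
    return False


print(has_crlf_inside_quotes(b'"a\r\nb"\r\n'))  # True
print(has_crlf_inside_quotes(b'"a\nb"\n'))      # False
```

The second call mirrors what happened here: once the embedded "\r\n" has been replaced with "\n", the check returns False and the QuotedNewLine case no longer tests what it was written to test.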

joelverhagen commented 1 year ago

The results in this PR are very interesting. If I find some time, I think it would be good to assess quote handling and newline handling per library and add it as a ✔️ or ✖️ to the results table in my blog post. I think I could use this PR as a starting point.