Closed MarkPflug closed 5 months ago
I added a few unit test cases to compare the behavior of the parsers on certain edge cases.
This covers the following scenarios:
Quoted: "a"
QuotedComma: "a,b"
QuotedQuote: "a""b"
QuotedNewLine: "a\r\nb"
QuotedNewLine is the only scenario that currently fails: some libraries throw an exception, and others produce differing results.
Most libraries produce the correct, expected result: a\r\nb.
The "DSV" library turns the "\r\n" into "\n" in the output: a\nb
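For reference, the four scenarios can be reproduced outside the benchmark with any quote-aware parser. The following is a minimal sketch using Python's standard `csv` module, not the unit tests from this PR. The scenario names and inputs come from the list above; the helper names `parse_single_field` and `run_cases` are made up for illustration.

```python
import csv
import io

# Scenario names and inputs are taken from the list above. Each input is a
# single-field record terminated by CRLF, paired with the expected value.
CASES = {
    "Quoted": ('"a"\r\n', "a"),
    "QuotedComma": ('"a,b"\r\n', "a,b"),
    "QuotedQuote": ('"a""b"\r\n', 'a"b'),
    "QuotedNewLine": ('"a\r\nb"\r\n', "a\r\nb"),
}


def parse_single_field(text):
    """Parse text as CSV and return the only field of the only record."""
    # newline="" stops StringIO from translating the embedded CRLF.
    rows = list(csv.reader(io.StringIO(text, newline="")))
    assert len(rows) == 1 and len(rows[0]) == 1
    return rows[0][0]


def run_cases():
    """Return {scenario name: parsed value} for every scenario."""
    return {name: parse_single_field(text) for name, (text, _) in CASES.items()}
```

A parser that behaves like the "DSV" result described above would return `a\nb` for the last case instead of `a\r\nb`.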
The following parsers throw an exception when processing this file:
I don't know if this test will survive git, which tends to modify line endings in files. This might partially satisfy issue #42.
Edit: The newline did not survive git and was replaced with a \n, which is what I worried might happen.
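One way to make such a test independent of git's line-ending handling is to generate the input in code with an explicit `\r\n`, rather than checking in a data file. The sketch below assumes that approach and is not what this PR does. It is written in Python for brevity, and the file name and helper names are hypothetical.

```python
import csv
import os
import tempfile


def write_crlf_fixture(path):
    """Write the QuotedNewLine input as raw bytes so CRLF is stored exactly."""
    with open(path, "wb") as f:
        f.write(b'"a\r\nb"\r\n')


def read_first_field(path):
    """Parse the fixture; newline="" keeps the embedded CRLF untranslated."""
    with open(path, "r", newline="", encoding="utf-8") as f:
        return next(csv.reader(f))[0]


def roundtrip():
    """Create the fixture in a temp directory and return (raw bytes, parsed field)."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "quoted_newline.csv")  # hypothetical file name
        write_crlf_fixture(path)
        with open(path, "rb") as f:
            raw = f.read()
        return raw, read_first_field(path)
```

Checking the raw bytes before parsing makes it obvious when a tool has rewritten the line ending, so a mismatch is not attributed to the parser by mistake.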
The results in this PR are very interesting. If I find some time, I think it would be good to assess quote handling and newline handling per library and add it as a ✔️ or ✖️ to the results table in my blog post. I think I could use this PR as a starting point.
The NCsvPerf benchmark PackageAssets.csv file doesn't contain quotes, but all existing parsers can be configured to produce consistent results in the presence of a quoted field. This PR adds the required configuration for a few odd libraries that don't do this by default. Note that this PR doesn't change the CSV file, so the benchmark still doesn't test correctness. To test this, I modified PackageAssets.csv and changed >Akinzekeel.BlazorGrid< to >"Akinzekeel,BlazorGrid"< on the first row, adding quotes and a comma in the middle. I excluded string.Split from the unit test, since the naive implementation is expected to produce incorrect results.
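To illustrate why a naive split is expected to fail on the modified row, here is a small sketch in Python. Only the first field comes from the change described above; the trailing `x,y` fields are made-up placeholders, not the real contents of PackageAssets.csv, and both function names are hypothetical.

```python
import csv
import io

# First field as modified in this PR; the remaining fields are placeholders.
LINE = '"Akinzekeel,BlazorGrid",x,y'


def naive_split(line):
    """Split on every comma, ignoring quotes (the string.Split approach)."""
    return line.split(",")


def quote_aware(line):
    """Parse with a quote-aware CSV reader."""
    return next(csv.reader(io.StringIO(line, newline="")))
```

Here `naive_split(LINE)` yields four fields, with the quote characters still attached to the first two, while `quote_aware(LINE)` yields three fields with `Akinzekeel,BlazorGrid` as the first.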
This PR is a bit reactionary to @nietras's submission of Sep in #51, as Sep would be the only library that produces inconsistent results: it doesn't trim or escape quoted values, and I don't see any configuration to enable it. Personally, I see this as fundamental functionality that a CSV library should provide, as without it a user would have to implement their own mechanism to process values. It feels a bit misleading that a parser claiming to be the fastest would produce results that disagree with every other CSV library. I think most users would be surprised by this behavior, even though it is clearly stated in the docs.
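As a rough idea of the kind of mechanism a user would have to write themselves, the sketch below post-processes a raw field. It assumes the parser hands back the field text with its surrounding quotes and doubled quotes intact, which is how the behavior is described above. It is written in Python for illustration, the function name `unquote_field` is hypothetical, and none of it is Sep's API.

```python
def unquote_field(raw: str) -> str:
    """Strip surrounding quotes and collapse doubled quotes in a raw CSV field.

    Assumes the input is the untouched field text as it appeared in the file.
    Fields that are not wrapped in quotes are returned unchanged.
    """
    if len(raw) >= 2 and raw[0] == '"' and raw[-1] == '"':
        # Drop the outer quotes, then turn each "" back into a single ".
        return raw[1:-1].replace('""', '"')
    return raw
```

For example, `unquote_field('"a""b"')` returns `a"b`, and an unquoted value such as `plain` passes through unchanged.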