joelverhagen / NCsvPerf

A test bench for various .NET CSV parsing libraries
https://www.joelverhagen.com/blog/2020/12/fastest-net-csv-parsers
MIT License
71 stars 14 forks

adjust to use recordparser in parallel #45

Closed leandromoh closed 2 years ago

leandromoh commented 2 years ago

Eventually this code, as well as all the supporting methods, will be integrated into the library. I know there is a lot of boilerplate here that users would not write themselves in a normal use case, but since this performance can be achieved by the user when it is really needed, I think it is a valid benchmark after all.

leandromoh commented 2 years ago

@joelverhagen if possible, I would like to see a benchmark result before merging.

joelverhagen commented 2 years ago

I know there is a lot of boilerplate here that users would not write themselves in a normal use case, but since this performance can be achieved by the user when it is really needed, I think it is a valid benchmark after all.

No worries at all :) I'll let you, as the owner of this test, draw the line between code that is written by the user vs. what's in the library.

if possible, I would like to see a benchmark result before merging.

Latest results (before merge) are available here: https://www.joelverhagen.com/blog/2020/12/fastest-net-csv-parsers

Looks like the unit test is failing. Could you take a look?

joelverhagen commented 2 years ago

BenchmarkDotNet=v0.13.1, OS=Windows 10.0.22000
AMD Ryzen 9 3950X, 1 CPU, 32 logical and 16 physical cores
.NET SDK=6.0.100
  [Host]     : .NET 6.0.0 (6.0.21.52210), X64 RyuJIT
  Job-VNHKCQ : .NET 6.0.0 (6.0.21.52210), X64 RyuJIT

InvocationCount=1  IterationCount=6  LaunchCount=1  
UnrollFactor=1  WarmupCount=2  

BEFORE

| Method          | Mean    | Error    | StdDev   | Gen 0      | Gen 1      | Gen 2     | Allocated |
|-----------------|--------:|---------:|---------:|-----------:|-----------:|----------:|----------:|
| CsvHelper       | 2.340 s | 0.0300 s | 0.0107 s | 33000.0000 | 18000.0000 | 3000.0000 |    261 MB |
| RecordParser    | 1.781 s | 0.0304 s | 0.0108 s | 33000.0000 | 18000.0000 | 3000.0000 |    261 MB |
| Sylvan_Data_Csv | 1.370 s | 0.0549 s | 0.0196 s | 33000.0000 | 18000.0000 | 3000.0000 |    261 MB |

AFTER

| Method          | Mean    | Error    | StdDev   | Gen 0      | Gen 1      | Gen 2     | Allocated |
|-----------------|--------:|---------:|---------:|-----------:|-----------:|----------:|----------:|
| CsvHelper       | 2.323 s | 0.0267 s | 0.0095 s | 33000.0000 | 18000.0000 | 3000.0000 |    261 MB |
| RecordParser    | 1.300 s | 0.0476 s | 0.0124 s | 37000.0000 | 19000.0000 | 3000.0000 |    381 MB |
| Sylvan_Data_Csv | 1.339 s | 0.0377 s | 0.0098 s | 33000.0000 | 18000.0000 | 3000.0000 |    261 MB |

joelverhagen commented 2 years ago

Let me know if you want to merge as-is.

electricessence commented 2 years ago

Wow. Loving the competition here! This is such a cool project. Go RecordParser!

electricessence commented 2 years ago

@joelverhagen @leandromoh Isn't this an unfair comparison? Am I missing something? If you parallelize one lib, then you'll need to offer parallel equivalents to all libs. Some will do better than others at parallel processing. The implementation functions should be as close to identical as possible to ensure a true apples to apples comparison. That doesn't stop anyone from adding some queuing or partitioning under the hood but in general isn't this benchmark really asking a simple question? "Which is faster at synchronous processing?"

Thoughts?

leandromoh commented 2 years ago

@joelverhagen @leandromoh Isn't this an unfair comparison? Am I missing something? If you parallelize one lib, then you'll need to offer parallel equivalents to all libs. Some will do better than others at parallel processing.

Not necessarily. Libraries of this kind usually process files sequentially on a single thread. They are not thread-safe and may not be able to parse in parallel. So you can't get the "parallel equivalents to all libs" because they may not exist, but of course someone can open a PR for some library to run in parallel if the feature is an option.

The implementation functions should be as close to identical as possible to ensure a true apples to apples comparison. That doesn't stop anyone from adding some queuing or partitioning under the hood but in general isn't this benchmark really asking a simple question? "Which is faster at synchronous processing?"

Thoughts?

At the beginning of the post it is said that the benchmark has a "Specific purpose tested: My goal was to find the fastest low-level CSV parser." It seems to me that we don't care how things are done, only how fast they are done.

leandromoh commented 2 years ago

Let me know if you want to merge as-is.

I did a new commit #7d899d0. I imagine its result may be a bit faster than the previous one.

electricessence commented 2 years ago

Looking at the 'implementation' code in the commit, I find it overly complex compared to the others. You might as well change the string.split implementation too, to facilitate parallelizing the split operation. I have a variety of async and even parallel-friendly methods, but I haven't asked to change the implementation because I think it defeats the purpose of the benchmark.

I challenge this implementation: a quoted string with a newline character is an outlier that easily breaks line-segmented processing. I rewrote my entire lib to handle these cases, and when you can't rely on a newline character as a hard segmentation point, it's nearly impossible to parallelize parsing. You can, however, move the string-to-type conversion to other threads, which can definitely improve the real-world result.
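
As a minimal illustration of the problem (a toy sketch, not code from the PR or any of these libraries): a naive newline split tears a quoted field apart, while a scan that tracks quote state treats the embedded newline as data.

```csharp
using System;

class QuotedNewlineDemo
{
    static void Main()
    {
        // A header plus one record whose second field contains a newline.
        string csv = "id,comment\n1,\"line one\nline two\"\n";

        // Naive line segmentation tears the quoted field apart:
        // it reports three lines where there are only two records.
        string[] naive = csv.TrimEnd('\n').Split('\n');
        Console.WriteLine(naive.Length); // 3

        // A quote-state-aware scan (escaped "" quotes ignored for brevity)
        // treats the newline inside quotes as data, not a record break.
        int records = 0;
        bool inQuotes = false;
        foreach (char c in csv)
        {
            if (c == '"') inQuotes = !inQuotes;
            else if (c == '\n' && !inQuotes) records++;
        }
        Console.WriteLine(records); // 2 (header + one data record)
    }
}
```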

All that said, if implementing the parser takes multiples more lines of code than just using a straight synchronous method, then I don't think that implementation is valid for these benchmarks. I would ask @leandromoh to provide a separate method that handles the partitioning etc. so that the implementation is as simple as any other. Using multiple threads to speed things up is totally ok, but if the consumer has to do more work to make it happen, it doesn't seem appropriate here.

electricessence commented 2 years ago

Sorry, I have a hard time not seeing this as straight-up cheating: https://github.com/joelverhagen/NCsvPerf/pull/45/commits/1fa86d43d826bac80b504448ba25efe3148148a5. If you're going to do this, you should also be doing it for string.split as the baseline. @leandromoh, does your parallel loop routine work with quoted strings? Again, I'd be totally happy if the parallel implementation existed in your lib and not in the implementation code.

joelverhagen commented 2 years ago

Glad to see we have a lively discussion here 😃. Let's do our best to remain objective, civil, and perhaps try to clarify any useful goals that should come of this project.

I see both perspectives here. On the one hand, parallelism is an outlier because (as far as I know) the other libraries haven't used it yet. Perhaps this is an indication of an opportunity for other libraries as opposed to a cheat from RecordParser. In general, I think parallelism seems like a fair tool to employ just like SIMD was novel and fair for Sylvan, and spans were fair for many of the first fast ones. I never explicitly thought of single-threaded as a requirement. Maybe it is? Maybe not? Not sure at this time 😄.

That being said, I do think there are some problems with this PR that @electricessence has alluded to or I've come to after reading this thread and thinking.

  1. I think the main problem is that there is too much code in this adapter. If the parallel feature were implemented in a clean, encapsulated way in a NuGet package, other folks could actually benefit in their projects. In short, I'll only move forward with this PR if the "tricks" are moved into the library and this adapter for the test suite is small, like the other ones. Example: Sylvan_Data_Csv.cs.
    • Any library that's in first place will be what folks try if they read the blog post, so it should be easy for them to reproduce the same great wins without copying too much code. @leandromoh, I hope you understand. Would it be too hard to put this in the library and even put the behavior behind a parallel == true or threadCount > 1 option?
  2. The .Result (sync over async) line makes me worried; a sketch of the hazard follows this list. Could this cause problems in some applications? I could be totally off here but wanted to hear what you think.
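
For reference, the general hazard with that pattern looks like this (a generic sketch, not the code in this PR):

```csharp
using System;
using System.Threading.Tasks;

class SyncOverAsyncSketch
{
    static async Task<string> ReadChunkAsync()
    {
        await Task.Delay(10); // stand-in for real asynchronous I/O
        return "a,b,c";
    }

    static void Main()
    {
        // Sync over async: .Result blocks the calling thread until the
        // task finishes. Under a SynchronizationContext (UI threads,
        // classic ASP.NET) or a starved thread pool, the continuation
        // can need the very thread that is blocked, which deadlocks.
        string chunk = ReadChunkAsync().Result;
        Console.WriteLine(chunk);

        // Inside an async method, awaiting keeps the thread free:
        //   string chunk = await ReadChunkAsync();
    }
}
```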

For context, both my employment (NuGet.org team lead) and personal interest (my blog/side projects) align with making .NET developers more productive. I think @electricessence brings up some great points that we should consider, @leandromoh.

Using multiple threads to speed things up is totally ok, but if the consumer has to do more work to make it happen, it doesn't seem appropriate here.

@electricessence -- do you know if BenchmarkDotNet has a way to reliably quantify this? If what you're saying is true, we should be able to measure total CPU time (not just wall clock time) and see a difference. I wonder if there is a carbon footprint measure 😉.

I'll think more about this. I'm wondering what others think about this discussion. Maybe a succinct way of putting it is this:

Do you expect a CSV parsing library to be multi-threaded or to even have that option?

Thanks for the interest in the project!

electricessence commented 2 years ago

@joelverhagen Love your response. I hope it's clear that I'm all for leveraging multiple threads or whatever is necessary to win in a performance competition. I just don't think what makes a winner should exist in this project. It should exist in each competitor's code.

do you know if BenchmarkDotNet has a way to reliably quantify this?

I have no idea.

Thanks for the interest in the project!

Love it.

Do you expect a CSV parsing library to be multi-threaded or to even have that option?

Parallelizing this type of parsing isn't easy. I suggest we have a separate set of benchmarks for it. And as I said earlier, you should set up a parallel implementation of string.split as a baseline. Also, as I mentioned in another ticket, we should have tests that compare performance on more difficult data, like quoted fields containing delimiters and newlines.

I still think it's fair to only compare string parsing performance. But it might be impressive to have another competition that includes parsers like Sylvan that are incredibly fast at providing typed results.

airbreather commented 2 years ago

Do you expect a CSV parsing library to be multi-threaded or to even have that option?

Responding to just this question, I certainly would be interested in seeing side-by-side benchmarks, if some library can take advantage of multiple CPU cores to improve throughput for a single operation. In theory, there absolutely can be ways to leverage multiple parallel threads, especially in a NUMA situation with a large enough file, especially if you can assume RFC 4180 rules when looking for split points through the file, especially if users can accept records coming in out-of-order.

That said, throughput for a single operation isn't everything, and I think there's still a place for a benchmark that compares total CPU time taken to transform a single CSV file into something usable.

electricessence commented 2 years ago

One more note:

Benchmarking async/parallel processing is magnitudes more difficult than synchronous. There are way more factors to consider, as contention is a monkey wrench: a benchmark might perform amazingly while the real world is terrible, or vice versa. I've done quite a bit of benchmarking around Streams and StringReaders lately where the BenchmarkDotNet results for the asynchronous versions were terrible compared to their synchronous counterparts, but when I ran my own benchmarks (more real-world-like), the asynchronous/parallel versions left the synchronous ones in the dust.

In summary, benchmarking parallel/async can be misleading.

MarkPflug commented 2 years ago

I think the primary issue I have with this implementation is that I don't think it handles CSV fully. I don't think it would correctly parse quoted fields containing newlines. Perhaps I'm wrong about that, though; I didn't study it that closely. This project has never included any measure of correctness, and I think at least a couple of the libraries currently included don't fully handle the standard CSV spec. I also think the implementation needs to be moved into the library to make it realistic.

I'm not entirely convinced that CSV can be trivially parsed in parallel. The interpretation of every newline character and delimiter is dependent on every previous character in the file. That's not to say that it can't be done optimistically, with the assumption that the file doesn't contain quoted newlines, which are indeed uncommon in practice. However, this code doesn't do that.
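
To make that dependency concrete (a toy sketch, not code from any of the libraries here): the very same chunk of bytes yields different record boundaries depending on whether a quote was open somewhere earlier in the file, which is exactly what defeats splitting at an arbitrary offset.

```csharp
using System;

class ChunkAmbiguityDemo
{
    // Counts record breaks in a chunk given the quote state inherited
    // from everything that came before it (escaped "" quotes ignored).
    static int CountRecordBreaks(string chunk, bool inQuotes)
    {
        int breaks = 0;
        foreach (char c in chunk)
        {
            if (c == '"') inQuotes = !inQuotes;
            else if (c == '\n' && !inQuotes) breaks++;
        }
        return breaks;
    }

    static void Main()
    {
        // The same bytes grabbed from the middle of a file...
        string chunk = "b\"\nc,d\n";

        // ...parse differently depending on unseen earlier context:
        // if no quote was open, the '"' opens a field and both newlines
        // are content; if a quote was open, the '"' closes it and both
        // newlines are record breaks.
        Console.WriteLine(CountRecordBreaks(chunk, inQuotes: false)); // 0
        Console.WriteLine(CountRecordBreaks(chunk, inQuotes: true));  // 2
    }
}
```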

@leandromoh submitted a parallel CSV writing implementation to the benchmark set that I maintain. I accepted the PR, but commented that it felt unrealistic, because (much like this PR) the entire "implementation" is external to his library. Nobody would want to use a library that required them to do that. I also felt that it was important to highlight that the extra speed comes at the cost of using more CPU cores, and I thought it would be fair to try to reflect that in the results. There is a distinction between speed and efficiency, and so I implemented a "CpuDiagnoser" for BDN, which shows that while the parallel approach is twice as fast, it comes at the cost of 2x overall CPU utilization. Perhaps that's a tradeoff that someone would be willing to make. I should note that the CpuDiagnoser only works for InProc runs, and probably isn't the most reliable measurement.
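
For a rough sense of what such a measurement shows (an illustrative sketch, not the actual CpuDiagnoser), wall-clock time and total processor time can be compared directly with Process.TotalProcessorTime:

```csharp
using System;
using System.Diagnostics;
using System.Threading.Tasks;

class CpuVersusWallClock
{
    static void Measure(string label, Action work)
    {
        var proc = Process.GetCurrentProcess();
        TimeSpan cpuBefore = proc.TotalProcessorTime;
        var wall = Stopwatch.StartNew();

        work();

        wall.Stop();
        proc.Refresh();
        TimeSpan cpu = proc.TotalProcessorTime - cpuBefore;
        Console.WriteLine($"{label}: wall {wall.ElapsedMilliseconds} ms, cpu {cpu.TotalMilliseconds:F0} ms");
    }

    // Busy loop standing in for parsing work.
    static void Spin(long iterations) { for (long i = 0; i < iterations; i++) { } }

    static void Main()
    {
        // Sequential: wall time and CPU time roughly agree.
        Measure("sequential", () => Spin(800_000_000));

        // Parallel: wall time drops, total CPU time does not --
        // faster, but not more efficient.
        Measure("parallel", () => Parallel.For(0, 4, _ => Spin(200_000_000)));
    }
}
```

The parallel case shows roughly the same (or higher) CPU total at a fraction of the wall time, which is the speed-versus-efficiency distinction being made here.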

If leandromoh can implement a correct multithreaded implementation, and it is encapsulated within his library so that it is realistic to expect someone to use it, then sure, accept the PR. But I have strong feelings about an incorrect parser taking the top spot. Correctness should always take priority over performance. Unfortunately, the current test data does nothing to test the completeness/correctness of the implementations.

airbreather commented 2 years ago

I'm not entirely convinced that CSV can be trivially parsed in parallel. The interpretation of every newline character and delimiter is dependent on every previous character in the file. That's not to say that it can't be done optimistically, with the assumption that the file doesn't contain quoted newlines, which are indeed uncommon in practice.

Well, a CSV file can just as easily be parsed backwards as forwards, assuming that it conforms to RFC 4180 (probably even if it doesn't, I just haven't thought about all the edge cases yet). So — assuming you have a big enough file with enough records for this to be worthwhile and that you aren't already bottlenecked by I/O or memory bandwidth — you can get roughly a 2x speedup by having one thread start at the beginning and one thread start at the end.

Synchronization isn't even all that much of a challenge for this approach, since you can have two producers inserting into a producer/consumer queue, and have one consumer that marks the queue complete once it dequeues a record that was already dequeued from the other direction (which is relatively straightforward to detect if you also keep metadata with each record that says which direction it came from and how far through the file it was).
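
A minimal sketch of that queue shape (illustrative only; it pretends the record boundaries are already known, which is of course the hard part of the real problem):

```csharp
using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Threading.Tasks;

class TwoEndedParseSketch
{
    static void Main()
    {
        // Pre-split records stand in for the real tokenizing work each
        // producer would do from its own end of the file.
        string[] records = { "r0", "r1", "r2", "r3", "r4", "r5" };

        var queue = new BlockingCollection<(int Index, string Record)>();

        // One producer walks forward from the start of the file...
        Task.Run(() =>
        {
            for (int i = 0; i < records.Length; i++)
                queue.Add((i, records[i]));
        });

        // ...and one walks backward from the end.
        Task.Run(() =>
        {
            for (int i = records.Length - 1; i >= 0; i--)
                queue.Add((i, records[i]));
        });

        // Single consumer: the per-record metadata (here just the index)
        // reveals when the producers have crossed -- the first repeat
        // means every record has arrived from one direction or the other.
        var seen = new HashSet<int>();
        var results = new List<string>();
        foreach (var (index, record) in queue.GetConsumingEnumerable())
        {
            if (!seen.Add(index))
                break; // both directions covered this record: done
            results.Add(record); // arrival order is not file order
        }

        Console.WriteLine(string.Join(", ", results));
    }
}
```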

It's probably counterproductive to try to do anything fancier than that for a general-purpose library. The format is actively hostile to splitting a file at an arbitrary byte offset and trying to do anything useful from there. The best you could probably do would be to try parsing each section both ways (one if there's a quoted field leading into it, one if there's not), and then stitching together the results from the sections.

Most likely, in the situations where you are remotely interested in this, you can make some simplifying assumptions about the file format that can help you do much better than a general-purpose routine anyway, and so my intuition is that there is not much of a point in pursuing this further than a thought experiment (but I would be interested to hear if someone can prove me wrong).

MarkPflug commented 2 years ago

The format is actively hostile to splitting a file at an arbitrary byte offset

That's what I meant by trivially parallelizable. You can't find a newline in the middle of a file and start parsing from there assuming it is a new record. This prevents splitting the file into N segments for parallelization. Your idea of parsing backwards is interesting (hadn't considered that), but also not what I'd consider trivial.

leandromoh commented 2 years ago

Perhaps this (parallelism) is an indication of an opportunity for other libraries as opposed to a cheat from RecordParser. In general, I think parallelism seems like a fair tool to employ just like SIMD was novel and fair for Sylvan, and spans were fair for many of the first fast ones.

I completely agree

Any library that's in first place will be what folks try if they read the blog post, so it should be easy for them to reproduce the same great wins without copying too much code. @leandromoh

I agree, we should provide not just fast code but also clean APIs for users, and hide all the complexity from them. As I said, eventually all this code, as well as all the supporting methods, will be integrated into the library. Thus I didn't think it would be a problem to add it to the benchmark first.

Would it be too hard to put this in the library and even put the behavior behind a parallel == true or threadCount > 1 option?

No, it is not hard at all. Maybe I'll do it for the next release.
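
For illustration, an option-based surface could look something like this; every name here is hypothetical, not RecordParser's real API:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical names throughout -- this is NOT RecordParser's actual API,
// just a sketch of how a parallel option could hide the partitioning.
class CsvReaderOptions
{
    public bool Parallel { get; set; } = false;
    public int ThreadCount { get; set; } = Environment.ProcessorCount;
}

static class HypotheticalCsvReader
{
    public static IEnumerable<string[]> ReadRecords(string csv, CsvReaderOptions options)
    {
        // Toy record/field split standing in for a real quote-aware parser.
        var lines = csv.Split(new[] { '\n' }, StringSplitOptions.RemoveEmptyEntries);
        return options.Parallel
            ? lines.AsParallel()
                   .AsOrdered()
                   .WithDegreeOfParallelism(options.ThreadCount)
                   .Select(line => line.Split(','))
            : lines.Select(line => line.Split(','));
    }
}
```

With something like that in the package, the benchmark adapter shrinks to a single ReadRecords call with Parallel = true, which is the small adapter being asked for above.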

I'll only move forward with this PR if the "tricks" are moved into the library and this adapter for the test suite is small

No problem, I'm going to close this PR, and as soon as the adapter code is inside the library I'll open another one!

Thanks everyone for all the insights!

jzabroski commented 2 years ago

Part of the reason this project doesn't assess correctness and full compliance is that Microsoft has comically implemented many CSV parsers, all inferior to open-source ones, and refuses to acknowledge it.