joelverhagen / NCsvPerf

A test bench for various .NET CSV parsing libraries
https://www.joelverhagen.com/blog/2020/12/fastest-net-csv-parsers
MIT License

Increase string pool limit #12

Closed · MarkPflug closed this 3 years ago

MarkPflug commented 3 years ago

Wow, lots of action in the CSV Battle Royale recently!

Looks like Cursively took the crown by using my own tricks against me. I think I can take it back by increasing the string pool limit... at least that does the trick on my machine. I had landed on 64 just by eyeballing the contents of the test CSV, but it turns out that there are some larger strings in there too.
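
For context, the pool is just a cache keyed by the field's characters, with a length cutoff. Something like this rough sketch of the idea (not the exact code in Sylvan): strings over the limit skip the pool entirely, which is why a limit of 64 missed the longer repeated values in the test file.

```csharp
using System;
using System.Collections.Generic;

// Minimal sketch of a length-limited string pool. Values at or below
// maxLength are deduplicated; longer values are always allocated fresh.
class LimitedStringPool
{
    private readonly int maxLength;
    private readonly Dictionary<int, List<string>> buckets = new();

    public LimitedStringPool(int maxLength) => this.maxLength = maxLength;

    public string GetString(ReadOnlySpan<char> chars)
    {
        // Values over the limit bypass the pool, so repeated long values
        // still cost an allocation each time they appear.
        if (chars.Length > maxLength)
            return chars.ToString();

        int hash = string.GetHashCode(chars);
        if (!buckets.TryGetValue(hash, out var bucket))
            buckets[hash] = bucket = new List<string>();

        foreach (var pooled in bucket)
            if (chars.SequenceEqual(pooled))
                return pooled; // reuse the existing allocation

        var s = chars.ToString();
        bucket.Add(s);
        return s;
    }
}
```

The cutoff is there because long strings are both less likely to repeat and more expensive to hash and compare, so pooling them tends to cost more than it saves.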

Are you regretting putting these benchmarks together yet?! I'm sure it's a pain to keep rerunning these and updating the blog, so don't bother unless/until some other interesting contenders appear.

joelverhagen commented 3 years ago

> Are you regretting putting these benchmarks together yet?! I'm sure it's a pain to keep rerunning these and updating the blog, so don't bother unless/until some other interesting contenders appear.

🤣 a bit, but I've learned a lot by looking at the implementations. It's a bit of Cunningham's Law at work, in a way.

I think I made a mistake generating the big test CSV by duplicating a small data set. You, Cursively, and CsvHelper (as of https://github.com/joelverhagen/NCsvPerf/pull/8) have really taken advantage of that -- understandable. I have a real 1 million line file lying around that still has duplication, but certainly less. Maybe I'll swap that in for the next iteration.

MarkPflug commented 3 years ago

> have really taken advantage of that

Yeah, I felt a little guilty when I used pooling to take the lead; it definitely benefits greatly from how you construct the data set. In a typical, real data set you'd be more likely to enable pooling to reduce memory usage, but the duration would likely increase slightly. Of course, it all depends on just how much duplication there is in the file. CSVs do tend to be very repetitive by their nature.

I think you'd mentioned in another thread that a data mapping benchmark would be good, and I agree. Your particular data set appears to be mostly strings, dates, versions, and guids. Those could certainly be mapped to strongly typed members rather than all being processed as strings. I know that not all of the current set of libraries support anything beyond strings. mgholam, for example, as fast as it is, would still have to process everything via strings, where other libraries (like mine) can read strongly typed values directly out of the internal character buffer. And CsvHelper would shine in that scenario, because data binding/mapping is a huge part of what that library offers.
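
To make that concrete, here's roughly what "reading typed values directly out of the character buffer" looks like using the span-based TryParse overloads in modern .NET. The record shape is just illustrative, not the benchmark's actual schema:

```csharp
using System;

// Hypothetical record shape for the package data set; the benchmark's
// actual columns may differ.
record PackageAsset(string Id, Version Version, Guid ScanId, DateTime Scanned);

static class SpanParsing
{
    // Each value is parsed straight from the field's character span, so no
    // intermediate string is ever allocated for the non-string columns.
    public static Guid ParseGuid(ReadOnlySpan<char> field) =>
        Guid.TryParse(field, out var g)
            ? g : throw new FormatException("invalid guid");

    public static Version ParseVersion(ReadOnlySpan<char> field) =>
        Version.TryParse(field, out var v)
            ? v : throw new FormatException("invalid version");

    public static DateTime ParseDateTime(ReadOnlySpan<char> field) =>
        DateTime.TryParse(field, out var dt)
            ? dt : throw new FormatException("invalid datetime");
}
```

A string-only API forces callers to materialize a string for every field first and then parse it, paying an extra allocation per non-string value.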

I had to look up Cunningham's Law; I hadn't heard of it before, and yes, I think you may have fallen into its trap. Funny thing: I actually met Ward Cunningham once a few years back. Nice guy.

joelverhagen commented 3 years ago

Thanks Mark!