Closed bytefish closed 3 years ago
Wow, awesome approach! Just to make sure I understand correctly: TextReader/File.ReadLines/string is split into lines (one string per line in source data). In other words, the only non-parallel work is splitting a block of bytes/string into lines prior to any CSV-specific work. The rest (the real "CSV parsing" step) is parallel.
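The serial-split / parallel-parse pipeline described here can be sketched roughly as follows. This is a minimal illustration, not TinyCsvParser's actual code; `ParallelCsv.ParseAll` is a hypothetical helper and the comma split stands in for the real CSV-specific work.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class ParallelCsv
{
    // Serial input (e.g. File.ReadLines) in, parallel per-line parsing out.
    public static string[][] ParseAll(IEnumerable<string> lines) =>
        lines
            .AsParallel()
            .AsOrdered()                      // preserve source order
            .Select(line => line.Split(','))  // the parallelizable "CSV parsing" step
            .ToArray();
}

class Demo
{
    static void Main()
    {
        // Stand-in for File.ReadLines(path): splitting into lines is the only serial step.
        var lines = new[] { "a,b,c", "1,2,3", "4,5,6" };
        var records = ParallelCsv.ParseAll(lines);
        Console.WriteLine(records.Length);
    }
}
```

Only the enumeration of lines is sequential here; everything after `AsParallel()` is fanned out across cores.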
Very cool! Nice work making your library support this approach out of the box!
You cannot parse multi-line CSV data.
I think this could be mitigated by implementing a lightweight, custom line splitter that is aware of multi-line CSV. Perhaps this could be done in a way that isn't much more expensive than however TextReader.ReadLine() is implemented.
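Such a splitter might look like the sketch below: it only treats a newline as a record boundary when the reader is outside double quotes (RFC 4180 style). This is a hypothetical illustration, not TinyCsvParser code; `CsvRecordReader.ReadRecord` is an invented name.

```csharp
using System;
using System.IO;
using System.Text;

static class CsvRecordReader
{
    // Reads one logical CSV record, allowing quoted fields to span lines.
    public static string ReadRecord(TextReader reader)
    {
        var sb = new StringBuilder();
        bool inQuotes = false;
        int c;
        while ((c = reader.Read()) != -1)
        {
            char ch = (char)c;
            if (ch == '"') inQuotes = !inQuotes; // an escaped "" toggles twice, which nets out
            if (ch == '\n' && !inQuotes)
            {
                // strip a preceding '\r' from CRLF line endings
                if (sb.Length > 0 && sb[sb.Length - 1] == '\r') sb.Length--;
                return sb.ToString();
            }
            sb.Append(ch);
        }
        return sb.Length > 0 ? sb.ToString() : null;
    }
}

class Demo
{
    static void Main()
    {
        var reader = new StringReader("id,text\n1,\"line one\nline two\"\n");
        string record;
        while ((record = CsvRecordReader.ReadRecord(reader)) != null)
            Console.WriteLine("[" + record + "]");
    }
}
```

The per-character scan does little more work than a plain ReadLine, so the serial stage should stay cheap.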
Yes! Could you do this with the other parsers? Maybe definitely yes!
I think this approach would most trivially be applied to parsers that don't maintain their own buffer and already operate line by line. If the parser takes bytes and emits a string[] per line all in one function, it would be harder to have the "serial line splitting + parallel CSV parsing" approach.
In conclusion, thanks a bunch for bringing this to my attention. I think my blog post waved hands a bit concerning the pipeline that takes bytes and emits a materialized list of record objects. Perhaps a better wording would have focused solely on the CSV tokenization/parsing step. That was my intent since it allows clever parallelization, activation, mapping, etc. all built on top of a fast tokenization routine.
@joelverhagen Yes, I have never benchmarked whether it really makes sense to do the tokenizing in parallel. If not, one could indeed make the Tokenizer multi-line aware and from there on do the mapping in parallel.
We can close this issue, because there is no issue. ✌️
Hi @joelverhagen!
Thanks for this post! Your results for TinyCsvParser are correct. The Line Tokenizer I have implemented is inefficient, and I knew about this. Very cool that it's indeed the slowest implementation in your benchmarks. 🥇
I think it is doing far too many allocations and has a weird approach. I really should have paid way more attention in lectures on Finite State Machines and Compilers!
So if you feel like it, you are (very, very) welcome to replace my implementation with your version and make a Pull Request to TinyCsvParser. You just need to plug it in here (the tokenizer may be shared between threads, so better not to share state):
So don't take any of this as some sort of criticism, your results are correct.
If this is so slow, what's the use case for the library then?
So say you are reading large files: that means you probably have dozens of cores idling while parsing a file sequentially. How long does a modern SSD take to read such a tiny file with a million lines? Maybe a millisecond? Now put some object mapping, conversions and validation on top of the other parsers and you'll see the overhead.
At some point you'll notice: Reading the file and tokenizing it isn't the bottleneck anymore.
TinyCsvParser uses PLINQ in its pipeline, so it's trivial to parallelize the whole thing and do the result mapping and whatever you want with the data in parallel. This of course has one severe drawback: You cannot parse multi-line CSV data. But if you are OK with that, cool.
And what's important here: You can easily switch the Tokenizer to a different implementation.
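As a rough sketch of what such a swap might look like (assuming the tokenizer contract is a simple string-in, string[]-out interface; the exact shape of TinyCsvParser's ITokenizer may differ):

```csharp
using System;

// Assumed shape of the tokenizer contract, for illustration only.
public interface ITokenizer
{
    string[] Tokenize(string input);
}

// Stateless, so a single instance can safely be shared between PLINQ threads.
public class StringSplitTokenizer : ITokenizer
{
    private readonly char[] separators;

    public StringSplitTokenizer(params char[] separators)
    {
        this.separators = separators;
    }

    public string[] Tokenize(string input) => input.Split(separators);
}

class Demo
{
    static void Main()
    {
        var tokenizer = new StringSplitTokenizer(';');
        var tokens = tokenizer.Tokenize("Philipp;Wagner;1986/05/12");
        Console.WriteLine(tokens.Length);
    }
}
```

Because the tokenizer keeps no per-call state, swapping implementations is just a matter of passing a different instance into the pipeline.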
Switching the Line Tokenizer to a string.Split implementation yields something around 4400 ms on your test data, for tokenizing the line, converting some properties to a DateTime and doing the object mapping. Is this unfair, because I am utilizing X cores and the other parsers don't? Yes! Could you do this with the other parsers? Maybe definitely yes!
The StringSplitTokenizer yields:
Using a custom RFC4180 CustomTokenizer yields:
Here is the full test:
And the MeasurementUtils: