joelverhagen / NCsvPerf

A test bench for various .NET CSV parsing libraries
https://www.joelverhagen.com/blog/2020/12/fastest-net-csv-parsers
MIT License

Add Cursively to the benchmark suite. #7

Closed: airbreather closed this issue 3 years ago

airbreather commented 3 years ago

My Cursively library is a lot different from most others. Rather than work on a stream of UTF-16 encoded char values, it works directly on the input bytes, leaving it up to the caller to choose how to process the individual fields.

It produces its output in an unusual way as well: rather than returning something like IEnumerable<string>, it requires the caller to provide a series of callbacks (exposed as a visitor-pattern implementation) that accept slices of the original input.
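To make that concrete, a minimal visitor looks something like this (the three callbacks are the real base-class methods discussed further down; the counting logic is just for illustration):

using System;
using Cursively;

public sealed class FieldCountingVisitor : CsvReaderVisitorBase
{
    public long FieldCount { get; private set; }

    // Called zero or more times per field, when a field spans input chunks.
    public override void VisitPartialFieldContents(ReadOnlySpan<byte> chunk)
    {
        // A real consumer would accumulate or otherwise process the slice here.
    }

    // Called exactly once per field, with the final slice of that field.
    public override void VisitEndOfField(ReadOnlySpan<byte> chunk)
    {
        FieldCount++;
    }

    // Called at the end of each record.
    public override void VisitEndOfRecord()
    {
    }
}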

As a result, Cursively has the potential to be the fastest CSV implementation that I've seen, by far, if you are willing to spend the extra effort to integrate it with your application, and you don't need too much of your data to be UTF-16 encoded (since you'd have to transcode the individual fields one-by-one, which is not very CPU cache friendly).

Since practically every field in this benchmark must be fully hydrated as a separate UTF-16 string instance (for reasons unknown to me), I had originally just written a couple of comments on the Reddit thread where your post was linked, and then gave it a rest...

...until @MarkPflug pinged me a while later with his own benchmark suite and tipped me off that the interesting benchmarks in yours actually have enough duplicated strings that a string pool makes a big difference.

So I went ahead and tweaked his string pool to work with a UTF-8 source (and to use XXH64 for hashing, since I had an implementation lying around from when I got bored one day), and now I'm getting some pretty good timings with Cursively, even though "transcode everything to individual UTF-16 string instances" is roughly the worst-case scenario for Cursively when it's compared against solutions that are optimized to assume that this is what you need.
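For anyone curious, the shape of the idea is roughly this (names here are illustrative, not the actual benchmark code; the placeholder hash below stands in for the XXH64 port):

using System;
using System.Collections.Generic;
using System.Text;

public sealed class Utf8StringPool
{
    // Buckets of previously-materialized strings, keyed by a 64-bit hash of
    // the raw UTF-8 bytes.
    private readonly Dictionary<ulong, List<string>> _buckets = new Dictionary<ulong, List<string>>();

    public string GetString(ReadOnlySpan<byte> utf8)
    {
        ulong hash = Hash(utf8);
        if (!_buckets.TryGetValue(hash, out List<string> candidates))
        {
            _buckets.Add(hash, candidates = new List<string>());
        }

        foreach (string candidate in candidates)
        {
            // Guard against hash collisions by comparing the actual bytes.
            if (Utf8Equals(candidate, utf8))
            {
                return candidate;
            }
        }

        // First time we've seen this value: transcode to UTF-16 once, then reuse.
        string s = Encoding.UTF8.GetString(utf8);
        candidates.Add(s);
        return s;
    }

    private static bool Utf8Equals(string candidate, ReadOnlySpan<byte> utf8)
    {
        if (Encoding.UTF8.GetByteCount(candidate) != utf8.Length)
        {
            return false;
        }

        Span<byte> tmp = utf8.Length <= 256 ? stackalloc byte[256] : new byte[utf8.Length];
        Encoding.UTF8.GetBytes(candidate.AsSpan(), tmp);
        return tmp.Slice(0, utf8.Length).SequenceEqual(utf8);
    }

    private static ulong Hash(ReadOnlySpan<byte> data)
    {
        // Placeholder 64-bit FNV-1a; the real code uses an XXH64 port.
        ulong h = 14695981039346656037UL;
        foreach (byte b in data)
        {
            h = (h ^ b) * 1099511628211UL;
        }

        return h;
    }
}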

airbreather commented 3 years ago

Side note, it would be interesting to see how everything stacks up when you add the [GcServer(true)] attribute to PackageAssetsSuite... in theory, solutions that care about throughput like this will also be perfectly willing to tune the GC to maximize throughput, which this tends to do in most cases.
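For concreteness, that's roughly a one-attribute change using BenchmarkDotNet's GcServer attribute (sketch; the suite's contents are unchanged):

using BenchmarkDotNet.Attributes;

// Opting the whole suite into server GC; a separate job could instead run
// workstation and server GC side by side.
[GcServer(true)]
public class PackageAssetsSuite
{
    // ...existing benchmarks unchanged...
}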

joelverhagen commented 3 years ago

Haha, I didn't even realize there was a Reddit post! Great discussion, especially between you and @MarkPflug.

BTW, your implementation is awesome! I'm curious if xxHash is really the best for hash coding. Have you compared the performance to .NET's newish HashCode helper struct? I agree with your point that there's good SIMD potential that could allow a custom hash code method to beat HashCode.

I agree that string[] is imperfect for a CSV line abstraction, especially a low-level, manual data mapping thing like what I am doing. Also, I agree a ReadOnlySpan<byte> is a safer assumption for a parser residing at the bottom of the stack, just above the byte stream. Why assume string if the user is just writing those bytes back somewhere else? Or going to another property type besides string? I was talking to one of my coworkers and we were coming to a very similar conclusion (/cc @loic-sharma).

The main benefit of string[] is that it is really easy to understand and build on top of. I have to say a visitor pattern and writing some accumulation code necessitated by your VisitPartialFieldContents is a tall order for most use cases. (Apologies if I am misunderstanding). Maybe it could be possible to use IReadOnlyList<ReadOnlySpan<byte>> instead of string[], conceding that we must hold the whole line in memory? Should be doable if backed by an expanding buffer similar to MemoryStream but I haven't tried or thought too deeply yet.

I have been thinking about adding a data mapping benchmark with the same PackageAsset model -- essentially letting each library do its own thing to hydrate and even activate each "record" instance instead of forcing a string[]. This is maybe closer to the programming model that most people want to use with CSVs (as shown by the success of CsvHelper). I think @MarkPflug has benchmarks for data mapping... maybe we should put our heads together 😄.

joelverhagen commented 3 years ago

Regarding [GcServer(true)], I think it's an interesting test, but I think it should be a separate benchmark/variant. People may not want to tune in this way (it depends on their application), so both the "workstation GC" and "server GC" numbers are interesting data points.

jzabroski commented 3 years ago

I'm curious if xxHash is really the best for hash coding. Have you compared the performance to .NET's newish HashCode helper struct?

HashCode.Combine is xxHash32. The linked PR has a really good discussion on the intent of the interface as well, such as why static int Combine<T1>(T1 value1) exists.
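For reference, here's the streaming shape of that struct (a sketch; note that a byte span has to be fed to it value-by-value, which is part of why a dedicated span hash like XXH64 can pull ahead):

using System;

static int HashUtf8Field(ReadOnlySpan<byte> field)
{
    // HashCode mixes each added value with xxHash32-style rounds internally.
    var hc = new HashCode();
    foreach (byte b in field)
    {
        hc.Add(b);
    }

    return hc.ToHashCode();
}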

Unrelated, I have a nice list of hashing functions for .NET in my .NET library list: https://github.com/jzabroski/Home/#non-cryptographic-hashing-functions

jzabroski commented 3 years ago

Also, it's a bit off-topic, but I have been meaning to add an answer to https://stackoverflow.com/questions/102742/why-is-397-used-for-resharper-gethashcode-override/

airbreather commented 3 years ago

I'm curious if xxHash is really the best for hash coding. Have you compared the performance to .NET's newish HashCode helper struct?

I went with XXH64 just because I had a port of it lying around. I think XXH3 has properties that would make it favorable on this particular data set (notably, it is supposedly much faster at dealing with short keys, which this data set has a lot of); I just didn't want to bother with all the work to port it for what would likely be only a marginal improvement.

If you want to get really lost in this, check out this blog post written by the author of XXH3 and, I think, of XXH32 and XXH64.

I have to say a visitor pattern and writing some accumulation code necessitated by your VisitPartialFieldContents is a tall order for most use cases. (Apologies if I am misunderstanding).

airbreather/Cursively#21 / airbreather/Cursively#22 are there to layer more conventional ways of consuming the data on top of this, but it's not a high priority for me because it's actually a lot of work to flip the loop on its head, and I think there are plenty of other libraries that would fit better if that's how you wanted the data anyway.

The reason to do that layering would mainly be to just offer a way to use a single library for all your CSV processing instead of having to choose Cursively for cases where runtime efficiency matters more than developer productivity, and some other library for cases where that's reversed. That said, .NET itself is filled with "start with this productive approach, drop down to more efficient code as needed" APIs, so it's not out of the question.

If the issue is just the accumulation from VisitPartialFieldContents, then I could write some intermediate visitor class that does this accumulation for you. The reason why it's like this at the lowest level is to provide the greatest level of flexibility I can think of: it supports fields over 2 GiB; it gives you a way to gracefully stop early, before a malicious input file pushes you into an OutOfMemoryException from trying to expand an accumulator buffer too far; it lets you use pointer math to get offsets from the beginning of an entire memory-mapped file; and I'm sure there's other stuff that I haven't even thought of that only works if you strictly get slices of the original input span(s).
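Sketching it out, that intermediate visitor could look something like this (the base-class methods are the real ones; this subclass and its growth policy are hypothetical):

using System;
using Cursively;

public abstract class WholeFieldVisitorBase : CsvReaderVisitorBase
{
    private byte[] _buffer = new byte[64];
    private int _bufferConsumed;

    public sealed override void VisitPartialFieldContents(ReadOnlySpan<byte> chunk)
    {
        // Grow the accumulator as needed; a real version would cap this to
        // stop a malicious input before an OutOfMemoryException, as noted above.
        if (_bufferConsumed + chunk.Length > _buffer.Length)
        {
            Array.Resize(ref _buffer, Math.Max(_buffer.Length * 2, _bufferConsumed + chunk.Length));
        }

        chunk.CopyTo(_buffer.AsSpan(_bufferConsumed));
        _bufferConsumed += chunk.Length;
    }

    public sealed override void VisitEndOfField(ReadOnlySpan<byte> chunk)
    {
        if (_bufferConsumed == 0)
        {
            // Fast path: the whole field arrived as one slice of the input.
            VisitWholeField(chunk);
        }
        else
        {
            VisitPartialFieldContents(chunk);
            VisitWholeField(_buffer.AsSpan(0, _bufferConsumed));
            _bufferConsumed = 0;
        }
    }

    // Subclasses see each field exactly once, already reassembled.
    protected abstract void VisitWholeField(ReadOnlySpan<byte> fieldData);
}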

Maybe it could be possible to use IReadOnlyList<ReadOnlySpan<byte>> instead of string[], conceding that we must hold the whole line in memory? Should be doable if backed by an expanding buffer similar to MemoryStream but I haven't tried or thought too deeply yet.

The specific idea of IReadOnlyList<ReadOnlySpan<byte>> can't work, because ReadOnlySpan<byte> is a ref struct, but it's fairly straightforward to create intermediate visitor base classes that transform the trio of VisitPartialFieldContents(ReadOnlySpan<byte>) / VisitEndOfField(ReadOnlySpan<byte>) / VisitEndOfRecord() into something else.

For example, we could have a visitor base class where all you have to implement is a method like:

protected override void VisitRecord(ReadOnlySpan<byte> fullLineData, ReadOnlySpan<int> firstIndex)
{
    // firstIndex[n] is the offset of the nth field's first byte in fullLineData
    // there's one extra slot at the end that always holds the value of fullLineData.Length.
    for (int i = 0; i < firstIndex.Length - 1; i++)
    {
        ReadOnlySpan<byte> field = fullLineData[firstIndex[i]..firstIndex[i + 1]];
        DoStuffWith(field);
    }
}

To completely get away from the users having to implement a visitor, I think it would be reasonably doable to write something that converts this to IObservable<T>, since that's a more standard way to model this concept of inverted loop ownership. I actually was going to do that for Cursively at first, but the visitor-pattern approach had a better story for incrementally layering intermediate transformations (CsvReaderVisitorWithUTF8HeadersBase being one example of that), and the most obvious choices for T are all difficult to add stuff to without breaking callers.

In any event, from IObservable<T>, one could use the services of Rx.NET to manipulate it further; one such manipulation is converting it into an IEnumerable<T>, probably no worse than what I would wind up writing on my own.
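For example (GetRecordObservable is hypothetical here, and string[] is just a stand-in for T; ToEnumerable is a stock Rx.NET operator):

using System;
using System.Collections.Generic;
using System.Reactive.Linq;

// Once something exposes records as IObservable<string[]>, pulling them back
// into an ordinary IEnumerable is one extension method away.
IObservable<string[]> records = GetRecordObservable();
IEnumerable<string[]> pullBased = records.ToEnumerable();

foreach (string[] record in pullBased)
{
    Console.WriteLine(string.Join(",", record));
}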

I have been thinking about adding a data mapping benchmark with the same PackageAsset model -- essentially let each library do its own thing to saturate and even activate each "record" instance instead of forcing a string[].

Anything that doesn't strictly require UTF-16 is always a plus for Cursively. Of course, the majority of the .NET ecosystem is built around using UTF-16 for strings, and although there's been lots of great progress towards making UTF-8 more than just "that thing that we have to keep converting to and from when we interact with the rest of the world", UTF-8 still feels like a second-class citizen.