kevin-montrose / Cesil

Modern CSV (De)Serializer
MIT License

Feature Suggestions #1

Open · Bio2hazard opened 4 years ago

Bio2hazard commented 4 years ago

Hi Kevin, I loved your work on Jil, so I'm quite delighted to find that you are working on a high-performance CSV parser right when I was looking for one!

Testing it out, I ran into a few limitations for my (albeit very edge-case) usage. (I need to parse a PB of access logs 🚀)

  1. Provide an option for discarding unmapped columns. The CSV file I am parsing has 2 columns at the end that are just repeats of previous data, but I had to map them anyway, since the serializer would otherwise throw an exception.

  2. This is very edge-case, but I would love to see some very low-level extensibility points, for example a way to get access to the raw ReadOnlySpan<byte> of a given field.

  3. A way to just execute a function/delegate/callback for each field right after it has been processed into the correct type, without the expectation of an output or return type. For example, if column 3 of my CSV file is an int, I'd love to be able to just provide a custom action that aggregates the values, without Cesil expecting me to return a specific type that represents each row.

  4. Again, very edge case, but you could consider adding an optional string interning/reuse feature that calculates the cardinality of string fields based on the first N records and, if the cardinality is below a certain threshold, reuses an already-allocated string. I'm currently using an implementation with a Dictionary<int, string> that's keyed on the GetHashCode() of the string (a minimal sketch follows below); this worked well and resulted in moving a lot of Gen 2 collections to Gen 0, as well as reducing the minimum required memory. On Monday I will experiment with hashing off the raw bytes to avoid the string allocation overhead to begin with.
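
In sketch form, the span-hashing variant looks something like this (StringPool is just an illustrative name, nothing from Cesil; contents are compared as well as hashes, since hash codes can collide):

using System;
using System.Collections.Generic;

sealed class StringPool
{
    private readonly Dictionary<int, string> _pool = new Dictionary<int, string>();

    public string GetOrAdd(ReadOnlySpan<char> chars)
    {
        // string.GetHashCode(ReadOnlySpan<char>) hashes without allocating a string
        var hash = string.GetHashCode(chars);
        if (_pool.TryGetValue(hash, out var cached) && chars.SequenceEqual(cached))
        {
            return cached; // cache hit: reuse the already-allocated string
        }

        var s = new string(chars);
        _pool[hash] = s; // first sighting (or a collision): remember the new string
        return s;
    }
}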

Bio2hazard commented 4 years ago

As a bonus, here are some performance numbers for parsing a rather chunky CSV file (~300 MB decompressed):

BenchmarkDotNet=v0.12.0, OS=ubuntu 18.04
Intel Xeon Platinum 8124M CPU 3.00GHz, 1 CPU, 2 logical cores and 1 physical core
.NET Core SDK=3.1.100
  [Host]     : .NET Core 3.1.0 (CoreCLR 4.700.19.56402, CoreFX 4.700.19.56404), X64 RyuJIT
  Job-ZANPVT : .NET Core 3.1.0 (CoreCLR 4.700.19.56402, CoreFX 4.700.19.56404), X64 RyuJIT

Force=True Server=True

| Method                | Mean     | Error    | StdDev   | Gen 0      | Gen 1     | Gen 2     | Allocated  |
|-----------------------|---------:|---------:|---------:|-----------:|----------:|----------:|-----------:|
| CsvHelperReadAsClass  | 11.527 s | 0.0840 s | 0.0786 s | 13000.0000 | 5000.0000 | 2000.0000 | 3916.21 MB |
| CsvReadByClass        |  9.944 s | 0.1922 s | 0.2057 s | 15000.0000 | 4000.0000 | 1000.0000 | 5349.07 MB |
| CsvHelperReadByColumn |  9.863 s | 0.0652 s | 0.0610 s | 20000.0000 | 7000.0000 | 3000.0000 | 3773.75 MB |
| CesilReadAsClass      |  6.563 s | 0.0251 s | 0.0235 s |  2000.0000 | 1000.0000 |         - |  824.72 MB |
| CursivelyReadByColumn |  3.043 s | 0.0049 s | 0.0046 s |  3000.0000 | 1000.0000 |         - | 1074.77 MB |

Csv in this case is TinyCsvParser (the CsvReadByClass row). The performance of Cesil is already very impressive; memory usage especially is amazingly low! Keep up the great work!

kevin-montrose commented 4 years ago

Cesil remains pre-release and subject to considerable reworking, but as of the latest pre-1.0 NuGet package...

  1. Provide an option for discarding unmapped columns. The CSV file I am parsing has 2 columns at the end that are just repeats of previous data, but I had to map them anyway, since the serializer would otherwise throw an exception.

This is now the default behavior. Additionally, dynamic deserialization allows excess columns to be accessed via index when they are present. The IgnoreExcessColumns(Async) and AllowExcessColumns(Async) tests track this support.
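
A sketch of the dynamic path (hedged: Configuration.ForDynamic, EnumerateAll, and by-index row access are as exposed by recent pre-1.0 packages, and whether excess columns are kept is governed by the Options in use):

using System;
using System.IO;
using Cesil;

class ExcessColumnsExample
{
    static void Main()
    {
        // two named columns, plus one excess trailing column per row
        const string CSV = "A,B\none,two,extra\n";

        var config = Configuration.ForDynamic();
        using (var reader = new StringReader(CSV))
        using (var csv = config.CreateReader(reader))
        {
            foreach (var row in csv.EnumerateAll())
            {
                string a = row.A;      // mapped column, by name
                string extra = row[2]; // excess column, by index
                Console.WriteLine($"{a} / {extra}");
            }
        }
    }
}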

  2. This is very edge-case, but I would love to see some very low-level extensibility points, for example a way to get access to the raw ReadOnlySpan<byte> of a given field.

I don't consider this an edge case; a modern .NET serialization library should embrace the low-cost extensibility afforded by ReadOnlySpan<T> (and friends). The way to do this with Cesil is to provide a custom Parser via the ITypeDescriber on your Options.

Parsers can be chained, so if you just want to inspect (but never act on) the data you can create a Parser that always returns false and then chain to whichever Parser you want to do the actual work.
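
In sketch form, assuming the Parser.ForDelegate and Else shapes from recent pre-1.0 packages:

using System;
using Cesil;

static class InspectingIntParser
{
    // sees the raw characters of the field, then always reports failure
    private static readonly Parser Inspect =
        Parser.ForDelegate<int>((ReadOnlySpan<char> data, in ReadContext ctx, out int value) =>
        {
            Console.WriteLine($"raw field: {data.ToString()}");
            value = default;
            return false; // "fail" so the chained parser does the actual work
        });

    // the parser that actually produces the value
    private static readonly Parser ParseInt =
        Parser.ForDelegate<int>((ReadOnlySpan<char> data, in ReadContext ctx, out int value) =>
            int.TryParse(data, out value));

    public static readonly Parser Chained = Inspect.Else(ParseInt);
}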

  3. A way to just execute a function/delegate/callback for each field right after it has been processed into the correct type, without the expectation of an output or return type. For example, if column 3 of my CSV file is an int, I'd love to be able to just provide a custom action that aggregates the values, without Cesil expecting me to return a specific type that represents each row.

I'll have to think on a decent API for this, and whether it belongs in Cesil. You can kind of jury-rig it using either dynamic or an aggregate row... something like:

public static void Aggregate<T1, T2, T3>(Options opts, TextReader reader, Action<T1> a, Action<T2> b, Action<T3> c)
{
  var row = new AggregateRow<T1, T2, T3>();
  var config = Configuration.For<AggregateRow<T1, T2, T3>>(opts);
  using(var csv = config.CreateReader(reader))
  {
    // reuse the same row instance for every record, invoking the callbacks as we go
    while(csv.TryReadWithReuse(ref row))
    {
       a(row.A);
       b(row.B);
       c(row.C);
    }
  }
}

// C# has no variadic generics, so a real helper would need an overload per arity
private sealed class AggregateRow<T1, T2, T3>
{
  public T1 A { get; set; }
  public T2 B { get; set; }
  public T3 C { get; set; }
}
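
For illustration, hypothetical usage of the sketch above (column types, callbacks, and the file name are made up; the CSV's headers would need to map to A/B/C, e.g. via the ITypeDescriber in use):

long totalBytes = 0;
var distinctPaths = new HashSet<string>();
var rows = 0;

using(var reader = new StreamReader("access-log.csv")) // hypothetical file
{
  Aggregate<long, string, int>(Options.Default, reader,
    bytes => totalBytes += bytes,    // running sum of column 1
    path => distinctPaths.Add(path), // distinct values of column 2
    _ => rows++);                    // row count via column 3
}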

Might make sense to add something like that to CesilUtils.

  4. Again, very edge case, but you could consider adding an optional string interning/reuse feature that calculates the cardinality of string fields based on the first N records and, if the cardinality is below a certain threshold, reuses an already-allocated string. I'm currently using an implementation with a Dictionary<int, string> that's keyed on the GetHashCode() of the string; this worked well and resulted in moving a lot of Gen 2 collections to Gen 0, as well as reducing the minimum required memory. On Monday I will experiment with hashing off the raw bytes to avoid the string allocation overhead to begin with.

In the past I've explored things like this for other serialization libraries, and have found it requires a lot of fine-tuning to a specific use case. Cesil enables providing your own interning by specifying a Parser for string members, which can do whatever it wants to avoid or reuse allocations.
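
In sketch form, reusing a cache like the StringPool sketched earlier in the thread (again assuming Parser.ForDelegate; attaching the Parser to particular string members then goes through the ITypeDescriber on your Options):

var pool = new StringPool();

var interningStringParser =
    Parser.ForDelegate<string>((ReadOnlySpan<char> data, in ReadContext ctx, out string value) =>
    {
        value = pool.GetOrAdd(data); // reuses a cached string when contents match
        return true;
    });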

It's also worth noting that .NET might get some automatic support for string de-duping (at least in older generations), which may change the calculus on whether manually de-duping in app code is worth it.

kevin-montrose commented 4 years ago

Oh, and I've got some "real" benchmarks (previously I had a crazy one that could only ever run on my personal machine) in the repo now - so Cesil may be a tad faster in 0.3.0 too.

The point of Cesil is to be pretty flexible and extensible and "modern," not to be the fastest, so I'm not planning to trade flexibility for performance. That said, a properly designed .NET library in 2020 ought to be quite fast by default.

lstefano71 commented 3 years ago

  2. This is very edge-case, but I would love to see some very low-level extensibility points, for example a way to get access to the raw ReadOnlySpan<byte> of a given field.

I don't consider this an edge case; a modern .NET serialization library should embrace the low-cost extensibility afforded by ReadOnlySpan<T> (and friends). The way to do this with Cesil is to provide a custom Parser via the ITypeDescriber on your Options.

Would you happen to have a sample program which shows how one would go about providing a custom parser? I tried looking through the tests, but apart from their sheer quantity, as far as I could tell they are either low-level tests or go all the way up to the reader. Is one supposed to implement a new ITypeDescriber? Or is it possible to simply extend the default one? Thank you in advance.