mgholam / fastCSV

Fast CSV reader writer in c#
https://www.codeproject.com/Articles/5255318/fastCSV
MIT License

Code for the benchmarks #1

Closed — KPixel closed this issue 4 years ago

KPixel commented 4 years ago

Hi,

Would you please share the code you used to benchmark all the CSV parsers?

I did some research on them a few months ago, and NReco was a lot faster than the numbers in the README.md suggest...

mgholam commented 4 years ago

It was the test project loading the 4-million-row CSV (I don't have the code, but you can easily write it again). The numbers are for my PC; yours will vary, but the relative differences should be about the same. Let me know how the fastCSV test runs on your machine.

KPixel commented 4 years ago

I was asking for your test to understand how you went about it.

I have some as well, but they are probably quite different. For example: since the goal is to test the parser, I would first load the whole CSV into memory (to remove the time taken to read from disk).
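The in-memory approach described above might look like this (a minimal sketch, assuming the file fits in RAM; `path` is the CSV file under test):

```csharp
// Load the whole file into memory first, then time only the parsing pass,
// so disk I/O is excluded from the measurement.
string text = System.IO.File.ReadAllText(path);

var sw = System.Diagnostics.Stopwatch.StartNew();
using (var reader = new System.IO.StringReader(text))
{
    var c = new NReco.Csv.CsvReader(reader);
    while (c.Read()) { /* touch fields only; keep nothing */ }
}
sw.Stop();
Console.WriteLine($"parse only: {sw.Elapsed.TotalSeconds:F2} s");
```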

I also wasn't doing any "transformations" (like turning strings into integers) after reading... Edit: that means your ToOBJ would return false (to avoid out-of-scope allocations).

mgholam commented 4 years ago

You can exclude the conversions if you like (they are the same in any case); reading into memory first is probably not a real-world example, but you could do that as well. Excluding all of these, there is very little left to benchmark.

KPixel commented 4 years ago

My reasoning is that this is a "micro-benchmark". So, excluding all external factors is the way to go. When people want to see how a library works in their specific scenario, they will include these external factors accordingly.

Here is a great example of why that is important: In your test, you write: "nreco.csv : 19.05s 800Mb used".

But I have tested NReco and confirmed that it does no allocation (outside of a small buffer). So it is possible to read millions of rows with NReco and allocate only a few megabytes of RAM. However, your test gives the impression that NReco itself is allocating those 800 MB.

mgholam commented 4 years ago

Well, if you don't keep the rows in a list, then no memory is used by any library (fastCSV runs at about 5 MB in Task Manager).

mgholam commented 4 years ago

What numbers are you getting?

KPixel commented 4 years ago

My point regarding memory usage is that if a test using NReco allocates 800 MB, you can't compare that with another test using fastCSV that allocates 639 MB, since NReco is not allocating that memory itself.
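One way to attribute allocations to the read loop itself, rather than eyeballing the process in Task Manager, is to ask the runtime directly (a sketch; `GC.GetAllocatedBytesForCurrentThread` requires a recent .NET Core runtime):

```csharp
// Measure bytes allocated by this thread across the read loop only.
long before = GC.GetAllocatedBytesForCurrentThread();

using (var text = System.IO.File.OpenText(path))
{
    var c = new NReco.Csv.CsvReader(text);
    while (c.Read()) { /* read only; don't store rows */ }
}

long after = GC.GetAllocatedBytesForCurrentThread();
Console.WriteLine($"allocated by the read loop: {(after - before) / (1024.0 * 1024.0):F1} MB");
```

Run once with the `list.Add(o)` line and once without, and the difference shows how much of the total is the list of rows rather than the parser.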

KPixel commented 4 years ago

To run the test on my machine, I downloaded this repo, added the code for NReco's CsvReader to the fastCSV project, and changed the csproj to run it in real Release mode.

Here is the actual test for NReco (I kept it as close as possible to the one for fastCSV):

var list = new System.Collections.Generic.List<LocalWeatherData>();
int line = 0; // row counter, declared here so the snippet is self-contained
using (var text = File.OpenText(path))
{
    var c = new NReco.Csv.CsvReader(text);
    c.Read(); // skip the header
    while (c.Read())
    {
        var o = new LocalWeatherData();
        bool add = true;
        line++;
        o.WBAN = c[0];
        o.Date = new DateTime(fastCSV.ToInt(c[1], 0, 4),
            fastCSV.ToInt(c[1], 4, 2),
            fastCSV.ToInt(c[1], 6, 2));
        o.SkyCondition = c[4];
        //if (o.Date.Day % 2 == 0)
        //    add = false;
        if (add)
            list.Add(o);
    }
}

And here are the results on my machine:

mgholam commented 4 years ago

Nice! You must have a newer, faster machine.

KPixel commented 4 years ago

Can you please run the test I just wrote on your machine and share the results?

mgholam commented 4 years ago

My apologies, you are right! NReco on my system does it in 7.41 sec

Back to the drawing board!

KPixel commented 4 years ago

Cool. I'm looking forward to the updates.

By the way, another improvement you could make to your API is the ability to stop reading the CSV at any point. For example, a user might just want to extract one specific line, and once that line is found, we can stop.
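With a pull-style reader like NReco's, early exit comes for free because the caller owns the loop (a sketch; the WBAN value "53802" is a made-up target):

```csharp
using (var text = System.IO.File.OpenText(path))
{
    var c = new NReco.Csv.CsvReader(text);
    c.Read(); // skip the header
    while (c.Read())
    {
        if (c[0] == "53802") // hypothetical target row
        {
            // Found the row we wanted; stop without reading the rest of the file.
            break;
        }
    }
}
```

A callback-driven API like fastCSV's would need the map delegate (or a separate signal) to be able to say "stop" rather than just "add or skip this row".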

mgholam commented 4 years ago

Will do, thanks.

mgholam commented 4 years ago

Woo Hoo!

Got it down to 6.2 sec (750 MB used, a little more than the previous version) on .NET 4; .NET Core does it in 6.4 sec with 669 MB used.

KPixel commented 4 years ago

Congrats, fastCSV now takes 4.1 sec on this test on my machine.

By the way, do you want to support the case where the number of columns varies per row? Currently, I get an exception here when that happens.

mgholam commented 4 years ago

Hmmm, generally you can have fewer columns than the header; I will look into it.

Try playing with the buffer size on your machine to see if the times change (line 127, currently 64*1024).
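A quick way to see how buffer size affects raw read speed on a given machine, without editing the library source for each run (a sketch that only isolates the I/O side, with no CSV parsing):

```csharp
// Time one full pass over the file for a few candidate buffer sizes.
foreach (int size in new[] { 16 * 1024, 64 * 1024, 256 * 1024, 1024 * 1024 })
{
    var sw = System.Diagnostics.Stopwatch.StartNew();
    using (var fs = new System.IO.FileStream(path, System.IO.FileMode.Open,
        System.IO.FileAccess.Read, System.IO.FileShare.Read, size))
    using (var reader = new System.IO.StreamReader(fs, System.Text.Encoding.UTF8, true, size))
    {
        while (reader.ReadLine() != null) { } // raw line reads, no parsing
    }
    sw.Stop();
    Console.WriteLine($"{size / 1024} KB buffer: {sw.Elapsed.TotalSeconds:F2} s");
}
```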

KPixel commented 4 years ago

Fewer, sure. The exception occurs when there are more.

Like:

a,b,c
0,1,2,3