Hi,
Would you please share the code you used to benchmark all the CSV parsers?
I did some research on them a few months ago, and NReco was a lot faster than the numbers in the README.md...
It was the test project loading the 4 million row CSV (I don't have the code, but you can easily write it again).
The numbers are for my PC; yours will vary, but the relative differences should be about the same.
Let me know how the fastCSV test runs on your machine.
I was asking for your test to understand how you went about it.
I have some as well, but they are probably quite different. For example: since the goal is to test the parser, I would first load the whole CSV into memory (to remove the time taken to read from disk).
I also wasn't doing any "transformations" (like turning strings into integers) after reading... Edit: That means your ToOBJ would return false (to avoid out-of-scope allocations).
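For reference, a minimal sketch of that parse-only setup, assuming the ReadFile signature shown in the fastCSV README (returning false from the map function keeps the result list empty):

var rows = fastCSV.ReadFile<LocalWeatherData>(
    path,   // the 4 million row CSV
    true,   // file has a header row
    ',',    // delimiter
    (o, c) =>
    {
        o.WBAN = c[0];
        o.SkyCondition = c[4];
        return false; // discard the row: measure parsing, not list growth
    });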
You can exclude the conversions if you like (they are the same in any case). Reading into memory first is probably not a real-world example, but you could do that also. Excluding all of these, there is very little left to benchmark.
My reasoning is that this is a "micro-benchmark", so excluding all external factors is the way to go. When people want to see how a library performs in their specific scenario, they will include those external factors accordingly.
Here is a great example of why that is important: In your test, you write: "nreco.csv : 19.05s 800Mb used".
But I have tested NReco and confirmed that it does no allocation (outside of a small buffer). So it is possible to read millions of rows with NReco and only allocate a few megabytes of RAM. However, your test gives the impression that NReco itself is allocating those 800 MB.
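A sketch of such a streaming read (nothing is retained, so memory stays flat at roughly the size of NReco's internal buffer):

using (var text = File.OpenText(path))
{
    var c = new NReco.Csv.CsvReader(text);
    c.Read(); // skip the header
    long rows = 0;
    while (c.Read())
        rows++; // fields are reachable via c[i], but nothing is stored
}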
Well, if you don't keep the rows in a list, then no memory is used by any library (fastCSV runs with 5 MB in Task Manager).
What numbers are you getting?
My point regarding memory usage is that if a test using NReco allocates 800 MB, you can't compare that with another test using fastCSV that allocates 639 MB, since NReco itself is not allocating that memory.
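One way to separate the library's allocations from the test's (a sketch; GC.GetTotalAllocatedBytes requires .NET Core 3.0 or later) is to measure managed allocations directly rather than reading Task Manager:

long before = GC.GetTotalAllocatedBytes(precise: true);

// ... run the parse-only loop here, without list.Add ...

long after = GC.GetTotalAllocatedBytes(precise: true);
Console.WriteLine($"Allocated by the parser itself: {(after - before) / (1024 * 1024)} MB");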
To run the test on my machine, I downloaded this repo, added the code for NReco CsvReader to the fastCSV project, and changed the csproj to run it in a real Release build.
Here is the actual test for NReco (I kept it as close as possible to the one for fastCSV):
var list = new System.Collections.Generic.List<LocalWeatherData>();
long line = 0; // row counter
using (var text = File.OpenText(path)) // path points to the 4 million row CSV
{
    var c = new NReco.Csv.CsvReader(text);
    c.Read(); // skip the header
    while (c.Read())
    {
        var o = new LocalWeatherData();
        bool add = true;
        line++;
        o.WBAN = c[0];
        o.Date = new DateTime(fastCSV.ToInt(c[1], 0, 4),
                              fastCSV.ToInt(c[1], 4, 2),
                              fastCSV.ToInt(c[1], 6, 2));
        o.SkyCondition = c[4];
        //if (o.Date.Day % 2 == 0)
        //    add = false;
        if (add)
            list.Add(o);
    }
}
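(The original timing code isn't shown; assume a simple Stopwatch harness around the block above, along these lines:)

var sw = System.Diagnostics.Stopwatch.StartNew();
// ... the NReco reading loop above ...
sw.Stop();
Console.WriteLine($"{sw.Elapsed.TotalSeconds:0.00}s, {list.Count:n0} rows");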
And here are the results on my machine:
Nice! You must have a newer, faster machine.
Can you please run the test I just wrote on your machine and share the results?
My apologies, you are right! NReco on my system does it in 7.41 sec
Back to the drawing board!
Cool. I'm looking forward to the updates.
By the way, another improvement you could make to your API is the ability to stop reading the CSV at any point. For example, a user might want to extract one specific row, and once that row is found, reading can stop.
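With NReco's pull-based API this already falls out naturally: you break out of the read loop and the rest of the file is never parsed (a sketch; targetWBAN is a made-up search key):

string targetWBAN = "53889"; // hypothetical search key
using (var text = File.OpenText(path))
{
    var c = new NReco.Csv.CsvReader(text);
    c.Read(); // skip the header
    while (c.Read())
    {
        if (c[0] == targetWBAN)
        {
            Console.WriteLine(c[4]); // use the row
            break; // stop: the rest of the file is never read
        }
    }
}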
Will do, thanks.
Woo Hoo!
Got it down to 6.2 sec on .NET 4 (750 MB used, a little more than the previous version); .NET Core does it in 6.4 sec with 669 MB used.
Congrats, fastCSV now takes 4.1 sec on this test on my machine.
By the way, do you want to support the case where the number of columns varies per row? Currently, I get an exception here when that happens.
Hmmm, generally you can have fewer columns than the header; I will look into it.
Try playing with the buffer size on your machine to see if the times change (line 127, currently 64*1024).
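For context, that constant is the read-buffer size; the usual pattern looks like this (a sketch of a generic FileStream setup, not fastCSV's actual code), and varying bufsize changes how much is pulled from disk per read:

const int bufsize = 64 * 1024; // try 32*1024, 128*1024, ... and compare timings
using (var fs = new FileStream(path, FileMode.Open, FileAccess.Read, FileShare.Read, bufsize))
using (var reader = new StreamReader(fs))
{
    // parse from reader as usual
}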
Fewer, sure. The exception happens when a row has more columns than the header, like:
a,b,c
0,1,2,3
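Until the library handles this, a defensive pattern on the consumer side (a sketch using NReco, whose CsvReader exposes FieldsCount) is to index only the columns that actually exist:

c.Read(); // header row: "a,b,c"
int headerCount = c.FieldsCount; // 3

while (c.Read())
{
    // A row like "0,1,2,3" has FieldsCount == 4 here; clamp instead of throwing.
    int n = Math.Min(headerCount, c.FieldsCount);
    for (int i = 0; i < n; i++)
        Console.Write(c[i] + " ");
    Console.WriteLine();
}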