h2oai / db-benchmark

reproducible benchmark of database-like ops
https://h2oai.github.io/db-benchmark
Mozilla Public License 2.0

avoid using multiple threads in CSV parsing #211

Closed bkamins closed 3 years ago

bkamins commented 3 years ago

We do not measure CSV.jl parsing performance in this benchmark, so this PR disables multi-threaded CSV reading in the Julia tests.

jangorecki commented 3 years ago

It can still be useful, because each script has a timeout defined in _control/timeout.csv. If reading the CSV takes too much time, the script may be terminated later on for reaching that timeout.

bkamins commented 3 years ago

In the past (i.e. before the last benchmark) we always used a single thread and we did not hit these timeouts, right?

jangorecki commented 3 years ago

A long time ago it was hitting the limits, but I don't think it will be a problem now.

jangorecki commented 3 years ago

I ran it interactively using 20 cores:

# before change proposed in this PR
52.193861 seconds (710.69 k allocations: 3.925 GiB, 81.14% gc time, 0.88% compilation time)
44.617192 seconds (420 allocations: 3.886 GiB, 86.10% gc time)
# after
19.729487 seconds (103.54 k allocations: 3.989 GiB, 73.19% gc time, 0.07% compilation time)
35.267162 seconds (415 allocations: 5.253 GiB, 80.55% gc time)

Now running the full benchmark.

bkamins commented 3 years ago

Thank you! This is what I expected: the GC issue is not resolved, but using a single thread for CSV reading lessens the problem. (@quinnj: what @jangorecki reports is exactly the same multi-threading issue that I reported to you.)

Additionally: if we resolve the GC issue, the run-time of this query should be around 7 seconds, which is about what I get on a machine with enough RAM (and this would bring us within a reasonable range of the other packages).
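As a rough cross-check of the "around 7 seconds" estimate, the timing lines reported above can be split into GC and non-GC time. The sketch below is written in Python for convenience and assumes that the reported "gc time" percentage is a share of the total elapsed time on each line. The labels, `PATTERN`, and `non_gc_seconds` are illustrative names, not part of the benchmark, and compilation time is not subtracted.

```python
import re

# Timing lines copied verbatim from the report earlier in this thread.
# The labels are illustrative only; the thread does not say which
# statement each line belongs to.
TIMINGS = {
    "before, line 1": "52.193861 seconds (710.69 k allocations: 3.925 GiB, 81.14% gc time, 0.88% compilation time)",
    "before, line 2": "44.617192 seconds (420 allocations: 3.886 GiB, 86.10% gc time)",
    "after, line 1": "19.729487 seconds (103.54 k allocations: 3.989 GiB, 73.19% gc time, 0.07% compilation time)",
    "after, line 2": "35.267162 seconds (415 allocations: 5.253 GiB, 80.55% gc time)",
}

# Captures the elapsed seconds and the GC percentage from one timing line.
PATTERN = re.compile(r"([\d.]+) seconds .*?([\d.]+)% gc time")


def non_gc_seconds(line: str) -> float:
    """Estimate the time spent outside GC for one timing line.

    Assumption: the GC percentage is a fraction of the total elapsed time.
    """
    match = PATTERN.search(line)
    if match is None:
        raise ValueError(f"unrecognised timing line: {line!r}")
    total_seconds = float(match.group(1))
    gc_percent = float(match.group(2))
    return total_seconds * (1.0 - gc_percent / 100.0)


estimates = {label: non_gc_seconds(line) for label, line in TIMINGS.items()}

for label, seconds in estimates.items():
    print(f"{label}: ~{seconds:.2f} s outside GC")
```

Under that assumption the four lines leave roughly 9.8, 6.2, 5.3 and 6.9 seconds outside GC, which is in the same range as the 7-second figure. This is only a back-of-the-envelope check: removing GC pressure would not necessarily recover exactly this time.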