h2oai / db-benchmark

reproducible benchmark of database-like ops
https://h2oai.github.io/db-benchmark
Mozilla Public License 2.0

Use pool=true when reading CSV file in Julia #85

Closed nalimilan closed 5 years ago

nalimilan commented 5 years ago

The pool argument replaces categorical, and doesn't suffer from the performance problems with a large number of unique values which forced us to use categorical=0.05.

Cc: @bkamins
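A minimal sketch of the change, assuming the current CSV.jl API where CSV.read takes a sink argument (older CSV.jl versions returned a DataFrame directly); the file name is illustrative:

```julia
using CSV, DataFrames

# Before: pool only columns with at most 5% unique values, to dodge
# the slowdown that pooling high-cardinality columns used to cause:
# df = CSV.read("G1_1e7_1e2.csv", DataFrame; categorical=0.05)

# After: pool=true pools string columns without that penalty.
df = CSV.read("G1_1e7_1e2.csv", DataFrame; pool=true)
```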

bkamins commented 5 years ago

Nice. The only question is whether we should use copycols=true here or not (in my preliminary benchmarks it is ~20% faster).

CC @quinnj

nalimilan commented 5 years ago

You mean it's faster for grouping benchmarks? I would expect all columns to be PooledArrays regardless of copycols. :-/

bkamins commented 5 years ago

Yes - I guess the problem is not with the grouping columns but with the columns on which you aggregate. I.e. it is faster to aggregate a Vector{Float64} than a CSV.Column{Float64,Float64} using e.g. sum.
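A rough way to observe the difference described here (CSV.Column was CSV.jl's internal lazy column type at the time, so this compares via the copycols keyword; the file name and column name x are assumptions, and exact signatures varied across CSV.jl versions):

```julia
using CSV, DataFrames, BenchmarkTools

f = CSV.File("data.csv")
lazy  = DataFrame(f; copycols=false)  # columns may remain lazy CSV columns
eager = DataFrame(f; copycols=true)   # columns copied into plain Vectors

@btime sum($(lazy.x))   # aggregating the lazy column
@btime sum($(eager.x))  # plain Vector{Float64}, typically faster
```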

nalimilan commented 5 years ago

Indeed. I wonder whether CSV.read should copy columns by default. That sounds simpler for users, and performance can be better with a plain Vector.

quinnj commented 5 years ago

What we need to do is rewrite the grouping optimizations to use the proposed DataAPI.refarray; then things will be fast on CSV.Column{String, PooledString} and we'll avoid extra allocations.
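For context, a small sketch of what DataAPI.refarray exposes on a pooled column (shown here with PooledArrays; the actual grouping code in DataFrames.jl is more involved):

```julia
using DataAPI, PooledArrays

pa = PooledArray(["a", "b", "a", "c"])

# Grouping can operate on the integer reference codes instead of
# hashing the strings themselves, avoiding extra allocations.
refs = DataAPI.refarray(pa)    # integer codes, e.g. one code per pool level
DataAPI.refvalue(pa, refs[1])  # maps a code back to its value, here "a"
```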

nalimilan commented 5 years ago

What @bkamins mentions is not related to pooled columns: it's that CSV.Column{Float64,Float64} is less efficient than Vector{Float64} for aggregations.

quinnj commented 5 years ago

Ah, @bkamins, can you share an example or open an issue on CSV.jl with the perf difference? There might be some fine-tuning we can do to make it faster.

nalimilan commented 5 years ago

OK, let's use DataFrame(CSV.File(...)) for now, since this benchmark is about (possibly repeated) grouping operations, not CSV parsing. I've added a commit to do that. Now the question is whether the copying won't use too much memory for the VMs before the GC frees the buffer.
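A sketch of the pattern described above, with an assumed file name:

```julia
using CSV, DataFrames

# Parse once, then materialize columns as plain Vectors/PooledArrays,
# since this benchmark times (possibly repeated) grouping, not parsing.
df = DataFrame(CSV.File("G1_1e9_1e2.csv"; pool=true))
```

The trade-off mentioned is memory: DataFrame(...) copies out of the parser's buffers, so peak usage is higher until the GC reclaims them.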

jangorecki commented 5 years ago

Thanks for the PR, I can confirm that the whole script execution is now faster.

bkamins commented 5 years ago

@jangorecki Is the 50GB out-of-memory error happening at the CSV reading stage?

jangorecki commented 5 years ago

@bkamins yes, but it was still not enough for 50GB. Whole script execution for 0.5 GB went from 110 to 80, 115 to 110, 80 to 65, and 180 to 105 (depending on the data). For 5 GB it is much better (1050 to 510, 630 to 410, 570 to 360, 2065 to 855). The biggest gain is for unbalanced data, k=2. Scripts for 50 GB now terminate after 20s instead of running until the timeout of 7200s, with the following:

ERROR: LoadError: SystemError: memory mapping failed: Cannot allocate memory

This saves 8h during each Julia run (4 different datasets of 1e9 size).

nalimilan commented 4 years ago

@jangorecki With the new CSV.jl 0.5.14, memory use should be lower. Any chance you could try again with that version? Thanks!

bkamins commented 4 years ago

@quinnj - if you have some time to spare, you could look at the part of the data-loading code that was updated in this PR to check whether you have any memory-usage comments. Thank you!

nalimilan commented 4 years ago

Though I'd say let's see whether it works first. It would be nice to have things working without special tricks.

jangorecki commented 4 years ago

@nalimilan sorry for the lack of reply earlier. The most recent run of juliadf used CSV 0.5.14. I will merge your PR soon and re-run using the latest CSV.jl.