h2oai / db-benchmark

reproducible benchmark of database-like ops
https://h2oai.github.io/db-benchmark
Mozilla Public License 2.0

Provide types explicitly when parsing CSV for juliadf #119

Closed nalimilan closed 4 years ago

nalimilan commented 4 years ago

CSV.jl should be able to use less memory when types are provided explicitly, which could avoid the memory error with 50GB datasets.

See also https://github.com/h2oai/db-benchmark/pull/85 and https://github.com/JuliaData/CSV.jl/issues/432#issuecomment-541965272.
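As a minimal sketch of what providing types explicitly can look like, the snippet below passes a `types` dictionary to `CSV.File`. The file contents, column names and chosen types are assumptions for illustration only and are not taken from the benchmark scripts. The exact call form may also differ between CSV.jl versions.

```julia
using CSV, DataFrames

# Hypothetical miniature input file; column names and types are
# illustrative assumptions, not the benchmark's actual schema.
path = tempname() * ".csv"
write(path, """
id1,id4,v1,v3
id001,7,1,0.5
id002,3,4,1.25
""")

# Declare the column types up front instead of relying on type detection.
coltypes = Dict(:id1 => String, :id4 => Int32, :v1 => Int32, :v3 => Float64)

# Parse the file with the declared types and materialize a DataFrame.
df = DataFrame(CSV.File(path; types=coltypes))
```

Whether this lowers peak memory enough for the largest datasets depends on the CSV.jl version and on the data.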

jangorecki commented 4 years ago

@nalimilan it looks like it didn't help much, but it is good to have that in place anyway. Another option is to use a binary format. It has to be stable over time. It also has to be portable: I can't produce the binary files on the machine that runs the benchmark (125GB) because reading the CSV fails with OOM there, so I need to copy the CSV to a bigger machine (255GB), read it and produce the binary files from there. AFAIR, the last time I looked at serialization in Julia I didn't see a feasible solution.

Julia 1.0.2, DataFrames 0.19.4, CSV 0.5.17


G1_1e9_1e1_0_0 - killed by the OS (OOM)

/bin/bash: line 1:  6129 Killed                  ./juliadf/groupby-juliadf.jl > out/run_juliadf_groupby_G1_1e9_1e1_0_0.out 2> out/run_juliadf_groupby_G1_1e9_1e1_0_0.err

G1_1e9_1e2_0_[0|1]

ERROR: LoadError: OutOfMemoryError()
Stacktrace:
 [1] Type at ./boot.jl:394 [inlined]
 [2] copy(::CSV.Column{Float64,Float64}) at /home/jan/.julia/packages/CSV/yJFAJ/src/tables.jl:16
 [3] (::getfield(DataFrames, Symbol("##DataFrame#91#94")))(::Bool, ::Type, ::Array{AbstractArray{T,1} where T,1}, ::DataFrames.Index) at /home/jan/.julia/packages/DataFrames/yH0f6/src/dataframe/dataframe.jl:130
...

G1_1e9_2e0_0_0 - timeout