Closed by nalimilan 4 years ago
@nalimilan Looks like it didn't help much, but it's good to have that in place anyway. Another option is to use a binary format. It has to be stable over time, and it also has to be portable, because I won't be able to produce the binary files on the machine running the benchmark (125 GB of RAM): reading the CSV there hits OOM. I need to copy the CSV to a bigger machine (255 GB), read it, and produce the binary files from there. AFAIR, when I last looked at serialization in Julia I didn't find a feasible solution.
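One candidate for a stable, portable binary format is Arrow, whose on-disk layout is specified independently of Julia. A rough sketch using the Arrow.jl package (file names are illustrative, and Arrow.jl may not support the old Julia/package versions listed below — this is an assumption, not a tested workflow):

```julia
using Arrow, CSV, DataFrames

# On the big (255 GB) machine: read the CSV once and write a portable binary file.
df = CSV.read("G1_1e9_1e2_0_0.csv", DataFrame)
Arrow.write("G1_1e9_1e2_0_0.arrow", df)

# On the benchmark (125 GB) machine: Arrow files are memory-mapped on read,
# so materializing the table should need far less peak memory than CSV parsing.
df = DataFrame(Arrow.Table("G1_1e9_1e2_0_0.arrow"))
```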
julia 1.0.2
DataFrames 0.19.4
CSV 0.5.17
G1_1e9_1e1_0_0
- killed OS OOM
/bin/bash: line 1: 6129 Killed ./juliadf/groupby-juliadf.jl > out/run_juliadf_groupby_G1_1e9_1e1_0_0.out 2> out/run_juliadf_groupby_G1_1e9_1e1_0_0.err
G1_1e9_1e2_0_[0|1]
ERROR: LoadError: OutOfMemoryError()
Stacktrace:
 [1] Type at ./boot.jl:394 [inlined]
 [2] copy(::CSV.Column{Float64,Float64}) at /home/jan/.julia/packages/CSV/yJFAJ/src/tables.jl:16
 [3] (::getfield(DataFrames, Symbol("##DataFrame#91#94")))(::Bool, ::Type, ::Array{AbstractArray{T,1} where T,1}, ::DataFrames.Index) at /home/jan/.julia/packages/DataFrames/yH0f6/src/dataframe/dataframe.jl:130
 ...
G1_1e9_2e0_0_0
- timeout
CSV.jl should be able to use less memory when column types are provided explicitly, since that lets it skip type inference; this could avoid the OutOfMemoryError on the 50 GB datasets.
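A minimal sketch of supplying explicit types via the `types` keyword. The column names assume the usual G1 groupby schema (`id1`..`id6`, `v1`..`v3`), and the file path is illustrative; note the thread's CSV 0.5.x returned a DataFrame directly, whereas newer CSV.jl versions take a sink argument as shown here:

```julia
using CSV, DataFrames

# Supplying `types` up front avoids type inference during parsing, which
# may lower peak memory use (an assumption here, not a measured result).
df = CSV.read("G1_1e9_1e2_0_0.csv", DataFrame;
    types=Dict(:id1 => String, :id2 => String, :id3 => String,
               :id4 => Int32, :id5 => Int32, :id6 => Int32,
               :v1 => Int32, :v2 => Int32, :v3 => Float64))
# On CSV 0.5.x the equivalent call would be CSV.read(path; types=...).
```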
See also https://github.com/h2oai/db-benchmark/pull/85 and https://github.com/JuliaData/CSV.jl/issues/432#issuecomment-541965272.