h2oai / db-benchmark

reproducible benchmark of database-like ops
https://h2oai.github.io/db-benchmark
Mozilla Public License 2.0

update Julia benchmark #145

Closed bkamins closed 4 years ago

bkamins commented 4 years ago

I have updated the code to the latest release of DataFrames.jl (and the current stable Julia version). Hopefully I have not broken anything.

CC @nalimilan

jangorecki commented 4 years ago

Thank you, will re-run soon.

jangorecki commented 4 years ago

Just letting you know: I ran the benchmark on the old version of the scripts, before pulling the changes in this PR, and it seems that the deprecation process could be handled a little better. There are some deprecation warnings:

┌ Warning: `combine(gd; target_col = source_cols => fun, ...)` is deprecated, use `combine(gd, source_cols => fun => :target_col, ...)` instead
│   caller = ip:0x0
└ @ Core :-1

but errors happen as well:

ERROR: LoadError: MethodError: no method matching (::getfield(Main, Symbol("##3#4")))(::SubArray{Int64,1,Array{Int64,1},Tuple{Array{Int64,1}},false}, ::SubArray{Int64,1,Array{Int64,1},Tuple{Array{Int64,1}},false})
Closest candidates are:
  #3(::Any) at /home/jan/git/db-benchmark/juliadf/groupby-juliadf.jl:133
Stacktrace:
 [1] do_call(::getfield(Main, Symbol("##3#4")), ::Array{Int64,1}, ::Array{Int64,1}, ::Array{Int64,1}, ::GroupedDataFrame{DataFrame}, ::Tuple{Array{Int64,1},Array{Int64,1}}, ::Int64) at /home/jan/.julia/packages/DataFrames/3ZmR2/src/groupeddataframe/splitapplycombine.jl:717
 [2] _combine(::Array{Pair,1}, ::GroupedDataFrame{DataFrame}, ::Array{Symbol,1}, ::Bool, ::Bool) at /home/jan/.julia/packages/DataFrames/3ZmR2/src/groupeddataframe/splitapplycombine.jl:1078
 [3] #combine_helper#395(::Bool, ::Bool, ::Bool, ::Bool, ::Function, ::Array{Pair,1}, ::GroupedDataFrame{DataFrame}, ::Array{Symbol,1}) at /home/jan/.julia/packages/DataFrames/3ZmR2/src/groupeddataframe/splitapplycombine.jl:583
 [4] (::getfield(DataFrames, Symbol("#kw##combine_helper")))(::NamedTuple{(:keepkeys, :ungroup, :copycols, :keeprows),NTuple{4,Bool}}, ::typeof(DataFrames.combine_helper), ::Array{Pair,1}, ::GroupedDataFrame{DataFrame}, ::Array{Symbol,1}) at ./none:0
 [5] #_combine_prepare#383(::Bool, ::Bool, ::Bool, ::Bool, ::Function, ::GroupedDataFrame{DataFrame}, ::Union{Colon, typeof(nrow), Regex, AbstractString, Signed, Symbol, Unsigned, Pair, AbstractArray{T,1} where T, All, Between, InvertedIndex}) at /home/jan/.julia/packages/DataFrames/3ZmR2/src/groupeddataframe/splitapplycombine.jl:546
 [6] #_combine_prepare at ./none:0 [inlined]
 [7] #combine#382 at /home/jan/.julia/packages/DataFrames/3ZmR2/src/groupeddataframe/splitapplycombine.jl:466 [inlined]
 [8] combine at /home/jan/.julia/packages/DataFrames/3ZmR2/src/groupeddataframe/splitapplycombine.jl:466 [inlined]
 [9] #combine#392 at /home/jan/.julia/packages/DataFrames/3ZmR2/src/groupeddataframe/splitapplycombine.jl:558 [inlined]
 [10] #combine at ./none:0 [inlined]
 [11] #by#545(::Bool, ::Bool, ::Base.Iterators.Pairs{Symbol,Pair{Array{Symbol,1},getfield(Main, Symbol("##3#4"))},Tuple{Symbol},NamedTuple{(:range_v1_v2,),Tuple{Pair{Array{Symbol,1},getfield(Main, Symbol("##3#4"))}}}}, ::Function, ::DataFrame, ::Array{Symbol,1}) at /home/jan/.julia/packages/DataFrames/3ZmR2/src/deprecated.jl:355
 [12] (::getfield(DataFrames, Symbol("#kw##by")))(::NamedTuple{(:range_v1_v2,),Tuple{Pair{Array{Symbol,1},getfield(Main, Symbol("##3#4"))}}}, ::typeof(by), ::DataFrame, ::Array{Symbol,1}) at ./none:0
 [13] top-level scope at util.jl:213
 [14] include at ./boot.jl:317 [inlined]
 [15] include_relative(::Module, ::String) at ./loading.jl:1044
 [16] include(::Module, ::String) at ./sysimg.jl:29
 [17] exec_options(::Base.JLOptions) at ./client.jl:231
 [18] _start() at ./client.jl:425
in expression starting at /home/jan/git/db-benchmark/juliadf/groupby-juliadf.jl:133

bkamins commented 4 years ago

it seems that the deprecation process could be handled a little better

Thank you for reporting. I have run a similar test on some sample data and it went through cleanly (which means we have some corner case here - which is good to know :)). Thank you!

bkamins commented 4 years ago

@nalimilan - I have tracked down the problem with the deprecation warning that @jangorecki pointed to.

The reason is that in calls like by(df, :col, targetcol = [:col1, :col2] => fun) we switched from passing a NamedTuple to passing positional arguments to fun.

We do not handle this change correctly in the deprecation code right now. It is hard to do correctly in 100% of cases (although of course it is doable). Do you think it is worth fixing? Initially I omitted it because the by(df, :col, targetcol = [:col1, :col2] => fun) form (getting a single column as a result of passing multiple columns to fun) is relatively rare in normal user code (it happens in the benchmark because we are stress-testing the package though :smile:).
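
For illustration, a minimal sketch of the behavioural difference described above. The toy data and column names are assumptions made for the example (they are not the benchmark data), and the function mirrors the range_v1_v2 query from the error above. It assumes a DataFrames.jl release that supports the source => fun => target syntax and AsTable.

```julia
using DataFrames

# Toy data (an assumption for illustration, not the benchmark dataset).
df = DataFrame(id = [1, 1, 2, 2], v1 = [1, 5, 2, 8], v2 = [3, 4, 0, 6])
gd = groupby(df, :id)

# New behaviour: each source column is passed to the function
# as a separate positional argument.
positional = combine(gd, [:v1, :v2] => ((v1, v2) -> maximum(v1) - minimum(v2)) => :range_v1_v2)

# AsTable: the function receives a single NamedTuple of column vectors,
# which is what a one-argument function written for the old form expects.
astable = combine(gd, AsTable([:v1, :v2]) => (x -> maximum(x.v1) - minimum(x.v2)) => :range_v1_v2)
```

A one-argument function such as x -> maximum(x.v1) - minimum(x.v2) fails in the first form because it is called with two arguments, which matches the MethodError reported above.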

nalimilan commented 4 years ago

What kind of thing could we do to handle that deprecation?

bkamins commented 4 years ago

In the deprecation code of by we could detect whether the user passed multiple columns as the source and then wrap it in AsTable. The problem is that the deprecation warning would get very messy: we should not use AsTable if only a single column was passed, so there would be a complex comprehension checking whether the first element of the Pair is a ColumnIndex or a MultiColumnIndex and wrapping it in AsTable or not as appropriate.

Alternatively, we could leave it unfixed and instead print an additional message explaining the above. That should be good enough: we will throw an error later, so at least the user will not silently get a wrong result.
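
A rough sketch of the first option, under simplifying assumptions. The helpers wrap_source and translate are hypothetical names invented for this example and are not part of DataFrames.jl. The check is reduced to "is the source a vector", whereas the real deprecation code would have to cover every kind of multi-column selector.

```julia
using DataFrames

# Hypothetical helpers for illustration; these names are not part of DataFrames.jl.
# Wrap a multi-column source in AsTable; leave a single-column source untouched.
wrap_source(src) = src isa AbstractVector ? AsTable(src) : src

# Translate the old keyword form `target = source => fun`
# into the new positional form `source => fun => target`.
translate(target::Symbol, p::Pair) = wrap_source(first(p)) => (last(p) => target)

# Toy data (an assumption for illustration).
df = DataFrame(id = [1, 1, 2, 2], v1 = [1, 5, 2, 8], v2 = [3, 4, 0, 6])
gd = groupby(df, :id)

# Old-style function that expects one NamedTuple of columns.
old_fun = x -> maximum(x.v1) - minimum(x.v2)

multi = combine(gd, translate(:range_v1_v2, [:v1, :v2] => old_fun))
single = combine(gd, translate(:sum_v1, :v1 => sum))
```

With this wrapping, old code written against the NamedTuple convention keeps working, while single-column sources are passed through unchanged.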

nalimilan commented 4 years ago

Getting old code to work would already be very nice, even if the warning doesn't show you the exact code you need to write to use the new syntax.

bkamins commented 4 years ago

OK - I will take a stab at it.

jangorecki commented 4 years ago

This PR leads to a significant speed-up. See https://h2oai.github.io/db-benchmark/history.html for exact differences. Query times on the 1e9 data were reduced by up to 4 times in some cases.

jangorecki commented 4 years ago

As a result, none of the scripts is now killed by the timeout defined for each script. The only scripts that still fail do so with OutOfMemoryError:

G1_1e9_1e1_0_0

ERROR: LoadError: OutOfMemoryError()
Stacktrace:
 [1] Array at ./boot.jl:405 [inlined]
 [2] copy(::CSV.Column{Float64,Float64}) at /home/jan/.julia/packages/CSV/vyG0T/src/tables.jl:60
 [3] DataFrame(::Array{AbstractArray{T,1} where T,1}, ::DataFrames.Index; copycols::Bool) at /home/jan/.julia/packages/DataFrames/3ZmR2/src/dataframe/dataframe.jl:148
 [4] DataFrame(::Array{Union{CSV.Column, CSV.Column2},1}, ::Array{Symbol,1}; makeunique::Bool, copycols::Bool) at /home/jan/.julia/packages/DataFrames/3ZmR2/src/dataframe/dataframe.jl:242
 [5] #DataFrame#85 at /home/jan/.julia/packages/CSV/vyG0T/src/CSV.jl:43 [inlined]
 [6] DataFrame(::CSV.File{false}) at /home/jan/.julia/packages/CSV/vyG0T/src/CSV.jl:43
 [7] top-level scope at /home/jan/git/db-benchmark/juliadf/groupby-juliadf.jl:26
 [8] include(::Module, ::String) at ./Base.jl:377
 [9] exec_options(::Base.JLOptions) at ./client.jl:288
 [10] _start() at ./client.jl:484
in expression starting at /home/jan/git/db-benchmark/juliadf/groupby-juliadf.jl:26

G1_1e9_2e0_0_0

ERROR: LoadError: OutOfMemoryError()
Stacktrace:
 [1] Array at ./boot.jl:405 [inlined]
 [2] copy(::CSV.Column{Int64,Int64}) at /home/jan/.julia/packages/CSV/vyG0T/src/tables.jl:60
 [3] DataFrame(::Array{AbstractArray{T,1} where T,1}, ::DataFrames.Index; copycols::Bool) at /home/jan/.julia/packages/DataFrames/3ZmR2/src/dataframe/dataframe.jl:148
 [4] DataFrame(::Array{Union{CSV.Column, CSV.Column2},1}, ::Array{Symbol,1}; makeunique::Bool, copycols::Bool) at /home/jan/.julia/packages/DataFrames/3ZmR2/src/dataframe/dataframe.jl:242
 [5] #DataFrame#85 at /home/jan/.julia/packages/CSV/vyG0T/src/CSV.jl:43 [inlined]
 [6] DataFrame(::CSV.File{false}) at /home/jan/.julia/packages/CSV/vyG0T/src/CSV.jl:43
 [7] top-level scope at /home/jan/git/db-benchmark/juliadf/groupby-juliadf.jl:26
 [8] include(::Module, ::String) at ./Base.jl:377
 [9] exec_options(::Base.JLOptions) at ./client.jl:288
 [10] _start() at ./client.jl:484
in expression starting at /home/jan/git/db-benchmark/juliadf/groupby-juliadf.jl:26

J1_1e9_NA_0_0 (this is join, not groupby)

ERROR: LoadError: OutOfMemoryError()
Stacktrace:
 [1] Array at ./boot.jl:405 [inlined]
 [2] rehash!(::Dict{String,UInt64}, ::Int64) at ./dict.jl:192
 [3] _setindex! at ./dict.jl:367 [inlined]
 [4] getref!(::Dict{String,UInt64}, ::CSV.PointerString, ::Array{UInt64,1}, ::Int64, ::Int16, ::Parsers.Options{false,false,true,false,Missing,UInt8,Nothing}) at /home/jan/.julia/packages/CSV/vyG0T/src/file.jl:840
 [5] parsepooled!(::Int8, ::Array{UInt64,1}, ::Array{UInt8,1}, ::Int64, ::Int64, ::Parsers.Options{false,false,true,false,Missing,UInt8,Nothing}, ::Int64, ::Int64, ::Int64, ::Float64, ::Array{Dict{String,UInt64},1}, ::Array{UInt64,1}, ::Array{Int8,1}, ::Array{Array{UInt64,1},1}) at /home/jan/.julia/packages/CSV/vyG0T/src/file.jl:862
 [6] parserow at /home/jan/.julia/packages/CSV/vyG0T/src/file.jl:524 [inlined]
 [7] parsetape(::Val{false}, ::Int64, ::Dict{Int8,Int8}, ::Array{Array{UInt64,1},1}, ::Array{Array{UInt64,1},1}, ::Array{UInt8,1}, ::Int64, ::Int64, ::Int64, ::Array{Int64,1}, ::Float64, ::Array{Dict{String,UInt64},1}, ::Array{UInt64,1}, ::Int64, ::Array{Int8,1}, ::Array{Int64,1}, ::Bool, ::Parsers.Options{false,false,true,false,Missing,UInt8,Nothing}, ::Nothing) at /home/jan/.julia/packages/CSV/vyG0T/src/file.jl:455
 [8] CSV.File(::String; header::Int64, normalizenames::Bool, datarow::Int64, skipto::Nothing, footerskip::Int64, limit::Int64, transpose::Bool, comment::Nothing, use_mmap::Bool, ignoreemptylines::Bool, threaded::Nothing, select::Nothing, drop::Nothing, missingstrings::Array{String,1}, missingstring::String, delim::Nothing, ignorerepeated::Bool, quotechar::Char, openquotechar::Nothing, closequotechar::Nothing, escapechar::Char, dateformat::Nothing, dateformats::Nothing, decimal::UInt8, truestrings::Array{String,1}, falsestrings::Array{String,1}, type::Nothing, types::Nothing, typemap::Dict{Int8,Int8}, categorical::Bool, pool::Bool, strict::Bool, silencewarnings::Bool, debug::Bool, parsingdebug::Bool) at /home/jan/.julia/packages/CSV/vyG0T/src/file.jl:252
 [9] top-level scope at /home/jan/git/db-benchmark/juliadf/join-juliadf.jl:30
 [10] include(::Module, ::String) at ./Base.jl:377
 [11] exec_options(::Base.JLOptions) at ./client.jl:288
 [12] _start() at ./client.jl:484
in expression starting at /home/jan/git/db-benchmark/juliadf/join-juliadf.jl:30

bkamins commented 4 years ago

Joins are on the to-do list for optimization (we have not touched them yet).

The two other errors are CSV.jl related (@quinnj - memory management when reading CSVs is a long-pending issue that hopefully could be improved relative to other frameworks - do you see any chance of this in the near future?)

davidanthoff commented 4 years ago

Have you tried to load the dataset with CSVFiles.jl? load(filename) |> DataFrame. At least in my experience it deals pretty well with very large files that end up being close to the memory available on the system.

bkamins commented 4 years ago

@davidanthoff - it would be great if you could share the code that would do the job, as you know the package best (just please note that we specify column types on load). Thank you!

davidanthoff commented 4 years ago

If you don't want to specify the column types, you just do:

df = load("foo.csv") |> DataFrame

If you want to specify column types, it would be like this:

load("foo.csv", colparsers=[String,Int,Union{Missing,Float64}]) |> DataFrame

There is no concept of pooling in CSVFiles.jl, so you would need to convert any column you want pooled after the reading to a categorical representation.
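
A minimal sketch of that post-load conversion, assuming CategoricalArrays.jl is used for the categorical representation. The data frame is built in memory here in place of an actual load call, and the column names are assumptions for illustration.

```julia
using DataFrames, CategoricalArrays

# Stand-in for a data frame produced by load("foo.csv") |> DataFrame.
df = DataFrame(id1 = ["id001", "id002", "id001", "id003"], v1 = [1, 2, 3, 4])

# Convert the string column to a categorical (pooled) representation after reading.
df.id1 = categorical(df.id1)
```

The same conversion would be repeated for each column that should be pooled.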