Closed (bkamins closed 4 years ago)
Thank you, will re-run soon.
Just letting you know: I ran the benchmark still on the old version, before pulling the changes in this PR, and it seems that the deprecation process could be handled a little better. There are some deprecation warnings:
┌ Warning: `combine(gd; target_col = source_cols => fun, ...)` is deprecated, use `combine(gd, source_cols => fun => :target_col, ...)` instead
│ caller = ip:0x0
└ @ Core :-1
but errors happen as well:
ERROR: LoadError: MethodError: no method matching (::getfield(Main, Symbol("##3#4")))(::SubArray{Int64,1,Array{Int64,1},Tuple{Array{Int64,1}},false}, ::SubArray{Int64,1,Array{Int64,1},Tuple{Array{Int64,1}},false})
Closest candidates are:
#3(::Any) at /home/jan/git/db-benchmark/juliadf/groupby-juliadf.jl:133
Stacktrace:
[1] do_call(::getfield(Main, Symbol("##3#4")), ::Array{Int64,1}, ::Array{Int64,1}, ::Array{Int64,1}, ::GroupedDataFrame{DataFrame}, ::Tuple{Array{Int64,1},Array{Int64,1}}, ::Int64) at /home/jan/.julia/packages/DataFrames/3ZmR2/src/groupeddataframe/splitapplycombine.jl:717
[2] _combine(::Array{Pair,1}, ::GroupedDataFrame{DataFrame}, ::Array{Symbol,1}, ::Bool, ::Bool) at /home/jan/.julia/packages/DataFrames/3ZmR2/src/groupeddataframe/splitapplycombine.jl:1078
[3] #combine_helper#395(::Bool, ::Bool, ::Bool, ::Bool, ::Function, ::Array{Pair,1}, ::GroupedDataFrame{DataFrame}, ::Array{Symbol,1}) at /home/jan/.julia/packages/DataFrames/3ZmR2/src/groupeddataframe/splitapplycombine.jl:583
[4] (::getfield(DataFrames, Symbol("#kw##combine_helper")))(::NamedTuple{(:keepkeys, :ungroup, :copycols, :keeprows),NTuple{4,Bool}}, ::typeof(DataFrames.combine_helper), ::Array{Pair,1}, ::GroupedDataFrame{DataFrame}, ::Array{Symbol,1}) at ./none:0
[5] #_combine_prepare#383(::Bool, ::Bool, ::Bool, ::Bool, ::Function, ::GroupedDataFrame{DataFrame}, ::Union{Colon, typeof(nrow), Regex, AbstractString, Signed, Symbol, Unsigned, Pair, AbstractArray{T,1} where T, All, Between, InvertedIndex}) at /home/jan/.julia/packages/DataFrames/3ZmR2/src/groupeddataframe/splitapplycombine.jl:546
[6] #_combine_prepare at ./none:0 [inlined]
[7] #combine#382 at /home/jan/.julia/packages/DataFrames/3ZmR2/src/groupeddataframe/splitapplycombine.jl:466 [inlined]
[8] combine at /home/jan/.julia/packages/DataFrames/3ZmR2/src/groupeddataframe/splitapplycombine.jl:466 [inlined]
[9] #combine#392 at /home/jan/.julia/packages/DataFrames/3ZmR2/src/groupeddataframe/splitapplycombine.jl:558 [inlined]
[10] #combine at ./none:0 [inlined]
[11] #by#545(::Bool, ::Bool, ::Base.Iterators.Pairs{Symbol,Pair{Array{Symbol,1},getfield(Main, Symbol("##3#4"))},Tuple{Symbol},NamedTuple{(:range_v1_v2,),Tuple{Pair{Array{Symbol,1},getfield(Main, Symbol("##3#4"))}}}}, ::Function, ::DataFrame, ::Array{Symbol,1}) at /home/jan/.julia/packages/DataFrames/3ZmR2/src/deprecated.jl:355
[12] (::getfield(DataFrames, Symbol("#kw##by")))(::NamedTuple{(:range_v1_v2,),Tuple{Pair{Array{Symbol,1},getfield(Main, Symbol("##3#4"))}}}, ::typeof(by), ::DataFrame, ::Array{Symbol,1}) at ./none:0
[13] top-level scope at util.jl:213
[14] include at ./boot.jl:317 [inlined]
[15] include_relative(::Module, ::String) at ./loading.jl:1044
[16] include(::Module, ::String) at ./sysimg.jl:29
[17] exec_options(::Base.JLOptions) at ./client.jl:231
[18] _start() at ./client.jl:425
in expression starting at /home/jan/git/db-benchmark/juliadf/groupby-juliadf.jl:133
it seems that deprecation process could be handled little better
Thank you for reporting. I have run a similar test on some sample data and it went through cleanly (which means we have some corner case here - which is good to know :)). Thank you!
@nalimilan - I have tracked down the problem with deprecation warning that @jangorecki pointed to.
The reason is that in calls like by(df, :col, targetcol = [:col1, :col2] => fun) we switched from passing a NamedTuple to passing positional arguments to fun.
We do not handle this change correctly in the deprecation code at the moment. However, it is hard to do it correctly in 100% of cases (though of course it is doable). Do you think it is worth fixing?
Initially I omitted it because the by(df, :col, targetcol = [:col1, :col2] => fun) form (getting a single column as a result of passing multiple columns to fun) is relatively rare in normal user code (it happens in the benchmark as we are stress-testing the package though :smile:).
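To make the difference concrete, here is a minimal sketch of the two calling conventions in the new combine syntax. The data frame and the range computation are made up for illustration (only the range_v1_v2 name is borrowed from the stack trace above), and it assumes a DataFrames.jl version that provides AsTable.

```julia
using DataFrames

# Illustrative data, not the benchmark dataset.
df = DataFrame(id = [1, 1, 2, 2], v1 = [1, 5, 3, 4], v2 = [2, 0, 1, 6])
gd = groupby(df, :id)

# New behavior: each source column is passed to the function as a
# separate positional argument.
positional = combine(gd, [:v1, :v2] => ((v1, v2) -> maximum(v1) - minimum(v2)) => :range_v1_v2)

# A single-argument function written for the old NamedTuple convention
# still works if the source columns are wrapped in AsTable.
as_table = combine(gd, AsTable([:v1, :v2]) => (nt -> maximum(nt.v1) - minimum(nt.v2)) => :range_v1_v2)
```

Both calls should produce the same result here; a one-argument function given unwrapped multiple columns fails with a MethodError like the one reported above.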
What kind of thing could we do to handle that deprecation?
In the deprecation code of by we could detect whether the user passed multiple columns as a source and then wrap them in AsTable. The problem is that the deprecation warning will get very messy (as we should not use AsTable if only a single column was passed, so there will be a complex comprehension checking whether the first element of the Pair is a ColumnIndex or a MultiColumnIndex and then wrapping it in AsTable or not as appropriate).
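A rough sketch of what that detection could look like is below. The helpers wrap_source and convert_kwarg are hypothetical names invented for this illustration, not DataFrames.jl internals, and the single-column check is simplified to Symbol, string, or integer selectors rather than the package's actual ColumnIndex type.

```julia
using DataFrames

# Hypothetical helpers for illustration only; not part of DataFrames.jl.
# A single column selector is passed through unchanged...
wrap_source(src::Union{Symbol, AbstractString, Integer}) = src
# ...while any other selector (e.g. a vector of names) is wrapped in AsTable,
# so the function receives a NamedTuple of columns as it did before.
wrap_source(src) = AsTable(src)

# Turn an old keyword-style `target = source => fun` specification
# into the new `source => fun => :target` form.
convert_kwarg(target::Symbol, spec::Pair) = wrap_source(first(spec)) => last(spec) => target

df = DataFrame(id = [1, 1, 2, 2], v1 = [1, 5, 3, 4], v2 = [2, 0, 1, 6])
gd = groupby(df, :id)

# Multiple source columns: the old-style one-argument function keeps working.
multi = combine(gd, convert_kwarg(:range_v1_v2, [:v1, :v2] => (nt -> maximum(nt.v1) - minimum(nt.v2))))

# Single source column: no AsTable wrapping is applied.
single = combine(gd, convert_kwarg(:max_v1, :v1 => maximum))
```

Using dispatch instead of an inline comprehension keeps the conversion readable, though producing an accurate warning message for every mixed case would still need extra work.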
Alternatively we could not fix it but rather print an additional message saying what I have written above. It should be good enough as we will error later so at least the user will not silently get a wrong result.
Getting old code to work would already be very nice, even if the warning doesn't show you the exact code you need to write to use the new syntax.
OK - I will give it a stab.
This PR leads to a significant speed-up. See https://h2oai.github.io/db-benchmark/history.html for exact differences. Query times on the 1e9 data were reduced by up to 4 times in some cases.
As a result, none of the scripts is now killed due to the timeout defined for the script. The only scripts that still fail do so due to OutOfMemoryError.
G1_1e9_1e1_0_0
ERROR: LoadError: OutOfMemoryError()
Stacktrace:
[1] Array at ./boot.jl:405 [inlined]
[2] copy(::CSV.Column{Float64,Float64}) at /home/jan/.julia/packages/CSV/vyG0T/src/tables.jl:60
[3] DataFrame(::Array{AbstractArray{T,1} where T,1}, ::DataFrames.Index; copycols::Bool) at /home/jan/.julia/packages/DataFrames/3ZmR2/src/dataframe/dataframe.jl:148
[4] DataFrame(::Array{Union{CSV.Column, CSV.Column2},1}, ::Array{Symbol,1}; makeunique::Bool, copycols::Bool) at /home/jan/.julia/packages/DataFrames/3ZmR2/src/dataframe/dataframe.jl:242
[5] #DataFrame#85 at /home/jan/.julia/packages/CSV/vyG0T/src/CSV.jl:43 [inlined]
[6] DataFrame(::CSV.File{false}) at /home/jan/.julia/packages/CSV/vyG0T/src/CSV.jl:43
[7] top-level scope at /home/jan/git/db-benchmark/juliadf/groupby-juliadf.jl:26
[8] include(::Module, ::String) at ./Base.jl:377
[9] exec_options(::Base.JLOptions) at ./client.jl:288
[10] _start() at ./client.jl:484
in expression starting at /home/jan/git/db-benchmark/juliadf/groupby-juliadf.jl:26
G1_1e9_2e0_0_0
ERROR: LoadError: OutOfMemoryError()
Stacktrace:
[1] Array at ./boot.jl:405 [inlined]
[2] copy(::CSV.Column{Int64,Int64}) at /home/jan/.julia/packages/CSV/vyG0T/src/tables.jl:60
[3] DataFrame(::Array{AbstractArray{T,1} where T,1}, ::DataFrames.Index; copycols::Bool) at /home/jan/.julia/packages/DataFrames/3ZmR2/src/dataframe/dataframe.jl:148
[4] DataFrame(::Array{Union{CSV.Column, CSV.Column2},1}, ::Array{Symbol,1}; makeunique::Bool, copycols::Bool) at /home/jan/.julia/packages/DataFrames/3ZmR2/src/dataframe/dataframe.jl:242
[5] #DataFrame#85 at /home/jan/.julia/packages/CSV/vyG0T/src/CSV.jl:43 [inlined]
[6] DataFrame(::CSV.File{false}) at /home/jan/.julia/packages/CSV/vyG0T/src/CSV.jl:43
[7] top-level scope at /home/jan/git/db-benchmark/juliadf/groupby-juliadf.jl:26
[8] include(::Module, ::String) at ./Base.jl:377
[9] exec_options(::Base.JLOptions) at ./client.jl:288
[10] _start() at ./client.jl:484
in expression starting at /home/jan/git/db-benchmark/juliadf/groupby-juliadf.jl:26
J1_1e9_NA_0_0
(this is join, not groupby)
ERROR: LoadError: OutOfMemoryError()
Stacktrace:
[1] Array at ./boot.jl:405 [inlined]
[2] rehash!(::Dict{String,UInt64}, ::Int64) at ./dict.jl:192
[3] _setindex! at ./dict.jl:367 [inlined]
[4] getref!(::Dict{String,UInt64}, ::CSV.PointerString, ::Array{UInt64,1}, ::Int64, ::Int16, ::Parsers.Options{false,false,true,false,Missing,UInt8,Nothing}) at /home/jan/.julia/packages/CSV/vyG0T/src/file.jl:840
[5] parsepooled!(::Int8, ::Array{UInt64,1}, ::Array{UInt8,1}, ::Int64, ::Int64, ::Parsers.Options{false,false,true,false,Missing,UInt8,Nothing}, ::Int64, ::Int64, ::Int64, ::Float64, ::Array{Dict{String,UInt64},1}, ::Array{UInt64,1}, ::Array{Int8,1}, ::Array{Array{UInt64,1},1}) at /home/jan/.julia/packages/CSV/vyG0T/src/file.jl:862
[6] parserow at /home/jan/.julia/packages/CSV/vyG0T/src/file.jl:524 [inlined]
[7] parsetape(::Val{false}, ::Int64, ::Dict{Int8,Int8}, ::Array{Array{UInt64,1},1}, ::Array{Array{UInt64,1},1}, ::Array{UInt8,1}, ::Int64, ::Int64, ::Int64, ::Array{Int64,1}, ::Float64, ::Array{Dict{String,UInt64},1}, ::Array{UInt64,1}, ::Int64, ::Array{Int8,1}, ::Array{Int64,1}, ::Bool, ::Parsers.Options{false,false,true,false,Missing,UInt8,Nothing}, ::Nothing) at /home/jan/.julia/packages/CSV/vyG0T/src/file.jl:455
[8] CSV.File(::String; header::Int64, normalizenames::Bool, datarow::Int64, skipto::Nothing, footerskip::Int64, limit::Int64, transpose::Bool, comment::Nothing, use_mmap::Bool, ignoreemptylines::Bool, threaded::Nothing, select::Nothing, drop::Nothing, missingstrings::Array{String,1}, missingstring::String, delim::Nothing, ignorerepeated::Bool, quotechar::Char, openquotechar::Nothing, closequotechar::Nothing, escapechar::Char, dateformat::Nothing, dateformats::Nothing, decimal::UInt8, truestrings::Array{String,1}, falsestrings::Array{String,1}, type::Nothing, types::Nothing, typemap::Dict{Int8,Int8}, categorical::Bool, pool::Bool, strict::Bool, silencewarnings::Bool, debug::Bool, parsingdebug::Bool) at /home/jan/.julia/packages/CSV/vyG0T/src/file.jl:252
[9] top-level scope at /home/jan/git/db-benchmark/juliadf/join-juliadf.jl:30
[10] include(::Module, ::String) at ./Base.jl:377
[11] exec_options(::Base.JLOptions) at ./client.jl:288
[12] _start() at ./client.jl:484
in expression starting at /home/jan/git/db-benchmark/juliadf/join-juliadf.jl:30
joins are on a to-do list to get optimized (we have not touched them yet).
The two other errors are CSV.jl related (@quinnj - memory management when reading CSVs is a long-pending issue that could hopefully be improved in comparison to other frameworks - do you see any chance for this in the near future?).
Have you tried to load the dataset with CSVFiles.jl? load(filename) |> DataFrame. At least in my experience it deals pretty well with very large files that end up being close to the memory available on the system.
@davidanthoff - it would be great if you could share the code that would do the job, as you know the package best (just please note that we specify column types on load). Thank you!
If you don't want to specify the column types, you just do:
df = load("foo.csv") |> DataFrame
If you want to specify column types, it would be like this:
load("foo.csv", colparsers=[String,Int,Union{Missing,Float64}]) |> DataFrame
There is no concept of pooling in CSVFiles.jl, so you would need to convert any column you want pooled after the reading to a categorical representation.
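As a small sketch of that conversion step, assuming the CategoricalArrays.jl package is installed: the data frame below is a made-up stand-in for the result of load("foo.csv") |> DataFrame, and the column name id1 is only illustrative.

```julia
using DataFrames, CategoricalArrays

# Stand-in for the data frame returned by `load("foo.csv") |> DataFrame`.
df = DataFrame(id1 = ["id002", "id001", "id002"], v1 = [1, 2, 3])

# Replace the plain string column with a categorical (pooled) representation.
df.id1 = categorical(df.id1)
```

Note that the full string column is materialized before the conversion, so this does not reduce peak memory use during reading.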
I have updated the code to the latest release of DataFrames.jl (and the current stable Julia version). Hopefully I have not messed anything up.
CC @nalimilan