new more advanced groupby questions

jangorecki commented 5 years ago

As discussed in #59, #57 and #12 there are more questions that could, and should be presented on benchmark. I am putting those together in this issue. The plan is to add 5 more questions, proposed questions are:

 q6: median v3 sd v3 by id2 id4
 q7: max v1 - min v2 by id2 id4
 q8: largest two v3 by id2 id4
 q9: regression v1 v2 by id2 id4 # cor(x,y)^2
q10: sum v3 count by id1:id6

And some rationale behind those questions:

q6 median it is widely used in data science, is it much more tricky to implement, same about sd
q7 another pretty common question to calculate range between two variables by group
q8 top n by group, i.e. checking two largest transactions by customer and time
q9 regression by group
q10 is commonly used operation in data warehouses where you have to (sub)aggregate fact table to all existing dimensions

grouping in most of the new questions is by id2 id4 which are low cardinality fields, but combined will result low-to-medium cardinality group count.

If you have any objections please shout here.

bkamins commented 5 years ago

cov(x,y)^2 / (var(x) * var(y))

A small comment is that the above is the formula for R^2 in a simple linear regression. We could also write cor(x,y)^2 to get the same value, but the time will vary:

> x = 1:10^8
> y = 1:10^8+runif(10^8)
> system.time(cor(x,y)^2)
   user  system elapsed 
   0.96    0.11    1.06 
> system.time(cor(x,y)^2)
   user  system elapsed 
   0.92    0.12    1.04 
> system.time(cov(x,y)^2/(var(x)*var(y)))
   user  system elapsed 
   1.57    0.14    1.70

Not sure which will be better to benchmark. People probably would normally calculate cor so maybe stick with this, but the initial idea was to have some more complex function (but given the examples above maybe we should stick to what normally a user would do and simply calculate a correlation).

nalimilan commented 5 years ago

The standard deviation would also be interesting to compute, as it's a bit more tricky to implement a fast version than for sum yet it's a common operation. A related operation it standardizing by group (but it's separate since computing the standard deviation will likely be a small part of the work).

jangorecki commented 5 years ago

added sd to q6

jangorecki commented 5 years ago

@nalimilan @bkamins sorry for little off-topic but what is the canonical equivalent of the following way to measure timing?

t_start = time_ns();
ANS = do_something();
println(size(ANS)); flush(stdout);
t = (time_ns() - t_start)/1.0e9;

I don't want to print anything, just collect wall-clock in seconds into a variable

bkamins commented 5 years ago

use @elapsed like this:

julia> using DataFrames

julia> t = @elapsed DataFrame(rand(100, 100))
0.872615806

julia> t = @elapsed DataFrame(rand(100, 100))
0.000155245

(as you can see it will include precompilation time the first time some function is called but I guess this is what you want to capture and that is why you run all the benchmarks 2 times - right?)

bkamins commented 5 years ago

@jangorecki:

and what about multi-line expression, as the one above?

@elapsed begin
    ANS = do_something()
    println(size(ANS))
    flush(stdout)
end

or

@elapsed (ANS = do_something(); println(size(ANS)); flush(stdout))

or wrap it in a function

jangorecki commented 5 years ago

Thanks, that helps to make code much more compact. I added new questions for julia in https://github.com/h2oai/db-benchmark/commit/388bd8bf56ca32635b1d532227fc5d7f907aa1c1 I couldn't figured out how to make q7 max v1 - min v2 by id2 id4, trying by(x, [:id2, :id4], x->(range_v1_v2=max(x.v1)-min(x.v2))) but it seems that max and min don't want to accept vectors that might have missings. Could you take a look at this @bkamins ? Also the syntax suggested in https://github.com/JuliaData/DataFrames.jl/issues/1625#issuecomment-445550526 is no longer giving a name to resulting column, that affects q8 and q9.

bkamins commented 5 years ago

use:

by(x, [:id2, :id4], range_v1_v2 = (:v1, :v2) => x->maximum(skipmissing(x.v1))-minimum(skipmissing(x.v2)))

The syntax suggested in the comment uses automatic column naming. In order to specify the column name use the syntax:

by(data_frame, grouping_columns, output_column_name = input_columns => function_to_execute)

is the syntax clear?

nalimilan commented 5 years ago

Note that DataFramesMeta should provide a more intuitive syntax for this kind of thing once we port it to the recent DataFrames improvements. BTW, you may want to wait a bit before running the benchmarks again: once we merge https://github.com/JuliaData/DataFrames.jl/pull/1651, the timings for standard reductions should improve a lot.

jangorecki commented 5 years ago

Thanks for feedback, changes amended. Will run full benchmark after linked PR will be merge. It is possible to track timing over time on new and simple report: https://h2oai.github.io/db-benchmark/history.html

jangorecki commented 5 years ago

q6-q10 are published as a separate benchplot right below the q1-q5 benchplot, on the groupby report, so from main page one has to navigate via link to groupby.html.

dplyr and julia has not yet been run for new questions on 1e8 and 1e9, waiting for the following PRs to be merged

dplyr https://github.com/tidyverse/dplyr/pull/4091
juliadf https://github.com/JuliaData/DataFrames.jl/pull/1651 (merged but not yet released)

additionally some of dask timings should be considered as inaccurate because the answers they produces at the moment does not match other solutions, details in https://github.com/dask/dask/issues/4372

1e9_2e0_0_0.advanced plot has been currently skipped as there are only spark timings for it

nalimilan commented 5 years ago

The new PRs have been released in DataFrames.jl 0.17. It would be interesting to see the results.

jangorecki commented 5 years ago

@nalimilan it was running and just recently got published. We set 2 hours timeout for a single script. Julia is suffering from that because of reading files from csv is relatively time consuming.

nalimilan commented 5 years ago

Thanks! It's nice to see we're on par with other software on second run at least on the cases where we now have optimized code. Unfortunately, timings for the advanced questions are probably dominated by grouping (rather than combining) time, since they use columns that are not converted to categorical due to their high number of unique levels. The solution to that is to support PooledDataArrays -- I'll work on that.

The CSV.jl slowness is annoying (and unexpected). I'll try to file an issue against CSV.jl, but if you can help creating a reproducer that would be great!

Three small remarks:

The name "juliadf" is still used in plots, it would be nice to use DataFrames.jl instead.
The first run on the first plot takes very very long. That's probably because it includes time to compile general grouping and combining code which is shared across all test cases. That's a bit problematic since it's completely dependent on the order in which tests are run. OTOH this cost has to be paid at some point, so... We could try to distinguish the compilation cost which is common to all grouping operations from the one which will have to be paid again for each specific command, by running a totally unrelated operation before actual tests. Would that we cheating? :-)
Maybe sort results by timings on the second run? For most implementations it doesn't make a big difference, but Julia is almost always the last due to compilation, which may not reflect the performance people will get in practice since it should be common to run multiple operations.

jangorecki commented 5 years ago

@nalimilan

since they use columns that are not converted to categorical due to their high number of unique levels.

if that was the problem only about CSV reader getting stuck we should then convert those columns to categorical after reading, this is what we do for pandas.

if you can help creating a reproducer that would be great!

does this count?

wget https://raw.githubusercontent.com/h2oai/db-benchmark/master/groupby-datagen.R
Rscript groupby-datagen.R 1e7 1e2 0 0

The name "juliadf" is still used in plots, it would be nice to use DataFrames.jl instead.

Left margin has limited space. I tried to put that using two lines for text as Julia\nDataFrames but second word was still too big. If you are OK with just julia or DF.jl then I can easily change that.

The first run on the first plot takes very very long. That's probably because it includes time to compile general grouping and combining code which is shared across all test cases ... Would that we cheating?

Cheating would be looking at the data. We can use dummy 0 rows dataframe of exact structure, which is a metadata, not data itself. Run query on that to compile commands. Will that work? IMO this will not be cheating because we know upfront what questions we want to ask against data. Assuming the task does not meant to reflect interactive data queries scenario but more a predefined processing workflow scenario @mattdowle.

Maybe sort results by timings on the second run?

One of the planned improvements to data.table grouping is that it will create index when grouping (as it needs to find the order of groups anyway), thus second run will be significantly faster as it will know the groups already. First run seems to be then more fair, we just need to exclude compile time from actual query time.

nalimilan commented 5 years ago

if that was the problem only about CSV reader getting stuck we should then convert those columns to categorical after reading, this is what we do for pandas.

Yes that's the same problem. I'll tell you when we have fixed that.

does this count?

wget https://raw.githubusercontent.com/h2oai/db-benchmark/master/groupby-datagen.R
Rscript groupby-datagen.R 1e7 1e2 0 0

Yes. :-)

Left margin has limited space. I tried to put that using two lines for text as Julia\nDataFrames but second word was still too big. If you are OK with just julia or DF.jl then I can easily change that.

DF.jl sounds good!

Cheating would be looking at the data. We can use dummy 0 rows dataframe of exact structure, which is a metadata, not data itself. Run query on that to compile commands. Will that work? IMO this will not be cheating because we know upfront what questions we want to ask against data. Assuming the task does not meant to reflect interactive data queries scenario but more a predefined processing workflow scenario @mattdowle.

Yes, that would work. Anyway the data shouldn't make any difference for Julia, only the column types. So you could do dummydf = df[1:1, :] and run queries on that first. But even running an arbitrary query would already eliminate a large part of the overhead, which is shared across operations.

One of the planned improvements to data.table grouping is that it will create index when grouping (as it needs to find the order of groups anyway), thus second run will be significantly faster as it will know the groups already. First run seems to be then more fair, we just need to exclude compile time from actual query time.

Agreed.

BTW, we have spotted something weird with the reported times: the overhead of the first run (compared to the second one) increases with the size of the dataset. Since compilation time is constant, I would have expected the difference between runs to diminish dramatically when increasing the dataset size. I'll look at the code again, but if you have an idea...

jangorecki commented 5 years ago

@nalimilan re "BTW": no idea, but if you are not able to reproduce that please create issue for that as it might be something related to environment, so has to be tracked.

mattdowle commented 5 years ago

On the pings to me, I'd say if there's a way to compile Julia/DataFrame.jl up front then that would be very fair to do in db-bench. So the 1st run time for Julia would look much better. After all, when a user installs data.table in R, the ByteCompile flag is on and that's an upfront one-time cost on installation which is not included in R timings. However, if Julia is looking at the column types/data and compiling at that point, then users will experience that compilation cost. What we're trying to mimic and represent fairly is the user experience. In my own experience I rarely needed to compute the same answer many times. I wanted the answer once. Especially in production there was a task that ran once and finished. It was the first timing that was important to me as a user because it was the only run.

@nalimilan It looks like you've already reached the thought that maybe it might not be just compilation time hurting the first run time. I see a difference of 4 seconds for the 0.5GB size and a difference of 10 seconds for the 5GB size. That's what you're referring to in the BTW I take it. As Jan says, please let us know if you can't reproduce so then we know to look at the db-bench environment and investigate.

Just to say as well that we do want to do end-to-end benchmarks too; i.e. start a cold new instance of the language (R/Python/Julia), import packages, load a dataset, run some reasonable analysis, produce reasonable output and close. Reporting the total time and peak RAM usage of that process, end-to-end.

nalimilan commented 5 years ago

On the pings to me, I'd say if there's a way to compile Julia/DataFrame.jl up front then that would be very fair to do in db-bench. So the 1st run time for Julia would look much better. After all, when a user installs data.table in R, the ByteCompile flag is on and that's an upfront one-time cost on installation which is not included in R timings. However, if Julia is looking at the column types/data and compiling at that point, then users will experience that compilation cost. What we're trying to mimic and represent fairly is the user experience. In my own experience I rarely needed to compute the same answer many times. I wanted the answer once. Especially in production there was a task that ran once and finished. It was the first timing that was important to me as a user because it was the only run.

Things are actually a bit complex. It is currently possible to precompile Julia packages to native code, and save the result to a file. But this process isn't streamlined yet, so users have to do it manually. Also, most of the compilation cost will be paid only once for a broad range of different operations, so it could really be paid only once when installing the package. So it's not easy to decide whether it should be taken into account in benchmarks or not.

I'll investigate why the overhead increases with the size of the data set. There's definitely something unexpected going on.

jangorecki commented 5 years ago

@nalimilan @bkamins julia df 0.17 improvement is clearly visible on https://h2oai.github.io/db-benchmark/history.html

nalimilan commented 5 years ago

Thanks, that's very useful to track potential regressions!

BTW, could you also change that occurrence of "juliadf" to "DataFrames.jl"?

jangorecki commented 5 years ago

@nalimilan this is more an internal report, we don't link to it from anywhere. It's quality isn't yet ready for putting it to public. It was added so we can at least spot regression. I hope to make proper "history" report, then it will be definitely DataFrames.jl there :)

h2oai / db-benchmark

new more advanced groupby questions #60