h2oai / db-benchmark

reproducible benchmark of database-like ops
https://h2oai.github.io/db-benchmark
Mozilla Public License 2.0

review groupby advanced questions #117

Closed jangorecki closed 4 years ago

jangorecki commented 4 years ago

In https://github.com/h2oai/db-benchmark/issues/60 we proposed new advanced groupby questions. Five new questions have been added, q6-q10, stressing some more advanced functionality. Four of those five were focused on stressing the expression evaluated by group. The fifth was focused on high-cardinality grouping rather than the actual by-group computation. So far, so good.

The problem is that four out of the five questions happen to use very low-cardinality grouping. All four group by id2, id4, which is low cardinality for all k cases. Moreover, it doesn't scale up with data size. We get the following numbers of groups for a 1e9-row input:

k=1e2 unsorted: 10'000 groups
k=1e1 unsorted: 4 groups
k=2e0 unsorted: 100 groups
k=1e2 sorted: 10'000 groups

The number of groups is constant across data sizes (1e7, 1e8, 1e9) for the corresponding k, while it should scale up with data size. Only id3 and id6 will scale here.

I propose changing the grouping columns in 2 or 3 questions so that a wider range of grouping cardinalities is explored across the different values of the k factor. @nalimilan @bkamins @mattdowle Comments are more than welcome!
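To make the scaling argument concrete, here is a minimal sketch in plain Julia. It assumes (this is inferred from the counts above, not stated in the thread) that id2 and id4 each take k distinct values while id3 takes n ÷ k distinct values. The function name `group_counts` and the deterministic column construction are hypothetical and are not the benchmark's data generator.

```julia
# Assumption (hypothetical): id2 and id4 each have k levels, id3 has n ÷ k levels.
# Columns are built deterministically for illustration only.
function group_counts(n::Int, k::Int)
    id2 = [mod1(i, k) for i in 1:n]
    id4 = [mod1(cld(i, k), k) for i in 1:n]
    id3 = [mod1(i, n ÷ k) for i in 1:n]
    low  = length(Set(zip(id2, id4)))  # distinct (id2, id4) pairs, capped at k^2
    high = length(Set(id3))            # distinct id3 values, grows with n
    return (low = low, high = high)
end

# Compare two data sizes for the same k.
for n in (10^5, 10^6)
    println(n, " rows: ", group_counts(n, 100))
end
```

Under these assumptions the (id2, id4) count stays at 10,000 for both sizes, while the id3 count grows tenfold with the row count.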

bkamins commented 4 years ago

I think it is OK.

As a side question - when you get an "out of memory" error, do you have information about the stage at which the process failed?

jangorecki commented 4 years ago

Yes, they are documented as comments in the code, in the benchplot-dict.R file, from which the actual exceptions are taken for benchplot: https://github.com/h2oai/db-benchmark/blob/7e178c1d2fb9102c8b12ac201f883981254a9df6/benchplot-dict.R#L146

bkamins commented 4 years ago

Thank you!

jangorecki commented 4 years ago

After amending the grouping columns as follows:

q6: median v3, sd v3 by id4, id5 (the columns changed, but the cardinality stays the same)
q7: max v1 - min v2 by id3
q8: largest two v3 by id6
q9: regression v1, v2 by id2, id4 (unchanged)

k=1e2 now gives the following numbers of groups, rather than just 10000:

q6: 10000
q7: 10000000
q8: 10000000
q9: 10000
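
As an illustration of what a by-group expression such as q7 computes, here is a minimal sketch in plain Julia without DataFrames. The function name `max_v1_minus_min_v2` and the toy values are hypothetical. The benchmark's actual solution scripts may express this differently.

```julia
# Hypothetical helper: max(v1) - min(v2) within each group, in one pass.
function max_v1_minus_min_v2(ids, v1, v2)
    acc = Dict{eltype(ids), Tuple{eltype(v1), eltype(v2)}}()
    for (k, a, b) in zip(ids, v1, v2)
        hi, lo = get(acc, k, (a, b))       # seed with the first row of the group
        acc[k] = (max(hi, a), min(lo, b))  # running max of v1, running min of v2
    end
    return Dict(k => t[1] - t[2] for (k, t) in acc)
end

# Toy data; column names follow the thread, values are made up.
id3 = ["a", "b", "a", "c", "b"]
v1  = [1, 5, 3, 2, 4]
v2  = [2, 1, 6, 3, 9]

res = max_v1_minus_min_v2(id3, v1, v2)
println(res)
```

With a high-cardinality key such as id3, most of the cost goes into maintaining one accumulator per group rather than into the arithmetic itself.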

nalimilan commented 4 years ago

Thanks for asking, but TBH I don't have a strong opinion.

jangorecki commented 4 years ago

In question 8 we are querying the top 2 rows by group. Before the change no group had just a single row; I think this is changing now:

ANS = by(x, [:id6], largest2_v3 = :v3 => x -> partialsort(x, 1:2, rev=true))
#BoundsError: attempt to access 1-element Array{Float64, 1} at index [1:2]

@nalimilan @bkamins any tips on how to handle that nicely, like head(., n=2) in R?

nalimilan commented 4 years ago

I guess something like this would do:

ANS = by(x, [:id6], largest2_v3 = :v3 => x -> partialsort(x, 1:min(2, length(x)), rev=true))
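
The clamped range can be tried on its own, without DataFrames. Below is a minimal sketch using only Base Julia. The helper name `largest_n` and the toy groups are hypothetical, introduced purely for illustration.

```julia
# Hypothetical helper: return up to n largest values, tolerating short groups.
function largest_n(v::AbstractVector, n::Integer = 2)
    isempty(v) && return similar(v, 0)
    # Clamp the upper bound so groups with fewer than n rows do not raise BoundsError.
    return partialsort(v, 1:min(n, length(v)), rev = true)
end

# Toy groups keyed by an id6-like value; the second group has a single row.
groups = Dict(1 => [0.5, 2.5, 1.5], 2 => [3.0])

for id in sort!(collect(keys(groups)))
    println(id, " => ", largest_n(groups[id]))
end
```

This mirrors the behaviour of head(., n=2) in R, which returns fewer elements when a group is shorter than n instead of raising an error.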