h2oai / db-benchmark

reproducible benchmark of database-like ops
https://h2oai.github.io/db-benchmark
Mozilla Public License 2.0
324 stars 87 forks

add more complex groupby questions #12

Closed jangorecki closed 5 years ago

jangorecki commented 6 years ago

Currently, among the 5 questions, 4 group by a single column and one groups by 2 columns. We need more complex grouping queries that group by most of the columns, especially:

prerequisite: https://github.com/h2oai/datatable/issues/1082
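For context, here is a minimal data.table sketch of the shape of the current questions next to a more complex one (column names follow the benchmark's data layout; the aggregations are illustrative rather than the benchmark's exact definitions):

```r
library(data.table)
set.seed(1)
N <- 1e6L; K <- 100L
DT <- data.table(
  id1 = sample(sprintf("id%03d", 1:K), N, TRUE),       # low cardinality: K groups
  id2 = sample(sprintf("id%03d", 1:K), N, TRUE),
  id3 = sample(sprintf("id%010d", 1:(N/K)), N, TRUE),  # high cardinality: ~N/K groups
  v1  = sample(5L, N, TRUE)
)
DT[, .(v1 = sum(v1)), by = id1]                           # single column (4 of 5 questions)
DT[, .(v1 = sum(v1)), by = .(id1, id2)]                   # two columns (1 of 5 questions)
DT[, .(v1 = sum(v1), count = .N), by = .(id1, id2, id3)]  # "group by most columns"
```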

st-pasha commented 6 years ago

In my experiments comparing the performance of pandas vs datatable, I noticed that the speed differential depends strongly on the cardinality of the groupby column (K):

[figure: groupby-ratio — pandas vs datatable groupby timing ratio as a function of K]

Given this, it might be interesting to compare several modes of groupby:

Also, we can probably perform more measurements than just N=1e7, 1e8, 1e9 -- measuring at intermediate values of N would let us draw charts that are smoother, show more detail, and therefore look more respectable.
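To make those modes concrete, a sketch that varies K (an assumed generator shape, mirroring but not necessarily identical to the benchmark's) and times the same query at each cardinality:

```r
library(data.table)
set.seed(42)
N <- 1e6L
# time one groupby query on a column with exactly K candidate group values
time_groupby <- function(N, K) {
  DT <- data.table(id = sample(sprintf("id%09d", 1:K), N, TRUE), v1 = runif(N))
  system.time(DT[, .(v1 = sum(v1)), by = id])[["elapsed"]]
}
# few huge groups ... many tiny groups
sapply(c(2L, 100L, 10000L, N %/% 10L), function(K) time_groupby(N, K))
```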

jangorecki commented 5 years ago

I added tests for K=2 and K=10; the resulting numbers of groups are as follows:

|   K| in_rows|  q1|    q2|        q3|  q4|        q5|
|---:|-------:|---:|-----:|---------:|---:|---------:|
|   2|   1e+07|   2|     4|   4323484|   2|   4323579|
|   2|   1e+08|   2|     4|  43231389|   2|  43232226|
|   2|   1e+09|   2|     4| 431884560|   2| 431876300|
|  10|   1e+07|  10|   100|    999951|  10|    999969|
|  10|   1e+08|  10|   100|   9999518|  10|   9999512|
|  10|   1e+09|  10|   100|  99995425|  10|  99995357|
| 100|   1e+07| 100| 10000|    100000| 100|    100000|
| 100|   1e+08| 100| 10000|   1000000| 100|   1000000|
| 100|   1e+09| 100| 10000|  10000000| 100|  10000000|

The bigger K is, the less variance there is between the different questions in the number of output groups, and therefore in timings as well.
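Those q3/q5 counts are consistent with a column drawn uniformly with replacement from N/K candidate values (an assumption about the data generator, though the numbers line up): the expected number of distinct groups is then m(1-(1-1/m)^N) ≈ m(1-e^(-N/m)), which with m = N/K becomes (N/K)(1-e^(-K)).

```r
# expected distinct values when drawing N samples with replacement from N/K
# candidates: (N/K) * (1 - exp(-K))
expected_groups <- function(N, K) (N / K) * (1 - exp(-K))
expected_groups(1e7, 2)    # ~4,323,324  (observed q3: 4,323,484)
expected_groups(1e7, 10)   # ~  999,955  (observed q3:   999,951)
expected_groups(1e7, 100)  # ~  100,000  (observed q3:   100,000)
```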

mattdowle commented 5 years ago

@pasha wrote

> measuring at intermediate values of N would let us draw charts that are smoother, show more detail, and therefore look more respectable.

Yes, in an ideal world. But note that db-bench already takes 14 hours to run with just 3 sizes of N.

For example, here is the smooth plot of fsort performance on X1 that Patrick produced last year using 10 values of N. It took two days of runtime and cost $500. Those are the practical considerations.

[figure: fsort_on_x1 — fsort runtime on EC2 X1 across 10 values of N]
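For reference, a toy sketch of how such a sweep can be scripted (data.table::fsort is the parallel sort being plotted; the X1 run used a far larger grid of N):

```r
library(data.table)
set.seed(1)
# time fsort over a grid of N values to get a smooth curve
Ns <- as.integer(10^seq(5, 7, by = 0.5))
elapsed <- sapply(Ns, function(N) {
  x <- runif(N)                              # fsort operates on doubles
  system.time(fsort(x))[["elapsed"]]
})
data.table(N = Ns, seconds = elapsed)
```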

pasha commented 5 years ago

wrong pasha @st-pasha ⬆️

mattdowle commented 5 years ago

Oops, I need another coffee. Thanks @pasha.

st-pasha commented 5 years ago

@mattdowle Certainly, runtime cost is an important variable to consider. For a test case whose runtime is linear in N, the total cost of a measurement sweep is quadratic in the upper limit of N. For example, in the benchmark above, reducing the upper limit from 900GB to 50GB (the upper limit of our groupby challenge) would reduce the total cost to just about $1.50 -- already a much more modest amount.
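The arithmetic behind that figure, assuming the measurement grid scales with the upper limit so that total cost grows with its square:

```r
# $500 for a sweep with a 900GB upper limit, rescaled to a 50GB upper limit
500 * (50 / 900)^2  # ~1.54, i.e. about $1.50
```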

Generally, we could figure out other ways of reducing the cost. For example, if pandas's latest version is from 2018-08-04 (as of today), then we could have avoided re-running the pandas benchmarks every day for the past 90 days. I suspect that keeping accurate track of what does and does not need to be re-run would reduce the total (amortized) cost by a factor of 100 or so (especially since pandas is slower than other solutions and so costs more to run). Or, alternatively, it would let us benchmark more things every day.

I watched your London presentation today, and I like how you put it: this benchmark is a service to all developers out there, and there really isn't anything comparable to it. It would be great to put more effort into improving this service: diversify the things that are measured, improve the presentation and documentation, and spend more effort promoting it.

jangorecki commented 5 years ago

> Generally, we could figure out other ways of reducing the cost. For example, if pandas's latest version is from 2018-08-04 (as of today), then we could have avoided re-running the pandas benchmarks every day for the past 90 days. I suspect that keeping accurate track of what does and does not need to be re-run would reduce the total (amortized) cost by a factor of 100 or so (especially since pandas is slower than other solutions and so costs more to run). Or, alternatively, it would let us benchmark more things every day.

Yes, we decided to do that recently, but it is not trivial. The check has to take into account that the current run might be testing new data (a different K factor, in_rows, or NA percentage), a new question, or possibly a new `fun` field (although that is currently not used). Only if all the data and questions currently being attempted for a solution were already run on exactly the same commit/version can the run be skipped. It might also happen that only the first run succeeded and the second timing is missing. That is quite a lot of bookkeeping. We also need a lock file: total benchmark batch time can vary, so we must ensure only one batch is running at a time.
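A sketch of what that bookkeeping could look like (hypothetical column and file names; the real logic lives in the benchmark's launcher and timing logs):

```r
# a run for a solution can be skipped only if every (data, question) pair
# being attempted now already has BOTH timing runs recorded for exactly the
# same version/commit of that solution
can_skip <- function(timings, slt, ver, data_cases, questions) {
  need <- expand.grid(data = data_cases, question = questions, run = 1:2,
                      stringsAsFactors = FALSE)
  have <- unique(timings[timings$solution == slt & timings$version == ver,
                         c("data", "question", "run")])
  nrow(merge(need, have)) == nrow(need)
}

# single-writer lock: batch duration varies, so only one batch may run at once
lock_path <- "run.lock"
if (file.exists(lock_path)) stop("another benchmark batch is already running")
file.create(lock_path)
# ... run (or skip) each solution ...
file.remove(lock_path)
```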

> I watched your London presentation today, and I like how you put it: this benchmark is a service to all developers out there, and there really isn't anything comparable to it. It would be great to put more effort into improving this service: diversify the things that are measured, improve the presentation and documentation, and spend more effort promoting it.

Ideally, the developers of the tools themselves would review or submit PRs for the code of their solutions. As for promotion, I believe adding more tasks like join, sort, and others would help; not everyone needs to perform grouping.

mattdowle commented 5 years ago

@st-pasha

> For example, in the benchmark above, reducing the upper limit from 900GB to 50GB (the upper limit of our groupby challenge) would reduce the total cost to just about $1.50 -- already a much more modest amount.

True. But then it wouldn't test 900GB. The whole point was to test at the largest size we could achieve: to see whether sort would even work at the 900GB size in R, and then to compare it to something else that also worked at that size (Intel Threading Building Blocks). It might be possible to run a 1.8TB test, since the EC2 X1 has 2TB of RAM. But currently fsort writes to a full new result, so enough RAM for one full copy of the input is needed: a 1.8TB input would need about 3.6TB, while an in-place sort would not. That's one of several downsides of fsort, and something I wanted to measure. Ideally, the x-axis of the chart I showed above would have gone up to 2TB, and maybe TBB would have worked where data.table::fsort failed. But I've only just thought of that.

> For example, if pandas's latest version is from 2018-08-04 (as of today), then we could have avoided re-running the pandas benchmarks every day for the past 90 days.

True. But the idea of db-bench is to test the very latest daily dev version of each product. I don't know why pandas was so stale at that point; I guess Jan was having an issue building pandas-dev, because they don't make a daily binary available, whereas we do for R data.table. The goal is a rough picture of the very latest daily dev version of each product (including where its fail point is), rather than a very precise view of an old, stale version.

jangorecki commented 5 years ago

The first task within this issue has been moved to #60, so I am closing this issue; the other tasks have already been accomplished.