In my experiments comparing the performance of pandas vs datatable, I noticed the speed differential depends strongly on the cardinality of the groupby column (K):
Given this, it might be interesting to compare several modes of groupby (a data-generation sketch follows the list):

- K ~ N (for example, K = N/10)
- K ~ N^½
- K ~ 1 (i.e. K remains constant as N grows)

Also, we can probably perform more measurements than just N = 1e7, 1e8, 1e9 -- measuring at intermediate levels of N will allow us to draw charts that are smoother, have more detail, and therefore look more respectable.
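To make the three regimes concrete, here is a minimal sketch (not db-bench's actual generator; names and sizes are illustrative) of producing groupby test data for each regime with numpy/pandas:

```python
# A minimal sketch (not db-bench's actual generator) of producing groupby
# test data for the three K regimes above, using numpy/pandas.
import numpy as np
import pandas as pd

def make_groupby_data(N, K, seed=42):
    """Generate N rows with a grouping column of roughly K distinct values."""
    rng = np.random.default_rng(seed)
    return pd.DataFrame({
        "id": rng.integers(0, K, size=N),  # grouping column, cardinality ~K
        "v1": rng.random(size=N),          # value column to aggregate
    })

N = 10**7
regimes = {
    "K ~ N":     N // 10,       # e.g. K = N/10
    "K ~ N^1/2": int(N ** 0.5),
    "K ~ 1":     100,           # K stays constant as N grows
}
for name, K in regimes.items():
    df = make_groupby_data(N, K)
    df.groupby("id")["v1"].sum()  # time this line to compare regimes
```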
I added tests for K=2 and K=10; the resulting numbers of groups are as follows:
| K| in_rows| q1| q2| q3| q4| q5|
|---:|-------:|---:|-----:|---------:|---:|---------:|
| 2| 1e+07| 2| 4| 4323484| 2| 4323579|
| 2| 1e+08| 2| 4| 43231389| 2| 43232226|
| 2| 1e+09| 2| 4| 431884560| 2| 431876300|
| 10| 1e+07| 10| 100| 999951| 10| 999969|
| 10| 1e+08| 10| 100| 9999518| 10| 9999512|
| 10| 1e+09| 10| 100| 99995425| 10| 99995357|
| 100| 1e+07| 100| 10000| 100000| 100| 100000|
| 100| 1e+08| 100| 10000| 1000000| 100| 1000000|
| 100| 1e+09| 100| 10000| 10000000| 100| 10000000|
The bigger K is, the smaller the variance between the different questions in the number of output groups, and therefore also in timing.
@pasha wrote:

> measuring at intermediate levels of N will allow us to draw charts that are smoother, have more detail, and therefore look more respectable.
Yes, in an ideal world. But note that db-bench already takes 14 hours to run with just 3 sizes of N.
For example, here is the smooth plot of fsort performance on X1 done last year by Patrick with 10 values of N. It took two days of runtime and cost $500. Those are the practical considerations.
wrong pasha @st-pasha ⬆️
Oops, I need another coffee. Thanks @pasha.
@mattdowle Certainly, runtime cost is an important variable to consider. For a test case that is linear in N, the total cost will be quadratic in N. For example, in the benchmark above, reducing the upper limit from 900GB to 50GB (which is the upper limit for our groupby challenge) would reduce the total cost to just $1.50 -- already a much more modest amount.
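A quick way to see the quadratic scaling (my reading of the argument, assuming each run's cost is linear in its size and sizes are spaced at a fixed step ΔN up to N_max):

```latex
\text{total cost} \;=\; \sum_{i=1}^{m} c \cdot i\,\Delta N
\;=\; c\,\Delta N\,\frac{m(m+1)}{2}
\;\approx\; \frac{c\,N_{\max}^{2}}{2\,\Delta N},
\qquad m = \frac{N_{\max}}{\Delta N}.
```

Consistently, (900/50)² = 324 and $500/324 ≈ $1.50, matching the figure above.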
Generally, we could figure out other ways of reducing the cost. For example, if pandas's latest version is from 2018-08-04 (as of today), then we could have avoided re-running pandas benchmarks every day for the past 90 days. I would suspect that keeping accurate track of what does and doesn't need to be re-run would allow us to reduce the total (amortized) cost by a factor of 100 or so (especially if pandas is slower than other solutions and so would cost more). Or, alternatively, it would allow us to benchmark more things every day.
I watched your London presentation today, and I like how you put it that this benchmark is a service to all developers out there; and that there really isn't anything comparable to it. It would be great to put more effort into improving this service: diversify things that are measured, improve the presentation, documentation, spend more effort promoting it, etc.
> Generally, we could figure out other ways of reducing the cost. For example, if pandas's latest version is from 2018-08-04 (as of today), then we could have avoided re-running pandas benchmarks every day for the past 90 days. I would suspect that keeping accurate track of what does and doesn't need to be re-run would allow us to reduce the total (amortized) cost by a factor of 100 or so (especially if pandas is slower than other solutions and so would cost more). Or, alternatively, it would allow us to benchmark more things every day.
Yes, we decided to do that recently, but it is not trivial: it has to take into account that the current run might be testing new data (different K factor, in_rows, NA percentage), a new question, or eventually a `fun` field (although that is currently not used). Only if all the data and questions currently attempted for a solution were already run on exactly the same commit/version can the run be skipped. It might also happen that only the first run succeeded and the second timing is missing. Quite a lot of bookkeeping (a sketch follows below).
We also need a lock file: total benchmark batch time can vary, so we have to ensure only one batch is running at a time.
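A minimal sketch of that bookkeeping, assuming a hypothetical timings log keyed by (solution, version, data, question) and the two-runs convention; the helper names here are illustrative, not db-bench's actual API:

```python
# Hypothetical bookkeeping sketch; function and variable names are
# illustrative, not db-bench's actual API.
import os
import sys

def already_done(timings, solution, version, data, question, runs=2):
    """True only if both timing runs exist for this exact commit/version."""
    key = (solution, version, data, question)
    return len(timings.get(key, [])) >= runs  # e.g. second timing may be missing

def remaining_work(timings, solution, version, planned):
    """Keep only (data, question) pairs that still need running.

    A solution is skipped entirely only when every planned pair already ran
    on exactly the same commit/version; new data (different K, in_rows, NA
    percentage) or a new question forces a re-run.
    """
    return [(data, question) for (data, question) in planned
            if not already_done(timings, solution, version, data, question)]

# Lock file: batch duration varies, so ensure only one batch runs at a time.
LOCK = "benchmark.lock"
if os.path.exists(LOCK):
    sys.exit("another benchmark batch is already running")
open(LOCK, "w").close()
try:
    pass  # ... run the pairs returned by remaining_work() here ...
finally:
    os.remove(LOCK)
```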
> I watched your London presentation today, and I like how you put it that this benchmark is a service to all developers out there; and that there really isn't anything comparable to it. It would be great to put more effort into improving this service: diversify things that are measured, improve the presentation, documentation, spend more effort promoting it, etc.
Ideally, developers of the tools would review or PR the code for their solutions.
As for promotion, I believe adding more tasks like join, sort and others could help. Not everyone needs to perform grouping.
@st-pasha

> For example, in the benchmark above, reducing the upper limit from 900GB to 50GB (which is the upper limit for our groupby challenge) would reduce the total cost to just $1.50 -- already a much more modest amount.
True. But then it wouldn't test 900GB. The very point was to test as big as we could achieve: to see if sort would even work at the 900GB size in R, and then to compare it to something else that worked at that size too (Intel Threading Building Blocks). It might be possible to run a 1.8TB test since the EC2 X1 has 2TB of RAM. But currently fsort writes to a full new result, so enough RAM for one full copy of the input is needed. That's one of several downsides of fsort and is something that I wanted to measure. Ideally, the chart's x-axis I showed above should have gone up to 2TB, and maybe TBB would have worked where data.table::fsort would have failed. But I've only just thought of that.
> For example, if pandas's latest version is from 2018-08-04 (as of today), then we could have avoided re-running pandas benchmarks every day for the past 90 days.
True. But the idea of db-bench is to test the very latest daily dev versions of each product. I don't know why pandas was so stale at that stage; I guess it was an issue Jan was having building pandas-dev, because they don't make a daily binary available whereas we do for R data.table. The goal is a rough idea of the very latest daily dev version of each product (including where the fail point limit is), rather than a very precise view of an old stale version of each product.
The first task within this issue has been shifted to #60, so I am closing this issue as the other tasks have already been accomplished.
Currently, among the 5 questions, 4 group by a single column and one by 2 columns. We need more complex grouping queries that group by most of the columns, especially the id* columns, aggregating v1, v2, v3 (the process of producing a data mart); a sketch of such a query follows below.
Prerequisite: https://github.com/h2oai/datatable/issues/1082
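A sketch of such a data-mart query in pandas, assuming the benchmark's usual id1..id6 / v1..v3 column layout (an assumption on my part; adjust to the actual schema):

```python
# A sketch (not the benchmark's actual query) of grouping by most columns.
# The id1..id6 / v1..v3 layout is assumed from db-bench's groupby data.
import pandas as pd

def q_data_mart(df: pd.DataFrame) -> pd.DataFrame:
    return (df
            .groupby(["id1", "id2", "id3", "id4", "id5", "id6"])
            .agg(v1_sum=("v1", "sum"),
                 v2_mean=("v2", "mean"),
                 v3_sum=("v3", "sum"))
            .reset_index())
```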