Add total time to summary table, and more.

mattdowle commented 5 years ago

The new table is very nice! Minor tweaks :

[x] add total time row to the bottom of the summary table, in the same spirit as the total time already at the top of the barplot.
[x] move text "according to the following pattern G1[in-rows][k-cardinality-factor][NA-pct][is-sorted].csv." down to just above the summary table (i.e. where the reader needs it). Or, even better, split the first coded column into separate columns: [rows, K, nas, issorted] so the reader of the page doesn't need to understand the pattern. Also use 2 not 2e0 and 10 rather than 1e1, to make it easier and neater. To avoid adding a note explaining what K is, change "K" to "group size (rows)" in the column heading.
[x] add a blank line between the barplot image and the table.
[x] what does the 1 mean in "G1" ? It isn't question 1 because all 5 questions appear in the question column but all rows in the table are marked G1.
[x] is pandas missing because the 1e9 data won't load? Is there any hope of getting it to load from a binary file? I imagine a common thought from readers of this summary table will be: what about pandas? Another approach might be to reduce the size from 1e9 to 5e8 so pandas can appear there. Since the size is a constant in that summary table (1e9), it could be a different constant (5e8). That table is concerned with varying things other than size, so it doesn't need to be the largest size. Reducing to 5e8 would also help with the overall runtime of db-bench. (Still keep the barplots at 1e9, assuming the data load problem can be overcome.) At the very least, why pandas isn't in the summary table should be explained as a note somewhere on the page.
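To illustrate the column split suggested above, here is a minimal Python sketch. It assumes the coded names are underscore-separated, e.g. `G1_1e9_2e0_0_0.csv`; that separator, the function name `parse_data_name` and the output keys are hypothetical and not taken from the db-benchmark code.

```python
def parse_data_name(name):
    """Split a coded dataset name into separate, readable fields.

    Assumes an underscore-separated layout such as 'G1_1e9_2e0_0_0.csv'
    (hypothetical; adjust the separator to the real naming scheme).
    """
    # Drop the file extension if present.
    stem = name[:-4] if name.endswith(".csv") else name
    task, rows, k, na_pct, is_sorted = stem.split("_")
    return {
        "task": task,
        # int(float("1e9")) turns scientific notation into a plain integer,
        # so "2e0" becomes 2 and "1e1" becomes 10.
        "rows": int(float(rows)),
        "k": int(float(k)),
        "na_pct": int(na_pct),
        "is_sorted": is_sorted == "1",
    }
```

With fields like these, each one can become its own column in the table, and the pattern text is no longer needed on the page.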

jangorecki commented 5 years ago

what does the 1 mean in "G1" ? It isn't question 1 because all 5 questions appear in the question column but all rows in the table are marked G1.

I don't remember at which point the dataset name G1_N_K.csv appeared. My intuition is that it stands for the "first" dataset, with G for "groupby".

jangorecki commented 5 years ago

is the reason pandas missing because 1e9 data won't load? Is there any hope in getting it to load from binary file? I imagine a common thought from readers of this summary table will be -- what about pandas

feather and jay didn't help. There is still pickle to try, but I don't think it will change anything. #47 will have the up-to-date status of that.

Reduce the size from 1e9 to 5e8 so pandas can appear there.

Unless it can't handle that either. Pandas is there on 1e8. I don't think we should make such a big exception just because of the memory inefficiency of pandas. We already dropped the 2e9 dataset to ease issues like this. As you said before, if something is not able to do it within the memory constraints we have (125GB), it should be marked as a failure. They are going to release 0.24.0 soon, maybe that will help.

At the very least, why pandas isn't in the summary table should be explained as a note somewhere on the page.

Good question. I could add it as a column of NAs, but on the other hand that is a waste of space. @mattdowle let me know if you want NA columns there; if not, we can close this issue already.

jangorecki commented 5 years ago

add total time row to the bottom of the summary table, in the same spirit as the total time already at the top of the barplot.

Two levels of totals added: a subtotal for each dataset and a grand total for all. Using rollup from the groupingsets family.
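For readers unfamiliar with rollup, the two levels of totals can be sketched in plain Python as below. This is only an illustration of the idea, not the code used in the report; the function name `rollup_totals` and the `(data, question, seconds)` row layout are assumptions.

```python
def rollup_totals(timings):
    """Append a subtotal row per dataset and one grand total row.

    `timings` is a list of (data, question, seconds) tuples (assumed layout).
    Subtotal rows have question=None; the grand total row has data=None too.
    """
    # Group detail rows by dataset, keeping first-seen order.
    by_data = {}
    for data, question, seconds in timings:
        by_data.setdefault(data, []).append((question, seconds))

    out = []
    grand_total = 0.0
    for data, rows in by_data.items():
        subtotal = 0.0
        for question, seconds in rows:
            out.append((data, question, seconds))
            subtotal += seconds
        out.append((data, None, subtotal))   # level 1: per-dataset subtotal
        grand_total += subtotal
    out.append((None, None, grand_total))    # level 2: grand total
    return out
```

Marking the aggregated fields as `None` mirrors how rollup-style output typically shows missing values in the columns that were aggregated away.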

mattdowle commented 5 years ago

@jangorecki you wrote I said :

As you said before, if something is not able to do it with memory constraints we have (125GB) it should be marked as failure.

Yes, that's right. But we don't have to stick to the largest size all the way through everything. It should be marked as a failure somewhere (e.g. the barplots show 1e9 and a fail there for pandas, and that should not be changed). But once db-bench has made its point, it shouldn't labor that point; it can use a smaller size in the summary table.

I don't think we should make such a big exception just because of memory inefficiency on pandas.

I think we should for the summary table please (reduce n to 5e8 or 1e8 for that table). Not for the barplot though : that should keep 1e9 size. Then pandas can appear in the table and we gain some insight. Currently we are not gaining as much insight because pandas is missing. You can write as a comment that the smaller 5e8 size (or 1e8 size) is chosen to accommodate pandas.

Or, have two summary tables of course: both 1e8 (or 5e8) and 1e9. But I just had limited resources in mind by suggesting using 1e8 (or 5e8) instead.

(Aside: it's not really a "summary" table, is it? It's not a summary of the barplot results, for instance. It's more of a detail table, exploring the K and issorted parameters. Not sure of a good name for that. "Stress-test" table, since K is taken down to 2, for example?)

jangorecki commented 5 years ago

It should be marked as failure somewhere

done

reduce n to 5e8 or 1e8 for that table... Or, have two summary tables of course: both 1e8 (or 5e8) and 1e9.

That is exactly what we have now: when you switch the tab to 5GB, 1e8 has its own table.

it's not really a "summary"

renamed to "full timings table"

If we want to add the 5e8 data size, let's open a new issue for that, as it requires changes in multiple places that are irrelevant to the report improvements discussed here. As the user can easily access the tables for 1e8, and it is explained why solutions might be missing from the table, I think having 5e8 is not that necessary. I checked python and it handles 1e9 with k=100 and k=10. k=2 will finish but will exceed 1h for the benchmark timings alone, not even counting the data loading time.

h2oai / db-benchmark

Add total time to summary table, and more. #55