h2oai / db-benchmark

reproducible benchmark of database-like ops
https://h2oai.github.io/db-benchmark
Mozilla Public License 2.0
326 stars · 88 forks

consider removing K=2 case dataset #65

Closed jangorecki closed 5 years ago

jangorecki commented 5 years ago

As of now, the most unbalanced cardinality test (K=2) is failing for most of the tools.

It is quite likely that even if pandas and dask could read those data, they would fail on question 3, as they don't scale well with cardinality.
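For context, K controls the number of distinct groups in the high-cardinality id columns of the groupby data (roughly N/K groups), so K=2 at 1e9 rows means an aggregation whose output is nearly as large as the input. Below is a minimal, scaled-down sketch (not a benchmark script); the exact column names and the "sum v1, mean v3 by id3" form of question 3 are assumptions for illustration.

```python
# Sketch only: shows why K=2 is the extreme case. With ~N/K distinct values in
# the grouping column, K=2 means the aggregation result is almost as big as the
# input table, which is what breaks tools that don't scale with cardinality.
import numpy as np
import pandas as pd

N = 10_000_000          # scaled down from 1e9 for illustration
K = 2                   # the unbalanced-cardinality case under discussion

df = pd.DataFrame({
    "id3": np.random.randint(0, N // K, N),   # ~N/K distinct groups
    "v1": np.random.randint(1, 6, N),
    "v3": np.random.uniform(0, 100, N),
})

# question-3-style aggregation (assumed form): sum v1, mean v3 by id3
ans = df.groupby("id3", as_index=False).agg(v1=("v1", "sum"), v3=("v3", "mean"))
print(len(ans))         # close to N/K rows: the output barely shrinks the data
```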

Unless we define exceptions in the launcher script, every upgrade of those tools (or a "force run") triggers another attempt to run those scripts for those tools.
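A minimal sketch of what such launcher-side exceptions could look like (hypothetical names and data-case labels, not the actual launcher code):

```python
# Hypothetical skip list: known-failing tool/data combinations that the launcher
# could consult before dispatching a run, instead of retrying them on every
# tool upgrade or force run.
SKIP = {
    ("pandas", "G1_1e9_2e0_0_0"),   # 1e9 rows, K=2: known to fail
    ("dask",   "G1_1e9_2e0_0_0"),
}

def should_run(tool: str, data_name: str) -> bool:
    """Return False for combinations we already know will fail."""
    return (tool, data_name) not in SKIP
```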

Currently the three tools that handle 1e9 k=2 successfully spend 4 hours in total on it, while the three other datasets (k=1e2, k=1e1 and k=1e2 sorted) take 5 hours in total for those tools.

jangorecki commented 5 years ago

It is worth noting that, due to this extremely unbalanced dataset, we have a lot of exception handling in the benchplot functions, simply because there are so few results for 1e9 k=2. Just another one today: https://github.com/h2oai/db-benchmark/commit/cf21830c8926ee37696c1589f6aff6379e2449b8

jangorecki commented 5 years ago

The clickhouse MergeTree table engine is also being killed on q3 for 1e9 k=2.

jangorecki commented 5 years ago

As agreed with Matt before, we can close this one. It is useful to know if/when the issue will be handled, so let's keep the k=2 factor even though it complicates handling all the tools.