h2oai / db-benchmark

reproducible benchmark of database-like ops
https://h2oai.github.io/db-benchmark
Mozilla Public License 2.0
323 stars 85 forks

Add quick check that result is correct too #58

Closed mattdowle closed 5 years ago

mattdowle commented 5 years ago

As far as I understand, the nrows and ncols of the result from each product are checked, but it's currently possible that the result's data is not correct. With multi-threading and more complex code, incorrect results become more likely, especially since we are testing the latest dev versions of each product. Occasionally the right dimensions might be returned but the contents could be all-NA or otherwise messed up.

Checking just 3 fixed rows (near the beginning, middle and end) should take trivial run time to compute and would also ensure the result has been materialized. A very clever delayed-evaluation might be able to calculate only those 3 rows and defer the rest; if any products do that, we can think about it later. Include the time to check those 3 rows in the time for the task. Since we allow each product to return the results in a slightly different way (ordering the groups or not, etc.), the 3-row test may be slightly different for each product, which is ok.
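A minimal sketch of such a spot check, assuming a pandas answer (the function name and data here are made up for illustration, not db-benchmark's actual code):

```python
import pandas as pd

def spot_check(result: pd.DataFrame, expected: pd.DataFrame) -> bool:
    """Compare rows near the beginning, middle and end of the result.

    Touching these rows also forces a lazy engine to materialize
    at least that part of the result.
    """
    n = len(result)
    idx = [0, n // 2, n - 1]
    return result.iloc[idx].reset_index(drop=True).equals(
        expected.iloc[idx].reset_index(drop=True)
    )

# toy example: a tiny grouped sum
df = pd.DataFrame({"g": ["a", "a", "b", "b"], "v": [1, 2, 3, 4]})
ans = df.groupby("g", as_index=False)["v"].sum()
print(spot_check(ans, pd.DataFrame({"g": ["a", "b"], "v": [3, 7]})))  # True
```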

jangorecki commented 5 years ago

We log the number of rows and columns of each answer. We also check the grand total of each measure we aggregate. This was done not to validate results but to ensure that queries are actually evaluated when they should be.
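Roughly sketched in pandas (the `make_chk` helper and the separator format are assumptions for illustration, not the benchmark's actual code), the grand-total check amounts to:

```python
import pandas as pd

def make_chk(ans: pd.DataFrame, measure_cols) -> str:
    """Concatenate the column sums of the aggregated measures.

    Summing every value forces a lazy engine to materialize the
    whole result, not just its dimensions.
    """
    return ";".join(str(ans[c].sum()) for c in measure_cols)

ans = pd.DataFrame({"id": ["a", "b"], "v1_sum": [3.0, 7.0]})
print(make_chk(ans, ["v1_sum"]))  # 10.0
```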

A very clever delayed-evaluation might be able to only calculate those 3 rows and delay the rest.

This should not be a problem, because the extra check would be performed outside of the timing section.

I would go with head 3 and tail 3 of the answer after each question's second run.
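In pandas terms this is just a concatenation of `head(3)` and `tail(3)` (a hypothetical helper, not the benchmark's actual implementation):

```python
import pandas as pd

def head_tail(ans: pd.DataFrame) -> pd.DataFrame:
    """Take the first 3 and last 3 rows of an answer for logging/validation."""
    return pd.concat([ans.head(3), ans.tail(3)]).reset_index(drop=True)

ans = pd.DataFrame({"g": list("abcdefg"), "v": range(7)})
print(head_tail(ans))
```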

jangorecki commented 5 years ago

Done, except tail for Spark: due to https://issues.apache.org/jira/browse/SPARK-26433 we will check only head for Spark.

jangorecki commented 5 years ago

Verified on 1e8 1e2 0 0 data (Rscript groupby-datagen.R 1e8 1e2 0 0) that the answer values produced by different solutions match each other.
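Because solutions may return groups in different orders, a cross-solution comparison has to sort by the grouping key first. A sketch, assuming pandas answers (function name and data invented for illustration):

```python
import pandas as pd

def answers_match(a: pd.DataFrame, b: pd.DataFrame, key: str) -> bool:
    """Compare two solutions' answers, ignoring group order."""
    a_sorted = a.sort_values(key).reset_index(drop=True)
    b_sorted = b.sort_values(key).reset_index(drop=True)
    return a_sorted.equals(b_sorted)

# e.g. one solution returns groups in insertion order, another sorted
x = pd.DataFrame({"id": ["b", "a"], "v_sum": [7, 3]})
y = pd.DataFrame({"id": ["a", "b"], "v_sum": [3, 7]})
print(answers_match(x, y, "id"))  # True
```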

Some useful notes for future validations:

retains order of groups:

returns groups in sorted order:

returns in random order:

important notes:

mattdowle commented 5 years ago

It's great that the total of each column in the result is checked for correctness. That's a very good idea and, as you say, a nice way to ensure the calculation has not been deferred by the product. I didn't know you'd added this check.

So I've reopened this issue just to state this somewhere on the main page! It's a good feature that conveys the care that has been taken over the benchmarks. Please add it at the bottom, in a notes section either above or below the configuration section. I'm sure there are other notes we'll think of to add in future.

jangorecki commented 5 years ago

I had to add it together with the timing of calculating the totals, because distributed solutions (in the old db-benchmark) were lazier. So the chk value (concatenated total sums of the aggregated columns) is logged, as well as the timing to calculate it, but validation of the totals is not plugged into any continuously running script. It was run interactively using https://github.com/h2oai/db-benchmark/blob/master/answers-validation.R, which is now outdated. I will
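Comparing logged chk totals across solutions also needs a floating-point tolerance, since summation order differs between engines. A rough Python sketch (the real answers-validation.R is an R script; the separator and tolerance here are assumptions):

```python
def chk_equal(chk_a: str, chk_b: str, tol: float = 1e-6) -> bool:
    """Compare two ';'-separated chk strings with a relative tolerance.

    Different engines sum in different orders, so totals can differ
    by tiny floating-point amounts without being wrong.
    """
    va = [float(x) for x in chk_a.split(";")]
    vb = [float(x) for x in chk_b.split(";")]
    return len(va) == len(vb) and all(
        abs(a - b) <= tol * max(1.0, abs(a)) for a, b in zip(va, vb)
    )

print(chk_equal("10.0;3.5", "10.000000001;3.5"))  # True
```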