h2oai / db-benchmark

reproducible benchmark of database-like ops
https://h2oai.github.io/db-benchmark
Mozilla Public License 2.0
322 stars 85 forks

join questions should ensure answers are materialized #141

Closed jangorecki closed 4 years ago

jangorecki commented 4 years ago

All 5 basic questions currently defined for the join task do not effectively force all computations to finish. We print the nrow and ncol of the answer to each question. Unfortunately, this is not enough to force the answer to be materialized. To know the nrow of the answer it is enough to compute the matching rows, without necessarily performing the join of both datasets. Ncol is obvious from the query alone, without even looking at the data. We should therefore ensure that no such optimization takes place, either by using the API of a solution to force that part of the computation, or by changing the queries to include an extra computation that actually requires the data to be materialized. The extra computation could be either an artificial one (head and tail) or a more real-life use of the data after joining. The latter is problematic because such a real-life computation will blur the join timing, in some cases likely making the reported timing diverge heavily from the actual join timing. Thus IMO the best way is to force the computation via the API of a solution.
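To illustrate the concern, here is a minimal sketch in plain Python of how a lazy engine could report nrow and ncol without ever building the joined rows. `LazyJoin`, its methods, and the sample data are hypothetical and written only for this illustration; they are not the API of any benchmarked solution.

```python
from collections import Counter


class LazyJoin:
    """Hypothetical lazy inner join on one key; not a real solution's API."""

    def __init__(self, left, right, key):
        self.left, self.right, self.key = left, right, key
        self.materialized = False  # flips to True only when rows are built

    def shape(self):
        # nrow needs only the key columns: count matches per key value.
        right_counts = Counter(r[self.key] for r in self.right)
        nrow = sum(right_counts[l[self.key]] for l in self.left)
        # ncol follows from the two schemas alone (the key column is shared).
        ncol = len(self.left[0]) + len(self.right[0]) - 1
        return nrow, ncol

    def materialize(self):
        # The expensive part: actually combine the payload columns.
        index = {}
        for r in self.right:
            index.setdefault(r[self.key], []).append(r)
        rows = [{**l, **r} for l in self.left for r in index.get(l[self.key], [])]
        self.materialized = True
        return rows


# Made-up sample data for the illustration.
left = [{"id": 1, "v1": 10}, {"id": 2, "v1": 20}, {"id": 3, "v1": 30}]
right = [{"id": 1, "v2": "a"}, {"id": 2, "v2": "b"}, {"id": 2, "v2": "c"}]

join = LazyJoin(left, right, "id")
shape_before = join.shape()          # dimensions are known...
was_materialized = join.materialized  # ...but no joined row exists yet
rows = join.materialize()            # this is the work a timing should cover
```

In this sketch, timing only `shape()` would measure key matching rather than the join itself, which is why the timed region would need to include a call equivalent to `materialize()`.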

jangorecki commented 4 years ago

So far we have ensured that results are materialized for cudf, data.table and pydatatable. In the case of pydatatable, extra overhead had to be imposed, which will need to be reverted once https://github.com/h2oai/datatable/issues/2443 is resolved. This issue can be considered resolved for now. We will come back to it if the need arises.