Closed jangorecki closed 4 years ago
So far we have ensured that results are materialized for cudf, data.table and pydatatable. For pydatatable this required imposing extra overhead, which should be reverted once https://github.com/h2oai/datatable/issues/2443 is resolved. This issue can be considered resolved for now; we will come back to it if the need arises.
All 5 basic questions currently defined for the `join` task do not effectively force all computations to finish. We print `nrow` and `ncol` of the answer to each question. Unfortunately, this is not enough to force the answer to be materialized. To know the `nrow` of the answer it is enough to compute the matching rows, without necessarily performing the join of both datasets. `ncol` is obvious from the query alone, without even looking at the data.

As a result we should ensure that such an optimization does not take place, either by using the API of a solution to force that part of the computation, or by changing the queries to include an extra computation that actually requires the data to be materialized. The extra computation could be either an artificial one (`head` and `tail`) or a more real-life use of the data after joining. The latter is problematic because such a real-life computation will blur the join timing, and in some cases is likely to make the reported timing diverge heavily from the actual join timing. Thus IMO the best way would be to force the computation via the API of a solution.
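To illustrate the concern, here is a minimal pure-Python sketch of a lazily evaluated inner join. The `LazyJoin` class and its `materialize` method are hypothetical and do not correspond to the API of any benchmarked solution; they only show how `nrow` and `ncol` can be answered without building a single joined row, and why an explicit forcing step is needed.

```python
from collections import Counter


class LazyJoin:
    """Hypothetical lazy inner join of two lists of (key, value) rows."""

    def __init__(self, left, right):
        self.left = left
        self.right = right
        self.materialized = None  # set only once joined rows are actually built

    def ncol(self):
        # Known from the query alone: key, left value, right value.
        return 3

    def nrow(self):
        # Counts matches per key; no joined row is ever constructed.
        right_counts = Counter(k for k, _ in self.right)
        return sum(right_counts[k] for k, _ in self.left)

    def materialize(self):
        # Builds every joined row: the work a join benchmark wants to time.
        index = {}
        for k, v in self.right:
            index.setdefault(k, []).append(v)
        self.materialized = [
            (k, lv, rv) for k, lv in self.left for rv in index.get(k, [])
        ]
        return self.materialized


left = [(1, "a"), (2, "b"), (2, "c"), (3, "d")]
right = [(2, 10.0), (2, 20.0), (3, 30.0), (4, 40.0)]

ans = LazyJoin(left, right)
shape = (ans.nrow(), ans.ncol())
# Printing the shape alone leaves the join result unbuilt.
was_materialized_by_shape = ans.materialized is not None

# Forcing through the (hypothetical) API is what guarantees the rows exist.
rows = ans.materialize()
```

In this sketch, timing only the `nrow`/`ncol` calls would measure key matching rather than the join itself, which is the gap described above.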