h2oai / db-benchmark

reproducible benchmark of database-like ops
https://h2oai.github.io/db-benchmark
Mozilla Public License 2.0

dask incorrect query results of groupby q7 #81

Closed jangorecki closed 5 years ago

jangorecki commented 5 years ago

Head and tail of the results:

     id2  id4  range_v1_v2
0  id001    1          NaN
1  id002    1          4.0
2  id001    2          NaN
     id2  id4  range_v1_v2
5  id002    2          NaN
6  id001    1          4.0
7  id002    1          NaN

and

     id2  id4            r2
0  id001    1           NaN
1  id002    1  3.365820e-08
2  id001    2           NaN
     id2  id4            r2
5  id002    2           NaN
6  id001    1  1.046100e-07
7  id002    1           NaN

Check why there are NaN values in these results.

Known issue: https://github.com/dask/dask/issues/4372

jangorecki commented 5 years ago

This issue can be considered resolved: the pandas apply API is not the right one to use in dask for a reduction, as explained in https://github.com/dask/dask/pull/4800. Note the follow-up issue https://github.com/h2oai/db-benchmark/issues/86 for a proper implementation of those questions.
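
To illustrate why a reduction needs different handling from a per-group apply on partitioned data, here is a minimal pure-Python sketch. It does not use dask and is not dask's implementation. It assumes `range_v1_v2` means max(v1) - min(v2) per (id2, id4) group, and the function names and sample rows are hypothetical. Each partition produces partial aggregates, the partials are combined across partitions, and only then is the final value computed.

```python
def partial_agg(rows):
    """Per-partition step: track (max v1, min v2) for each (id2, id4) group."""
    acc = {}
    for id2, id4, v1, v2 in rows:
        key = (id2, id4)
        if key in acc:
            mx, mn = acc[key]
            acc[key] = (max(mx, v1), min(mn, v2))
        else:
            acc[key] = (v1, v2)
    return acc


def combine(partials):
    """Merge the per-partition partial aggregates into one result per group."""
    out = {}
    for part in partials:
        for key, (mx, mn) in part.items():
            if key in out:
                omx, omn = out[key]
                out[key] = (max(omx, mx), min(omn, mn))
            else:
                out[key] = (mx, mn)
    return out


def range_v1_v2(partitions):
    """Finalize: max(v1) - min(v2) per group, computed after combining."""
    combined = combine(partial_agg(p) for p in partitions)
    return {key: mx - mn for key, (mx, mn) in combined.items()}


# Hypothetical data: the same groups appear in both partitions.
partitions = [
    [("id001", 1, 5, 3), ("id002", 1, 2, 4)],
    [("id001", 1, 1, 1), ("id002", 1, 6, 2)],
]
result = range_v1_v2(partitions)
print(result)
```

Computing max(v1) - min(v2) inside a single partition would see only part of each group, which is why the final subtraction is deferred until after the combine step.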