h2oai / db-benchmark

reproducible benchmark of database-like ops
https://h2oai.github.io/db-benchmark
Mozilla Public License 2.0

pandas (and dask) cannot yet handle NAs during groupby #171

Open jangorecki opened 3 years ago

jangorecki commented 3 years ago

The dropna argument was added to pandas groupby in 1.1.0, but it does not yet support categorical fields: it silently produces an incorrect answer. Data cases that contain NAs (e.g. G1_1e7_1e2_5_0) will have to be skipped for now for those two solutions. We will enable them once https://github.com/pandas-dev/pandas/issues/36327 is resolved.
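To make the problem concrete, here is a minimal sketch, assuming a pandas version that has the dropna argument on groupby. The toy frame and the column names id1 and v1 are hypothetical stand-ins, not the benchmark data. The categorical result is printed rather than asserted because it depends on the installed pandas version.

```python
import pandas as pd

# Hypothetical toy data: one grouping key that contains a missing value.
df = pd.DataFrame({"id1": ["a", "b", None, "a"], "v1": [1, 2, 3, 4]})

# Default (dropna=True): rows whose key is missing are left out of the result.
dropped = df.groupby("id1")["v1"].sum()

# dropna=False: the missing key forms its own group, so no rows are lost.
kept = df.groupby("id1", dropna=False)["v1"].sum()

# Same query on a categorical key, the case this issue reports as unreliable.
cat = df.assign(id1=df["id1"].astype("category"))
cat_kept = cat.groupby("id1", dropna=False, observed=True)["v1"].sum()

# Sanity check: if no rows were dropped, group totals add up to the column total.
total = df["v1"].sum()
print("object key keeps all rows:     ", kept.sum() == total)
print("categorical key keeps all rows:", cat_kept.sum() == total)
```

If the second check prints False, the installed pandas is still affected and the NA group was dropped without any warning.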

jangorecki commented 3 years ago

In the case of dask, it has not been implemented at all yet, not just for the categorical type. https://github.com/dask/dask/issues/6986