h2oai / db-benchmark

reproducible benchmark of database-like ops
https://h2oai.github.io/db-benchmark
Mozilla Public License 2.0

dask memory regression in groupby q10 #176

Closed jangorecki closed 3 years ago

jangorecki commented 3 years ago

Dask seems to compute a cross product of categorical columns during groupby, but unlike pandas, it does not let the user disable that. Because of this it runs out of memory even on the smaller 0.5 GB CSV data. Reported upstream: https://github.com/dask/dask/issues/7024
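For illustration, here is a minimal pandas-only sketch of the behaviour described above. It assumes the pandas switch in question is the `observed` argument of `groupby`; the column names `id1`, `id2`, `v` and the tiny data are made up for the example and are not the benchmark's actual q10 query.

```python
import pandas as pd

# Hypothetical toy data: two categorical key columns with 3 categories each,
# but only 3 of the 9 possible (id1, id2) combinations actually occur.
df = pd.DataFrame({
    "id1": pd.Categorical(["a", "b", "c"]),
    "id2": pd.Categorical(["x", "y", "z"]),
    "v": [1, 2, 3],
})

# observed=False: the result is indexed by the full cross product of categories.
full = df.groupby(["id1", "id2"], observed=False)["v"].sum()

# observed=True: only combinations present in the data are kept.
seen = df.groupby(["id1", "id2"], observed=True)["v"].sum()

print(len(full), len(seen))
```

With many high-cardinality categorical keys the cross product grows multiplicatively, which would be consistent with the memory blow-up reported here.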

jangorecki commented 3 years ago

resolved by 45ee6d56e6ead953211e348260bac5dc3e3f3e18