h2oai / db-benchmark

reproducible benchmark of database-like ops
https://h2oai.github.io/db-benchmark
Mozilla Public License 2.0

not enough memory to read 1e9 data #111

Open jangorecki opened 4 years ago

jangorecki commented 4 years ago

Although the csv file is under 50 GB, pandas and dask are unable to read it on a 125 GB machine; both run out of memory. As a result, the pandas and dask groupby tasks run only for the 1e7 (0.5 GB) and 1e8 (5 GB) data sizes. My understanding is that the root cause is likely the same in both cases: the memory-inefficient way DataFrames store strings.
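To illustrate why string columns can take far more memory than the csv text, here is a rough sketch using only the Python standard library. It assumes a column stored as 8-byte pointers to separate Python `str` objects (as with an object-dtype column) and that every row holds its own string object, which is an upper bound when values repeat. The helper names and the sample values are hypothetical, and the exact sizes depend on the Python build.

```python
import sys


def estimate_object_column_bytes(values):
    """Estimate memory for a column stored as pointers to Python str objects.

    Assumes 8-byte pointers (64-bit build) and one separate str object per
    row, so this is an upper bound when values repeat.
    """
    pointer_bytes = 8 * len(values)
    string_bytes = sum(sys.getsizeof(v) for v in values)
    return pointer_bytes + string_bytes


def csv_bytes(values):
    """Bytes the same values occupy as csv text (value plus one delimiter)."""
    return sum(len(v.encode("utf-8")) + 1 for v in values)


# Hypothetical sample: 100 short ID-like strings, "id001" .. "id100".
sample = [f"id{i:03d}" for i in range(1, 101)]

in_memory = estimate_object_column_bytes(sample)
on_disk = csv_bytes(sample)
ratio = in_memory / on_disk

print(f"in memory: {in_memory} B, as csv: {on_disk} B, ratio: {ratio:.1f}x")
```

On a real DataFrame, `df.memory_usage(deep=True)` reports the measured per-column footprint and is the more reliable check.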

jangorecki commented 4 years ago

dask addresses this issue by using on-disk data storage https://github.com/h2oai/db-benchmark/issues/126
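For context, the general idea behind keeping data on disk is that an aggregation can be computed without holding every row in memory. The sketch below shows a streaming group-by sum with the standard library. It is not dask's implementation, and the function name, the column names `id1` and `v1`, and the inline sample data are assumptions made for illustration.

```python
import csv
import io
from collections import defaultdict


def streaming_group_sum(rows, key, value):
    """Sum `value` per `key`, consuming one row at a time.

    Only the running totals are kept, so memory grows with the number of
    distinct groups rather than with the number of rows.
    """
    totals = defaultdict(float)
    for row in rows:
        totals[row[key]] += float(row[value])
    return dict(totals)


# Hypothetical stand-in for a large csv file; a real run would pass an
# open file handle to csv.DictReader instead of an in-memory buffer.
data = io.StringIO("id1,v1\nid001,1\nid002,2\nid001,3\n")
result = streaming_group_sum(csv.DictReader(data), key="id1", value="v1")

print(result)
```

This only helps when the number of groups is small relative to the number of rows; a high-cardinality key would still need a spill-to-disk strategy.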

jangorecki commented 4 years ago

This is still an issue for pandas 1.0.3. For dask, I will check once #144 is merged.

jangorecki commented 4 years ago

dask is now able to load the 1e9 data after #144.

jangorecki commented 4 years ago

Unfortunately, it cannot complete any of the groupby queries, so I am reverting to the on-disk format again.