h2oai / db-benchmark

reproducible benchmark of database-like ops
https://h2oai.github.io/db-benchmark
Mozilla Public License 2.0

pandas/dask try to optimise read_csv to load 1e9 rows data #99

Closed jangorecki closed 5 years ago

jangorecki commented 5 years ago

As documented in https://github.com/h2oai/db-benchmark/issues/111, pandas and dask currently fail on the 1e9 rows data when attempting to read the csv. Try these suggestions, which may help: https://stackoverflow.com/a/27232309/2490497 https://www.dataquest.io/blog/pandas-big-data/

jangorecki commented 5 years ago

Did not help for pandas. Tried dtype, engine="c", low_memory=True. See https://github.com/pandas-dev/pandas/issues/22194 for more information. What might help in the future is https://github.com/h2oai/datatable/issues/1691
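
For reference, a minimal sketch of what passing those three options to `pandas.read_csv` looks like. This is not the benchmark's actual script: the column names, the dtype choices, and the tiny in-memory CSV are assumptions made up for illustration, and on a file this small the options make no practical difference.

```python
import io

import pandas as pd

# Tiny in-memory stand-in for the benchmark CSV.
# Column names and values are illustrative assumptions, not the real data.
SAMPLE_CSV = "id1,id4,v1,v3\nid001,7,1,0.5\nid002,3,4,1.25\nid001,9,2,3.0\n"

# Explicit dtypes so the parser does not have to infer a type per column.
# Narrower integer types and categoricals are assumed choices for this sketch.
DTYPES = {"id1": "category", "id4": "int32", "v1": "int32", "v3": "float64"}


def read_with_options(source):
    """Read a CSV with the options mentioned above: dtype, engine, low_memory."""
    return pd.read_csv(source, dtype=DTYPES, engine="c", low_memory=True)


df = read_with_options(io.StringIO(SAMPLE_CSV))
print(df.dtypes)
print(df.memory_usage(deep=True).sum(), "bytes")
```

Declaring `dtype` up front avoids type inference and can shrink the resulting frame, but the whole table still has to fit in memory, which is consistent with these options not being enough for the 1e9 rows case.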