h2oai / datatable

A Python package for manipulating 2-dimensional tabular data structures
https://datatable.readthedocs.io
Mozilla Public License 2.0

fread - OOM issue #3475

Open gautam-ergo opened 1 year ago

gautam-ergo commented 1 year ago

Issue:

+1 for the awesome package.

I'm hitting an out-of-memory (OOM) error when reading a large CSV file (~200 million rows) with fread. The datatable code runs on an AKS cluster with 20Gi of RAM, and it's still tripping.

Code:

image

csv file:

image

Expected behavior: Being able to load the file and perform manipulations.

Environment:
datatable version - 1.0.0
python version - 3.8.0
operating system - linux

Any pointers on how to load and manipulate a file of this size in a memory-efficient way would be a great help.

Thanks.

Mathanraj-Sharma commented 1 year ago

@gautam-ergo try converting your data to Jay format and loading the Jay file instead of the CSV. See:

https://github.com/h2oai/datatable/issues/2860#issuecomment-783646224

https://stackoverflow.com/questions/57653983/is-the-jay-file-format-specific-to-python-datatable