vaexio / vaex

Out-of-Core hybrid Apache Arrow/NumPy DataFrame for Python, ML, visualization and exploration of big tabular data at a billion rows per second 🚀
https://vaex.io
MIT License

Vaex uses too little memory #2372

Open zincopper opened 1 year ago

zincopper commented 1 year ago

I have 180 GB of memory and want to process a 400 GB dataset.

When I use `from_csv_arrow` with `lazy=True`, `chunk_size="10GiB"`, and `newline_readahead="640MiB"`, vaex only uses around 2 GB of memory, which makes the processing really slow.
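For reference, here is roughly how I am calling it (the file path below is just a placeholder for my dataset):

```python
import vaex

# Lazy streaming read: only small chunks of the CSV are materialized
# at a time, so resident memory stays around 2 GB, but every pass over
# the data has to re-read and re-parse the file, which is slow.
df = vaex.from_csv_arrow(
    "data.csv",                   # placeholder for the 400 GB CSV
    lazy=True,
    chunk_size="10GiB",
    newline_readahead="640MiB",
)
```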

If I read all the data into memory, the computation is extremely fast. I tried a 40 GB dataset and that works fine, but I cannot read 400 GB into memory, and vaex does not seem to take advantage of the memory that is available.
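For the smaller dataset I just read everything eagerly, along these lines (again with a placeholder path):

```python
import vaex

# Eager read: the whole CSV is parsed into RAM up front. This works
# for the ~40 GB file on a 180 GB machine and computations on it are
# very fast, but a 400 GB file cannot fit in memory this way.
df = vaex.from_csv_arrow("data_small.csv", lazy=False)
```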

Is there some configuration I am missing? What should I do? I am stuck on this problem and really need your help.