vaexio / vaex

Out-of-Core hybrid Apache Arrow/NumPy DataFrame for Python, ML, visualization and exploration of big tabular data at a billion rows per second 🚀
https://vaex.io
MIT License

[Q] How large a CSV file can vaex open? #1531

Closed akeliduo closed 3 years ago

akeliduo commented 3 years ago

When using `vaex.open(filename)`, I get a `MemoryError`. What is the size of the largest file that vaex can open? And what should I do if I don't want to open it in chunks? Is it possible not to use chunks? Thanks.

kmcentush commented 3 years ago

If the file is a CSV, vaex uses pandas under the hood to load it, so the memory limit is whatever pandas dictates. Vaex shines with file types like Parquet, HDF5, etc. that can be read and transformed in small chunks, as opposed to requiring the entire file to be read into memory first.

I suggest converting the CSV file (in chunks) into a format such as HDF5 or Parquet, then loading the converted file with vaex.

JovanVeljanoski commented 3 years ago

Keep in mind that chunk size in vaex works differently than in pandas. In pandas it gives you a generator, so you loop over portions of the data (separate DataFrames). In vaex it is the number of rows read at a time while the data is loaded into intermediate files and converted to HDF5 or Arrow, so that afterwards you can easily work with the whole dataset as one DataFrame.
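For contrast, here is a short sketch of the pandas side of that difference. The helper name `count_rows_in_chunks` is hypothetical; `pd.read_csv(..., chunksize=...)` is the standard pandas call, and it returns an iterator of DataFrames rather than one DataFrame.

```python
import pandas as pd


def count_rows_in_chunks(csv_path, chunksize=1_000_000):
    """Count the rows of a CSV without loading it all at once.

    Hypothetical helper: with `chunksize` set, pandas yields one
    DataFrame per chunk, and each chunk must be handled separately.
    """
    total = 0
    for chunk in pd.read_csv(csv_path, chunksize=chunksize):
        # `chunk` is an ordinary DataFrame holding at most `chunksize` rows.
        total += len(chunk)
    return total
```

With pandas, any computation over the whole file has to be accumulated by hand across chunks like this. With vaex, the chunked conversion happens once and you then query a single DataFrame.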

The rest is as @kmcentush said.

akeliduo commented 3 years ago

Thank you all.