vaexio / vaex

[BUG-REPORT] Potential memory leak when exporting large strings to hdf5 #2334

Open · Ben-Epstein opened this issue 1 year ago

Ben-Epstein commented 1 year ago

Description

If you run this on a memory-limited machine, such as the free tier of Google Colab, you get an OOM crash when exporting to HDF5, even though exporting the same data to Arrow works fine. We need to cast the string column to large_string because of a pyarrow issue: https://issues.apache.org/jira/browse/ARROW-17828

import vaex
import pyarrow as pa

df = vaex.example()
# Give every row a large string value to inflate the column
df["text"] = vaex.vconstant("OHYEA" * 10000, len(df))

@vaex.register_function()
def to_large(arr):
    # Cast to Arrow large_string to work around ARROW-17828
    return arr.cast(pa.large_string())

df["text"] = df["text"].to_large()
# OOM crash happens here on a memory-limited machine
df.export("file.hdf5")
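
For contrast, exporting the same DataFrame to Arrow succeeds. A minimal sketch, assuming the df built in the snippet above (the file name is illustrative):

# Exporting the same data to Arrow completes with low memory use
# (see the follow-up comment below); vaex picks the output format
# from the file extension.
df.export("file.arrow")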
Ben-Epstein commented 1 year ago

You should be able to reproduce it here https://colab.research.google.com/drive/1J085UZolLNcaL8zhVKY0LQzbgFMXnYur?usp=sharing

[screenshot from the Colab notebook showing the crash]
Ben-Epstein commented 1 year ago

(Also shown in the notebook above.) When you export to Arrow, it works fine: it takes about a minute and memory stays very low.

[screenshot: memory stays low during the Arrow export]
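
One way to check the memory behavior described above outside of Colab is to sample the process's resident set size while an export runs. A minimal sketch, assuming psutil is installed and the df from the original snippet; watch_memory and the polling interval are illustrative, not part of the report:

import os
import threading
import time

import psutil


def watch_memory(stop, interval=0.5):
    # Poll this process's RSS until stop is set, then report the peak.
    proc = psutil.Process(os.getpid())
    peak = 0
    while not stop.is_set():
        peak = max(peak, proc.memory_info().rss)
        time.sleep(interval)
    print(f"peak RSS: {peak / 1024**2:.0f} MiB")


stop = threading.Event()
watcher = threading.Thread(target=watch_memory, args=(stop,))
watcher.start()
df.export("file.arrow")  # swap in "file.hdf5" to reproduce the OOM climb
stop.set()
watcher.join()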