vaexio / vaex

[BUG-REPORT] Potential memory leak when exporting large strings to hdf5 #2334

Open · Ben-Epstein opened this issue 1 year ago

Ben-Epstein commented 1 year ago

Description

If you run this on a memory-limited machine, such as the free tier of Google Colab, you get an OOM crash when exporting to HDF5, even though exporting the same data to Arrow works fine. We need to cast the string column to large_string because of a pyarrow issue: https://issues.apache.org/jira/browse/ARROW-17828

import vaex
import pyarrow as pa

df = vaex.example()
# Give every row a large string value to inflate the column
df["text"] = vaex.vconstant("OHYEA" * 10000, len(df))

@vaex.register_function()
def to_large(arr):
    # Cast to Arrow large_string to work around ARROW-17828
    return arr.cast(pa.large_string())

df["text"] = df["text"].to_large()
# OOM crash happens here on a memory-limited machine
df.export("file.hdf5")
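
For contrast, exporting the same DataFrame to Arrow succeeds. A minimal sketch, assuming the df built in the snippet above (the file name is illustrative):

# Exporting the same data to Arrow completes with low memory use
# (see the follow-up comment below); vaex picks the output format
# from the file extension.
df.export("file.arrow")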
Ben-Epstein commented 1 year ago

You should be able to reproduce it here https://colab.research.google.com/drive/1J085UZolLNcaL8zhVKY0LQzbgFMXnYur?usp=sharing

[screenshot from the Colab notebook showing the crash]
Ben-Epstein commented 1 year ago

(Also shown in the notebook above.) When you export to Arrow, it works fine: it takes about a minute and memory stays very low.

[screenshot: memory stays low during the Arrow export]
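
One way to check the memory behavior described above outside of Colab is to sample the process's resident set size while an export runs. A minimal sketch, assuming psutil is installed and the df from the original snippet; watch_memory and the polling interval are illustrative, not part of the report:

import os
import threading
import time

import psutil


def watch_memory(stop, interval=0.5):
    # Poll this process's RSS until stop is set, then report the peak.
    proc = psutil.Process(os.getpid())
    peak = 0
    while not stop.is_set():
        peak = max(peak, proc.memory_info().rss)
        time.sleep(interval)
    print(f"peak RSS: {peak / 1024**2:.0f} MiB")


stop = threading.Event()
watcher = threading.Thread(target=watch_memory, args=(stop,))
watcher.start()
df.export("file.arrow")  # swap in "file.hdf5" to reproduce the OOM climb
stop.set()
watcher.join()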