vaexio / vaex

Out-of-Core hybrid Apache Arrow/NumPy DataFrame for Python, ML, visualization and exploration of big tabular data at a billion rows per second 🚀
https://vaex.io
MIT License

[BUG-REPORT] using apply with certain data allocates lots of memory #2304

Open vladmihaisima opened 1 year ago

vladmihaisima commented 1 year ago

Description

When using apply to conditionally concatenate string columns, memory usage increases dramatically if one row has a column value that is far longer (tens of thousands of times) than the values of the same column in other rows.

I suspect that apply resizes the output buffer and always uses the "largest" buffer possible.
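A rough back-of-the-envelope sketch of that suspicion, assuming (this is not confirmed anywhere in the thread) that the results of apply end up in a fixed-width string buffer in which every row is padded to the longest value, as in a NumPy `U` array with 4 bytes per character. The helper names `fixed_width_bytes` and `variable_width_bytes` are hypothetical and only do arithmetic; they do not call vaex.

```python
def fixed_width_bytes(values, bytes_per_char=4):
    """Size if every row is padded to the longest string (assumed layout)."""
    longest = max((len(v) for v in values), default=0)
    return bytes_per_char * longest * len(values)


def variable_width_bytes(values, offset_size=4):
    """Size of an Arrow-style layout: UTF-8 data plus one offset per row (and one extra)."""
    data = sum(len(v.encode("utf-8")) for v in values)
    return data + offset_size * (len(values) + 1)


# Same shape as the reproduction data: one very long REF, then 150_000 short rows.
rows = 150_000
vid = ["A" * 50_000 + "-C"] + ["G-C"] * rows

padded = fixed_width_bytes(vid)
compact = variable_width_bytes(vid)
print(f"padded:  {padded / 1e9:.1f} GB")
print(f"compact: {compact / 1e6:.1f} MB")
```

Under these assumptions the padded layout needs about 30 GB while the variable-width layout needs about 1.1 MB, which is consistent with the memory usage reported here but does not prove that this is what apply does internally.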

Because of this, if you export in chunks (for example), you may or may not hit the problem, depending on where the problematic row falls (at the beginning or at the end of a chunk).

I understand that this might be seen as a performance issue rather than a bug, but it can be frustrating when using vaex, since one can get seemingly "random" failures without any warning. A warning such as "apply increased buffer size by 1000x previous size" would at least hint at the problem.
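Until something like that exists in the library, a similar hint can be produced on the user side. The following is a minimal sketch, not a vaex feature: `warn_on_length_spread` is a hypothetical wrapper around the per-row function that warns once when the longest result is at least `factor` times longer than the shortest one seen so far. It assumes the function is called once per row in a single process (as with `multiprocessing=False`), since the counters live in a closure.

```python
import warnings


def warn_on_length_spread(func, factor=1000):
    """Wrap a per-row function and warn once if result lengths differ by `factor` or more."""
    shortest = None
    longest = 0
    warned = False

    def wrapper(*args):
        nonlocal shortest, longest, warned
        result = func(*args)
        n = len(result)
        shortest = n if shortest is None else min(shortest, n)
        longest = max(longest, n)
        # max(shortest, 1) avoids treating empty strings as an infinite ratio.
        if not warned and longest >= factor * max(shortest, 1):
            warned = True
            warnings.warn(
                f"result lengths range from {shortest} to {longest} characters",
                RuntimeWarning,
            )
        return result

    return wrapper


# The lambda from the reproduction code, wrapped.
make_vid = warn_on_length_spread(
    lambda r, a, p: f"{r}-{a}" if r is not None and a is not None else f"{p}"
)
```

`make_vid` could then be passed to `d.apply(...)` in place of the bare lambda. The warning only reports the spread in result lengths; it does not reduce memory usage.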

Software information

Additional information

Code to reproduce the issue:

# Prepare test data
import pandas as pd
rows = 150_000
df = pd.DataFrame.from_dict({'REF': [('A'*50_000)]+['G']*rows, 'ALT': ['C']*(rows+1), 'POS': list(range(0,rows+1))})
df.to_parquet("temp.parquet")

# Following code uses between 20GB and 30GB to generate a 1MB file from a 1MB input
import vaex as vx
d = vx.open("temp.parquet")
d['VID'] = d.apply(lambda r,a,p: f"{r}-{a}" if r is not None and a is not None else f"{p}", arguments=[d.REF, d.ALT, d.POS], multiprocessing=False)
d.export_parquet("temp_out.parquet")

# Workaround: dump two parquet (such that the long REF row is processed separately)
import vaex as vx
d = vx.open("temp.parquet")
d['VID'] = d.apply(lambda r,a,p: f"{r}-{a}" if r is not None and a is not None else f"{p}", arguments=[d.REF, d.ALT, d.POS], multiprocessing=False)
d[0:1].export_parquet("temp_out_chunk1.parquet")
d[1:].export_parquet("temp_out_chunk2.parquet")
d = vx.open(["temp_out_chunk1.parquet","temp_out_chunk2.parquet"])
d.export_parquet("temp_out_chunk.parquet")
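
The workaround above hard-codes the position of the long row. A minimal sketch of how the split points could be computed when the positions are not known in advance, assuming the string lengths of the column are available as a plain list; `isolate_long_rows` and the `threshold` value are hypothetical and not part of vaex.

```python
def isolate_long_rows(lengths, threshold):
    """Return half-open (start, end) ranges so that each row longer than
    `threshold` sits in a range of its own."""
    ranges = []
    start = 0
    for i, n in enumerate(lengths):
        if n > threshold:
            if start < i:
                ranges.append((start, i))  # short rows before the long one
            ranges.append((i, i + 1))      # the long row on its own
            start = i + 1
    if start < len(lengths):
        ranges.append((start, len(lengths)))  # remaining short rows
    return ranges


# Same shape as the reproduction data: the long REF is the first row.
lengths = [50_000] + [1] * 150_000
ranges = isolate_long_rows(lengths, threshold=1_000)
print(ranges)
```

Each `(start, end)` pair corresponds to a slice such as `d[start:end]`, which could be exported separately in the same way as `d[0:1]` and `d[1:]` in the workaround.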

maartenbreddels commented 1 year ago

We generally don't recommend using apply: https://vaex.io/docs/tutorial.html#The-escape-hatch:-apply

In this case you are using normal Python strings, which will probably increase memory usage. Did you try using only the string operations? https://vaex.io/docs/api.html#string-operations

vladmihaisima commented 1 year ago

As apply can be useful in some circumstances, and works as expected most of the time, it would be good to have at least a warning for the cases in which apply can be >1_000_000 times less memory efficient (of course, not having such cases would be ideal).

Now that I understand this can happen, and since in this case I can use only string operations, I could rewrite the code as below. This was not as straightforward to develop as the apply version (for example, the documentation for cat in the string operations does not mention that it works with a constant, and not converting POS to a string results in a non-obvious error), and overall it is a bit harder to read.

# Prepare test data
import pandas as pd
rows = 150_000
df = pd.DataFrame.from_dict({'REF': [('A'*50_000)]+['G']*rows, 'ALT': ['C']*(rows+1), 'POS': list(range(0,rows+1))})
df.to_parquet("temp.parquet")

import vaex as vx
d = vx.open("temp.parquet")
d['VID'] = d.func.where(~d.REF.isna() & ~d.ALT.isna(),
        d.REF.str.cat("-").str.cat(d.ALT),  # REF-ALT, matching the apply version
        d.POS.astype('str'))  # otherwise fall back to POS as a string
d.export_parquet("temp_out.parquet")