vladmihaisima opened 1 year ago
We generally don't recommend using apply: https://vaex.io/docs/tutorial.html#The-escape-hatch:-apply
In this case you are using normal Python strings, which will probably increase memory usage. Did you try using only the string operations? https://vaex.io/docs/api.html#string-operations
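To illustrate the difference between a row-wise apply and vectorized string operations, here is a minimal sketch using pandas (whose `.str` API is similar to vaex's; the small frame below is made up for illustration):

```python
import pandas as pd

# Small made-up frame mirroring the columns in this issue
df = pd.DataFrame({'REF': ['A', 'G', 'C'], 'POS': [10, 20, 30]})

# Row-wise apply: every row passes through a Python-level lambda
via_apply = df.apply(lambda r: f"{r.REF}-{r.POS}", axis=1)

# Vectorized string operations: concatenation stays inside the library
via_str_ops = df['REF'].str.cat(df['POS'].astype(str), sep='-')

print(via_str_ops.tolist())  # ['A-10', 'G-20', 'C-30'] for both variants
```

Both produce the same strings; the vectorized form avoids materializing intermediate Python string objects per row.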
As apply can be useful in some circumstances, and works as expected most of the time, it would be good to at least emit a warning for the cases in which apply can be >1_000_000 times less memory efficient (of course, not having such cases at all would be ideal).
Understanding now that this could arise, and since in this case I can use only string operations, I could rewrite the code as below. This was not straightforward to develop compared to the apply version (for example, the documentation for cat in the string operations does not mention that it works with a constant, and not converting POS to a string results in a non-obvious error), and overall it is a bit harder to read.
```python
# Prepare test data
import pandas as pd

rows = 150_000
df = pd.DataFrame.from_dict({'REF': [('A' * 50_000)] + ['G'] * rows,
                             'ALT': ['C'] * (rows + 1),
                             'POS': list(range(0, rows + 1))})
df.to_parquet("temp.parquet")

import vaex as vx

d = vx.open("temp.parquet")
d['VID'] = d.func.where(~d.REF.isna(),
                        d.REF.str.cat("-").str.cat(d.POS.astype('str')),
                        d.POS.astype('str'))
d.export_parquet("temp_out.parquet")
```
Description
When using apply to conditionally concatenate string columns, if one row has a column value of very different length (tens of thousands of times larger than the same column in other rows), memory usage increases dramatically.
I suspect that apply resizes the output buffer and always uses the "largest" buffer possible.
Because of this, if you (for example) export in chunks, you may or may not hit the problem depending on where the problematic row lands (at the beginning or at the end of a chunk).
I understand that this might be seen as a performance issue (not a bug), but such an issue can be frustrating when using vaex, as one can get seemingly "random" failures without any warning. A warning such as "apply increased buffer size by 1000x previous size" would at least hint at the problem.
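The "largest buffer" hypothesis can be illustrated outside vaex with NumPy's fixed-width unicode arrays, where a single long element forces every slot to the same itemsize (the array contents below are made up; this is an analogy for the suspected mechanism, not vaex's actual internals):

```python
import numpy as np

rows = 1_000
uniform = np.array(['G'] * rows)                 # every element 1 char wide
skewed = np.array(['A' * 5_000] + ['G'] * rows)  # one long outlier

# NumPy's fixed-width unicode dtype sizes every slot to the longest
# string, so a single outlier inflates the whole buffer ~5000x.
print(uniform.dtype, uniform.nbytes)  # <U1    -> 4_000 bytes
print(skewed.dtype, skewed.nbytes)    # <U5000 -> ~20 MB
```

With the issue's numbers (a 50_000-character outlier among 150_000 one-character rows), the same per-slot sizing would explain a blow-up of several orders of magnitude.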
Software information
```python
>>> import vaex; vaex.__version__
{'vaex': '4.12.0', 'vaex-core': '4.12.0', 'vaex-viz': '0.5.3', 'vaex-hdf5': '0.12.3', 'vaex-server': '0.8.1', 'vaex-astro': '0.9.1', 'vaex-jupyter': '0.8.0', 'vaex-ml': '0.18.0'}
```
Additional information
Code to reproduce issue: see above.