Open Ben-Epstein opened 1 year ago
Hey @Ben-Epstein
Thanks for this!
Quick question - is this any way related to multi-d arrays or it is a question of the virtual column with missing data. I think i've seen something before that is like this.. basically this: https://github.com/vaexio/vaex/pull/1888
If this is the same error as the one exposed in my PR, it will be much better :)
(i.e. if the issue is unrelated to the n-d arrays etc..)
Cheers,
@JovanVeljanoski it seems directly related to the multi-d array. Here's an example of nearly the identical scenario but without the multi-d array, just using regular columns
import vaex
import numpy as np
df = vaex.from_arrays(
text= [f"sample {i}" for i in range(100_000)],
start=np.random.randint(0,500,100_000),
end=np.random.randint(0,500,100_000),
valid=np.random.randint(0,2, 100_000),
probs=np.random.rand(100_000)#, 500) single-d array
)
@vaex.register_function()
def get_prob(probs, idxs):
return np.divide(probs, idxs)
# return np.take(probs, idxs) single-d array
df1 = df[df["valid"]].copy()
df1["join_col"] = df1["text"].astype(str) + "|" + df1["start"].astype(str) + "|" + df1["end"].astype(str)
new_base = df1.copy()
df1["prob_val"] = df1.func.get_prob(df1.probs, df1.start)
# df1 = df1.materialize(["prob_val"])
new_base = new_base.join(df1[["prob_val", "join_col"]], on="join_col")
new_base
Ok, I don't understand this completely at this point (and it is getting kinda late here), but here is (another?) workaround:
before running the join operation, do: new_base = new_base.extract()
. It will "fix" the mask thing. Might (should?) be more memory efficient perhaps, if that is a requirement.
I will wait for @maartenbreddels to find some time and see if this is a deeper bag or not.
Thanks for the reproducible example!
Hey @JovanVeljanoski thanks for taking looks!
I did try extract
before opening the issue (should have mentioned that), but extract doesn't work in the case of multi-d arrays. Maybe that is a hint of what the issue is?
The only thing i've found that works is to fullymaterialize
the column (either with materialize
, evaluate
or to_numpy
). I use materialize
because that's what you told me is the best practice.
While memory is important, in my production case i'm materializing a float32 column, so it's not such a big deal for now. I need to do it for anywhere between 3-10 float32 cols, so not ideal, but not a crazy issue. There's certainly no major rush or blocker on my end for this, just wanted to document :)
I think the bigger loss than memory is actually performance. Since i need to call materialize
one at a time to do the join (in my case), I can't keep things lazy and run everything in parallel.
Thank you for reaching out and helping us improve Vaex!
Before you submit a new Issue, please read through the documentation. Also, make sure you search through the Open and Closed Issues - your request may already be discussed or addressed.
Description I know Vaex doesn't officially support multi-dimensional arrays in columns, but they are incredibly effective and they work very well when you use hdf5 files in vaex. I use them consistently in production.
There is one issue (the first i've found actually) with them. When you use them in a registered function, and then filter a dataframe, the mask created breaks any future joins. It would be great if you could support these types of masked joins for keeping virtual columns!
Here's a working example
This will throw a mask error. If you uncomment that 3rd to last line, materializing the virtual column, everything works as expected.
By no means is this crucial, but tracking it here for future work/fun projects!