vaexio / vaex

Out-of-Core hybrid Apache Arrow/NumPy DataFrame for Python, ML, visualization and exploration of big tabular data at a billion rows per second 🚀
https://vaex.io

[FEATURE-REQUEST] Masked joins on multi-dimensional arrays #2278

Open · Ben-Epstein opened this issue 1 year ago

Ben-Epstein commented 1 year ago

Description

I know Vaex doesn't officially support multi-dimensional arrays in columns, but they are incredibly effective and they work very well with hdf5 files in Vaex. I use them consistently in production.
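
For context, the kind of usage I mean is roughly this (a minimal sketch; the file path is just an illustration):

import numpy as np
import vaex

# a 2-d numpy array stored as a single column
df = vaex.from_arrays(probs=np.random.rand(1_000, 500))

# it round-trips through hdf5 just fine
df.export_hdf5("/tmp/probs.hdf5")
df = vaex.open("/tmp/probs.hdf5")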

There is one issue with them (the first I've actually found): when you use them in a registered function and then filter the dataframe, the resulting mask breaks any future joins. It would be great if you could support these masked joins so that virtual columns can be kept!

Here's a reproducible example:

import vaex
import numpy as np

df = vaex.from_arrays(
    text=[f"sample {i}" for i in range(100_000)],
    start=np.random.randint(0, 500, 100_000),
    end=np.random.randint(0, 500, 100_000),
    valid=np.random.randint(0, 2, 100_000),
    probs=np.random.rand(100_000, 500),
)

@vaex.register_function()
def get_prob(probs, idxs):
    return np.take(probs, idxs)

df1 = df[df["valid"]].copy()
df1["join_col"] = df1["text"].astype(str) + "|" + df1["start"].astype(str) + "|" + df1["end"].astype(str) 

new_base = df1.copy() 

df1["prob_val"] = df1.func.get_prob(df1.probs, df1.start)
# df1 = df1.materialize(["prob_val"])

new_base = new_base.join(df1[["prob_val", "join_col"]], on="join_col")
new_base

This will throw a mask error. If you uncomment the materialize line, so that the virtual column is materialized, everything works as expected.
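
For reference, the working variant with that line uncommented is effectively:

df1["prob_val"] = df1.func.get_prob(df1.probs, df1.start)
df1 = df1.materialize(["prob_val"])  # evaluates prob_val into memory, so the join no longer hits the mask
new_base = new_base.join(df1[["prob_val", "join_col"]], on="join_col")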

By no means is this crucial, but I'm tracking it here for future work/fun projects!

JovanVeljanoski commented 1 year ago

Hey @Ben-Epstein

Thanks for this!

Quick question - is this in any way related to the multi-d arrays, or is it a question of a virtual column with missing data? I think I've seen something like this before.. basically this: https://github.com/vaexio/vaex/pull/1888

If this is the same error as the one exposed in my PR, that would be much better :)
(i.e. if the issue is unrelated to the n-d arrays etc.)

Cheers,

Ben-Epstein commented 1 year ago

@JovanVeljanoski it seems directly related to the multi-d array. Here's a nearly identical scenario, but without the multi-d array, just using regular columns:

import vaex
import numpy as np

df = vaex.from_arrays(
    text=[f"sample {i}" for i in range(100_000)],
    start=np.random.randint(0, 500, 100_000),
    end=np.random.randint(0, 500, 100_000),
    valid=np.random.randint(0, 2, 100_000),
    probs=np.random.rand(100_000),  # 1-d now, instead of (100_000, 500)
)

@vaex.register_function()
def get_prob(probs, idxs):
    return np.divide(probs, idxs)  # element-wise, since probs is 1-d here
    # return np.take(probs, idxs)  # the multi-d version from the first example

df1 = df[df["valid"]].copy()
df1["join_col"] = df1["text"].astype(str) + "|" + df1["start"].astype(str) + "|" + df1["end"].astype(str) 

new_base = df1.copy() 

df1["prob_val"] = df1.func.get_prob(df1.probs, df1.start)
# df1 = df1.materialize(["prob_val"])

new_base = new_base.join(df1[["prob_val", "join_col"]], on="join_col")
new_base

JovanVeljanoski commented 1 year ago

Ok, I don't understand this completely at this point (and it is getting kinda late here), but here is (another?) workaround:

Before running the join operation, do new_base = new_base.extract(). It will "fix" the mask thing. It might (should?) also be more memory efficient, if that is a requirement.
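
Something like this, on top of your example above (just a sketch):

new_base = new_base.extract()  # drops the filter/mask by extracting only the selected rows
new_base = new_base.join(df1[["prob_val", "join_col"]], on="join_col")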

I will wait for @maartenbreddels to find some time and see if this is a deeper bug or not.

Thanks for the reproducible example!

Ben-Epstein commented 1 year ago

Hey @JovanVeljanoski, thanks for taking a look!

I did try extract before opening the issue (I should have mentioned that), but extract doesn't work in the case of multi-d arrays. Maybe that is a hint as to what the issue is?

The only thing I've found that works is to fully materialize the column (either with materialize, evaluate or to_numpy). I use materialize because that's what you told me is the best practice.

While memory is important, in my production case I'm materializing a float32 column, so it's not such a big deal for now. I need to do it for anywhere between 3 and 10 float32 columns, so it's not ideal, but not a crazy issue. There's certainly no major rush or blocker on my end for this, I just wanted to document it :)

I think the bigger loss than memory is actually performance. Since I need to call materialize on each column one at a time before the join (in my case), I can't keep things lazy and run everything in parallel.