vaexio / vaex

Out-of-Core hybrid Apache Arrow/NumPy DataFrame for Python, ML, visualization and exploration of big tabular data at a billion rows per second 🚀
https://vaex.io
MIT License
8.22k stars 590 forks source link

[BUG-REPORT] Slow and wrong computation when using join and apply #2298

Open saluto opened 1 year ago

saluto commented 1 year ago

Description

First, thanks for the impressive and helpful software! We greatly appreciate it!

Please have a look at the following script:

import vaex as vx
a = vx.open("a.arrow")  # ca. 40M rows, 10 columns (incl. "idx", "start_char_idx", "end_char_idx")
b = vx.open("b.arrow")  # ca. 100K rows, 80 columns (incl. "id", "full_text")
b["idx"] = vx.vrange(0, len(a), dtype="int64")
ab = a.join(b, "idx")  # Relation is N:1. Issue exists with both `how="left"` and `how="inner"`.
ab["text"] = ab.apply(lambda s, i, j: s[i:j], [ab.full_text, ab.start_char_idx, ab.end_char_idx])
ab_part = ab[ab.id == "some_id"]  # Result has ca. 500 rows.
texts = ab_part.text.tolist()  # runtime: ca. 5 minutes, result: incorrect (all empty strings)
texts2 = ab_part.head(len(ab_part)).text.tolist()  # runtime: ca. 2 seconds, result: correct strings

I expect text == text2, with equal runtime. But computing text is very slow and returns all empty strings (wrong).

When using a = a.head(len(a)), b = b.head(len(b)) after opening, instead of ab_part.head(len(ab_part)) in the end, the result is correct IFF we use inner join, but it is still equally slow. (Not sure about correctness. Need to test again.)

Any ideas what's the reason?

Unfortunately, I cannot share the data. And I couldn't yet generate an artificial example. Let me know if I can help otherwise.

Also, if there is a better way to achieve the above (i.e. slicing texts based on ranges given by columns in different dataframe), I'd be glad to know about it.

Anyway, thank you for your work!

Software information

JovanVeljanoski commented 1 year ago

Hey, so without your data (or similar synthetic/fake data) it is hard to comment. I see nothing wrong with your code. I tried to reproduce the issues, with the example data that we include with vaex like this:

import vaex

df_left = vaex.datasets.titanic()
df_left['id'] = vaex.vrange(start=0, stop=len(df_left), dtype='int')

df_right = vaex.datasets.titanic()
df_right['id'] = vaex.vrange(start=0, stop=len(df_right), dtype='int')
df_right['custom'] = (df_right['name'].str.len()/3).astype('int')  # For ending of the string

df_left = df_left[['id', 'name']].extract()
df_right = df_right[['id', 'parch', 'custom']]

df = df_left.join(other=df_right, on='id')
df['text'] = df.apply(lambda s, i, j: s[i:j], arguments=[df['name'], df['parch'], df['custom']])

print(df)

Seems to work as expected?

image

I am using the latest version so you can try updating. Otherwise a reproducible example is necessary I would say..