Open saluto opened 1 year ago
Hey, without your data (or similar synthetic data) it is hard to comment. I see nothing wrong with your code. I tried to reproduce the issue with the example data that ships with vaex, like this:
import vaex
df_left = vaex.datasets.titanic()
df_left['id'] = vaex.vrange(start=0, stop=len(df_left), dtype='int')
df_right = vaex.datasets.titanic()
df_right['id'] = vaex.vrange(start=0, stop=len(df_right), dtype='int')
df_right['custom'] = (df_right['name'].str.len()/3).astype('int') # For ending of the string
df_left = df_left[['id', 'name']].extract()
df_right = df_right[['id', 'parch', 'custom']]
df = df_left.join(other=df_right, on='id')
df['text'] = df.apply(lambda s, i, j: s[i:j], arguments=[df['name'], df['parch'], df['custom']])
print(df)
Seems to work as expected?
I am using the latest version, so you can try updating. Otherwise, I would say a reproducible example is necessary.
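If the real data cannot be shared, one possible way to get to a reproducible example is to generate fake data of the same shape. Below is a minimal sketch in plain Python, not tied to vaex internals. The column names (`id`, `text`, `start`, `stop`) and the helper functions are hypothetical, chosen only for illustration. It also computes a reference result with ordinary `s[i:j]` slicing, which the output of the joined dataframe could be compared against.

```python
import random
import string


def make_synthetic_rows(n_rows, seed=0):
    """Build fake rows: an id, a random lowercase text, and a [start, stop) range inside it."""
    rng = random.Random(seed)  # seeded so the example is reproducible
    rows = []
    for i in range(n_rows):
        length = rng.randint(5, 30)
        text = "".join(rng.choices(string.ascii_lowercase, k=length))
        start = rng.randint(0, length - 1)
        stop = rng.randint(start, length)  # guarantees start <= stop <= len(text)
        rows.append({"id": i, "text": text, "start": start, "stop": stop})
    return rows


def expected_slices(rows):
    """Reference result: the same s[i:j] slicing as the lambda above, row by row."""
    return [row["text"][row["start"]:row["stop"]] for row in rows]


rows = make_synthetic_rows(1000)
reference = expected_slices(rows)
```

The `text` column could then go into one dataframe and the `start`/`stop` columns into another (for example via `vaex.from_arrays`), joined on `id` as in the snippet above, and the resulting column compared against `reference`. Whether this triggers the same slowdown depends on properties of the real data that I can only guess at.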
Description
First, thanks for the impressive and helpful software! We greatly appreciate it!
Please have a look at the following script:
I expect `text == text2`, with equal runtime. But computing `text` is very slow and returns all empty strings (wrong).
When using `a = a.head(len(a))`, `b = b.head(len(b))` after opening, instead of `ab_part.head(len(ab_part))` in the end, the result is correct IFF we use inner join, but it is still equally slow. (Not sure about correctness. Need to test again.)
Any ideas what's the reason?
Unfortunately, I cannot share the data. And I couldn't yet generate an artificial example. Let me know if I can help otherwise.
Also, if there is a better way to achieve the above (i.e. slicing texts based on ranges given by columns in a different dataframe), I'd be glad to know about it.
Anyway, thank you for your work!
Software information