vaexio / vaex

Out-of-Core hybrid Apache Arrow/NumPy DataFrame for Python, ML, visualization and exploration of big tabular data at a billion rows per second 🚀
https://vaex.io
MIT License

[BUG-REPORT] Cannot add virtual column that is output of different dataframe #1661

Closed Ben-Epstein closed 3 years ago

Ben-Epstein commented 3 years ago


Description

If I have two dataframes, df1 and df2, and I want the output of an expression on df1 to be added as a column of df2, I currently cannot.

import vaex
import numpy as np

ids = list(range(100))
data1 = {
    'a': np.random.random(100),
    'id': ids
}
df1 = vaex.from_arrays(**data1)

data2 = {
    'b': np.random.random(100),
    'id': ids
}
df2 = vaex.from_arrays(**data2)

@vaex.register_function()
def add1(v):
    return v+1

df2.add_virtual_column('c',df1.a.add1())
df2.c.to_numpy()  # NameError: name 'a' is not defined

If I run the same thing but change that second to last line to

df2.c = df1.a.add1()

this runs, but not as expected. After an export the resulting values are gone, and printing df2.virtual_columns shows it is empty. I think it is attaching the expression as a plain Python attribute, not as an actual vaex column.

I also tried something like

df1['c'] = df1.a.add1()
df2.add_virtual_column('c',df1['c'])
df2

but that gave me a massive stack trace ending in

RecursionError: maximum recursion depth exceeded

same with

df2.add_virtual_column('a',df1['a'])
df2['c'] = df2['a'].add1()
df2

The only workaround I can think of for now is something like

df2['c'] = df1['a'].add1().to_numpy() 

which is of course not ideal.

Do you have any suggestions/workarounds?

Thanks!

yohplala commented 3 years ago

Hi,

To combine data from two distinct vaex DataFrames, you first have to join them. In my opinion, the name "DataFrame" is somewhat misleading in the vaex world. You should see a vaex DataFrame as a list of pending commands to be run when the time comes. These commands are Expressions. An Expression is relative to exactly one DataFrame; it has no meaning in another DataFrame.

And as stated in the documentation of add_virtual_column(), the second parameter is expected to be an Expression.

So before anything else, join them: df2 = df2.join(df1)

Bests,

JovanVeljanoski commented 3 years ago

In Vaex, DataFrames are sort of "islands"; they do not really interact with each other, much like tables in an SQL database do not interact with each other.

So indeed, as @yohplala said, join is your best bet. You can also join without providing any key, and in that case 2 dataframes will just be put next to each other. That is convenient if you know for example that the ordering is the same, since it is super fast (basically as fast as if you had one big dataframe).

You can also add single columns to a dataframe, but they need to be in-memory structures. So you need to do something like external_column = df1['col1'].values and then df2['external_column'] = external_column. Keep in mind that if you go for this approach, the length of the external column must be the same as the unfiltered length of the target dataframe. You can also see this.

The approach in the OP will not work, for the reasons @yohplala already stated. You can think of an expression in vaex as a mathematical expression such as a+b. It is stored as such (as a formula or a command) until it needs to be executed. We sometimes call "columns" the data that already exists on disk or in memory, ready to use. An expression is more general: it can simply point to data in memory or on disk, or it can be a mathematical expression waiting to be executed to produce the results.
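To illustrate why the lookup of `a` fails in the other dataframe, here is a toy model in plain Python. It is not how vaex is implemented; `ToyFrame` and its methods are hypothetical names made up for this sketch. It only shows the idea of a formula that is stored as text and resolved later against the columns of the dataframe that owns it.

```python
class ToyFrame:
    """Toy stand-in for a lazy dataframe: real columns plus stored formulas."""

    def __init__(self, **columns):
        self.columns = dict(columns)  # name -> list of values held in memory
        self.virtual = {}             # name -> formula string, not yet computed

    def add_virtual_column(self, name, formula):
        # Only the formula is stored; nothing is evaluated at this point.
        self.virtual[name] = formula

    def evaluate(self, name):
        if name in self.columns:
            return list(self.columns[name])
        formula = self.virtual[name]
        n_rows = len(next(iter(self.columns.values())))
        result = []
        for i in range(n_rows):
            # Names in the formula are resolved against THIS frame's columns only.
            scope = {col: values[i] for col, values in self.columns.items()}
            result.append(eval(formula, {"__builtins__": {}}, scope))
        return result


df1 = ToyFrame(a=[1.0, 2.0, 3.0])
df2 = ToyFrame(b=[10.0, 20.0, 30.0])

# The formula refers to 'a', which df1 owns, so evaluation succeeds.
df1.add_virtual_column("c", "a + 1")
c_in_df1 = df1.evaluate("c")

# The same formula attached to df2 fails at evaluation time: df2 has no 'a'.
df2.add_virtual_column("c", "a + 1")
try:
    df2.evaluate("c")
    error_message = None
except NameError as err:
    error_message = str(err)
```

In this toy, adding the formula to `df2` succeeds and the error only appears on evaluation, which mirrors the order of events in the original report.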

Also, for small(?) questions, especially usage-based ones, maybe you can join Slack?

Ben-Epstein commented 3 years ago

@JovanVeljanoski I'd love to join the slack. I didn't know there was one, where is it mentioned?