[BUG-REPORT] Incorrect(?) filtering after apply operation.

aetherwind commented 2 years ago

Description

The code is below. I wanted to compute a gradient over all rows of my dataframe, i.e. the difference between the current row and the row n positions after it (or vice versa). I used apply to run the function (probably not great style). After the apply, I filter on the newly created column. This filtering 'fails' (see screenshot). Maybe I missed something (certainly a possibility).

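import numpy as np
import vaex

def calc_grad(x, n):
    n = int(n)
    # forward difference: x[i+n] - x[i]
    y = x[n:] - x[:-n]
    # pad with n zeros so the result has the same length as x
    for i in np.arange(n):
        y = np.append(y, 0)
    return y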

x = np.arange(10)
y=x**2
v = vaex.from_arrays(x=x, y=y)
v['new'] = v.apply(calc_grad, arguments=[v.y,'2'], vectorize=True)
v

dff = v[v.new<30]
dff

Software information

{'vaex': '4.9.2', 'vaex-core': '4.9.2', 'vaex-viz': '0.5.2', 'vaex-hdf5': '0.12.2', 'vaex-server': '0.8.1', 'vaex-astro': '0.9.1', 'vaex-jupyter': '0.8.0', 'vaex-ml': '0.17.0'}
Installed via pipenv
OS: Ubuntu

Additional information

aetherwind commented 2 years ago

Okay, I understand what happens.

The filtering operation dff = v[v.new<30] removes the original row 7 from the dataframe.
When dff is then displayed, .apply(calc_grad, arguments=[v.y,'2'], vectorize=True) is run again on the dataframe that is now missing row 7.

I guess the only workaround is to export the dataframe before filtering?
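The effect described above can be reproduced without vaex. The following is a minimal sketch in plain Python lists, assuming (as this comment suggests) that the applied function is re-run on only the rows that survive the filter. The names new_full, new_eager and new_lazy are illustrative and not part of vaex.

```python
def calc_grad(values, n):
    """Return values[i + n] - values[i], padded with n trailing zeros."""
    diffs = [values[i + n] - values[i] for i in range(len(values) - n)]
    return diffs + [0] * n


y = [x**2 for x in range(10)]

# Evaluate the column once on all rows, then build the filter from it.
new_full = calc_grad(y, 2)
keep = [value < 30 for value in new_full]

# Eager: filter the already computed values (what an export would preserve).
new_eager = [value for value, k in zip(new_full, keep) if k]

# Lazy (assumed behaviour): drop the rows first, then re-run the function
# on whatever rows are left.
y_filtered = [value for value, k in zip(y, keep) if k]
new_lazy = calc_grad(y_filtered, 2)

print(new_eager)
print(new_lazy)
```

In this model the eager result only contains values below 30, while the lazy result contains 39 and 45: with row 7 gone, the function pairs rows that were not two positions apart in the original data.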

JovanVeljanoski commented 2 years ago

Replying to your original post.

It is not simple to do window/rolling functions in vaex, which is what you seem to be attempting, judging by your code. We do not provide support for that at the moment. It is a rather complicated area to implement efficiently and requires a lot of effort. We are looking for a partner willing to fund the development of this.

In the meantime, I am sorry, but I can't provide more help. You can browse the issues to find relevant discussions of the same topic.

Also, please browse the issue board before opening a new thread; it keeps the content focused, so others can find everything in one place.

aetherwind commented 2 years ago

Thanks very much for the input @JovanVeljanoski and especially for offering help!

I only wanted to mention that I saw some behaviour that was confusing me (from a user perspective, not from a technical point of view):

I create a new column
I filter on it
I print the dataframe

What seems to happen, and is logical from the point of view of lazy evaluation but maybe confusing for the user, is:

Vaex creates a virtual column new = x-y
Vaex filters on it (in the example above it drops row 7 from the dataframe)
Vaex again evaluates 'new' on the filtered dataframe --> the values in 'new' now differ from what they were before the filtering.

This behaviour is different when using df.select() e.g.:

v.select(v.new<30)
print(v.new.sum(selection=True))
# vs.
print(v[v.new<30].new.sum())

or df.shift:

v2 = vaex.from_arrays(x=x,y=y)
v2['ys'] = v2.y
v2 = v2.shift(periods=2,column='ys')
v2['new'] = v2.y-v2.ys
v2
v2[v2.new<30]
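The difference between a selection and a filter can be modelled in plain Python. This is only a sketch under two assumptions that are not confirmed vaex internals: a selection acts as a boolean mask over the unchanged rows, while a filter drops rows and the applied function is then recomputed on what remains. The names selection_sum and filter_sum are hypothetical.

```python
def calc_grad(values, n):
    """Return values[i + n] - values[i], padded with n trailing zeros."""
    diffs = [values[i + n] - values[i] for i in range(len(values) - n)]
    return diffs + [0] * n


y = [x**2 for x in range(10)]
new = calc_grad(y, 2)
mask = [value < 30 for value in new]

# Selection model: rows stay in place, the mask only limits what is summed.
selection_sum = sum(value for value, m in zip(new, mask) if m)

# Filter model (assumed): rows are dropped, then 'new' is recomputed on the rest.
remaining_y = [value for value, m in zip(y, mask) if m]
filter_sum = sum(calc_grad(remaining_y, 2))

print(selection_sum, filter_sum)
```

Under these assumptions the two sums disagree (112 versus 144), which is the kind of inconsistency a user would notice when comparing the two approaches.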

Therefore, I was a little stunned when I saw the results after my script had run, as at first the behaviour seems inconsistent.

However - no need for action from my perspective.

vaexio / vaex

[BUG-REPORT] Incorrect(?) filtering after apply operation. #2110