vaexio / vaex

Out-of-Core hybrid Apache Arrow/NumPy DataFrame for Python, ML, visualization and exploration of big tabular data at a billion rows per second 🚀
https://vaex.io
MIT License
8.28k stars, 590 forks

[BUG-REPORT] Incorrect(?) filtering after apply operation. #2110

Closed: aetherwind closed this issue 2 years ago

aetherwind commented 2 years ago

Description

The code is below. I wanted to compute a gradient over all rows of my dataframe, i.e. the difference between the current row and the n-th row after it (or vice versa). I used apply to apply the function (probably not great style). After the apply, I filter on the newly created column. This filtering 'fails' (see screenshot). Maybe I missed something (certainly a possibility).

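import numpy as np
import vaex

def calc_grad(x, n):
    # n is passed as a string through `arguments`, so cast it first
    n = int(n)
    # forward difference over n rows: x[i+n] - x[i]
    y = x[n:] - x[:-n]
    # pad the tail with n zeros so the result has the same length as x
    for i in np.arange(n):
        y = np.append(y, 0)
    return y

x = np.arange(10)
y = x**2
v = vaex.from_arrays(x=x, y=y)
v['new'] = v.apply(calc_grad, arguments=[v.y, '2'], vectorize=True)
v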

dff = v[v.new<30]
dff

Software information

Additional information: [two screenshots]

aetherwind commented 2 years ago

Okay, I understand what happens.

I guess the only workaround is to export the dataframe before filtering?

JovanVeljanoski commented 2 years ago

Replying to your original post.

It is not simple to do window/rolling functions in vaex, which is what you seem to be attempting, judging by your code. We do not support that at the moment. It is a rather complicated area to implement efficiently, and it requires a lot of effort. We are looking for a partner who is willing to fund the development of this.

In the meantime, I am sorry, but I can't provide more help. You can browse the issues to find relevant discussions of the same topic:

Also, please browse the issue board before opening a new thread - it helps the content be more focused, so others can find everything in one place.

aetherwind commented 2 years ago

Thanks very much for the input @JovanVeljanoski and especially for offering help!

I only wanted to mention that I saw some behaviour that was confusing me (from a user perspective, not from a technical point of view):

  1. I create a new column
  2. I filter on it
  3. I print the dataframe

What seems to happen, which is logical from the point of view of lazy evaluation but may be confusing for the user, is:

  1. Vaex creates a virtual column new = x-y
  2. Vaex filters on it (in the example above it drops row 7 from the dataframe)
  3. Vaex evaluates 'new' again on the filtered dataframe --> the values in 'new' are now different from what they were before the filtering.
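
The effect described in these steps can be reproduced with NumPy alone. This is only a minimal sketch, assuming (as observed above) that the virtual column is recomputed on just the rows that survive the filter. It does not call vaex, and calc_grad here is a compact equivalent of the function from the original post.

```python
import numpy as np

def calc_grad(x, n):
    # forward difference over n rows, padded with n zeros at the tail
    n = int(n)
    diff = x[n:] - x[:-n]
    return np.concatenate([diff, np.zeros(n, dtype=diff.dtype)])

y = np.arange(10) ** 2

# Step 1: evaluate 'new' on the full data
new_full = calc_grad(y, 2)

# Step 2: build the filter from those values (only row 7 fails new < 30)
mask = new_full < 30

# What one might expect: the already computed values, filtered
eager = new_full[mask]

# What the lazy re-evaluation amounts to (assumption): the function
# is applied again, but only to the rows that passed the filter
lazy = calc_grad(y[mask], 2)

print(eager)  # [ 4  8 12 16 20 24 28  0  0]
print(lazy)   # [ 4  8 12 16 20 39 45  0  0]
```

The two results differ from row 5 onwards because, after row 7 is dropped, the rows that used to be two positions apart are no longer the same pairs. Any function that depends on neighbouring rows would show the same effect.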

This behaviour is different when using df.select() e.g.:

v.select(v.new<30)
print(v.new.sum(selection=True))
# vs.
print(v[v.new<30].new.sum())

or df.shift:

v2 = vaex.from_arrays(x=x,y=y)
v2['ys'] = v2.y
v2 = v2.shift(periods=2,column='ys')
v2['new'] = v2.y-v2.ys
v2
v2[v2.new<30]

Therefore, I was a little stunned when I saw the results after my script had run, as at first the behaviour seems inconsistent.

However - no need for action from my perspective.