vaexio / vaex

Out-of-Core hybrid Apache Arrow/NumPy DataFrame for Python, ML, visualization and exploration of big tabular data at a billion rows per second 🚀
https://vaex.io
MIT License
8.23k stars 590 forks source link

[BUG-REPORT] to_pandas_df throws exception on filtered data frame on virtual column, with empty result #2232

Closed vladmihaisima closed 1 year ago

vladmihaisima commented 1 year ago

Description to_pandas_df throws exception when invoked on empty filter. This happens only when there is a virtual column with a transformation (it will not happen if col1 is not modified see code below)

Software information

Additional information Code to reproduce. First two prints work, third one throws exception.

import pandas as pd
import vaex as vx

df = vx.from_pandas(pd.DataFrame(data={'col1':['chr1','chr2'],'col2':[3,4]}))
print(df[df['col1']=='3'].extract().to_pandas_df()) # Works
df['col1'] = df.col1.astype('str').str.replace('^chr','',regex=True)
print(df[df['col1']=='3'].extract()) # Works
print(df[df['col1']=='3'].extract().to_pandas_df()) # Throws exception

results in: AssertionError for assert filter.sum() == expected_length, in dataset.py, line 1018, in slice

JovanVeljanoski commented 1 year ago

Thanks for the report. This seems to be an edge case. I needed to convince myself that it is a bug.

Do you understand why it happens?

Basically, you do not need to call .extract() if you plan to export the data to pandas or any other in-memory representation. So if you remove the .extract() part of your examples, things will work as expected.

Still we need to address this. I wonder whether doing a .extract() on an empty dataframe makes sense and whether it should be allowed at all..? @maartenbreddels

vladmihaisima commented 1 year ago

The above code is a minimal reproducible example. I got the problem in a code base in which the vaex DataFrame was passed around between functions, so at the point .to_pandas_df() was invoked there was no easy way to know what was the last operation performed. In .extract() documentation (https://vaex.readthedocs.io/en/latest/_modules/vaex/dataframe.html#DataFrame.extract) there is a mention The resulting DataFrame may be more efficient to work with when the original DataFrame is heavily filtered which made sense for our code and worked most of the times, just that at some point the filtering was so heavy that the result was empty.

maartenbreddels commented 1 year ago

This is difficult to fix, unless we refactor how we handle the schema. It is implicit now, and on demand. I think if we passed down the schema we could fix this, but I'm not sure what that would do. If would say, the workaround for now is

df = df.extract() if len(df) else df
maartenbreddels commented 1 year ago

FYI, we will raise an error in the future: https://github.com/vaexio/vaex/pull/2267

JovanVeljanoski commented 1 year ago

We might revisit this sometime down the line, but for now we consider it done. The error message is quite informative I think.

Ben-Epstein commented 1 year ago

@maartenbreddels @JovanVeljanoski this is a new issue, and is not reproducible in vaex 4.12.. I'm only seeing it in 4.15 in fact.

It seems as though extract on an empty dataframe used to work. It might be worth reopening and seeing what changed?

JovanVeljanoski commented 1 year ago

Did it work in the case of virtual columns (that are not materialized etc.). I don't really see how it could have worked.. since we do not know the dtype before evaluation (if I am not mistaken)

maartenbreddels commented 1 year ago

Ok, so, with an extract that dataframe is a bit in limbo, some things can work, depending if you use the virtual column or not. I think #2284 is the better way for now.