Closed vladmihaisima closed 1 year ago
Thanks for the report. This seems to be an edge case. I needed to convince myself that it is a bug.
Do you understand why it happens?
Basically, you do not need to call .extract()
if you plan to export the data to pandas or any other in-memory representation. So if you remove the .extract()
part of your examples, things will work as expected.
Still we need to address this.
I wonder whether doing a .extract()
on an empty dataframe makes sense and whether it should be allowed at all..?
@maartenbreddels
The above code is a minimal reproducible example. I got the problem in a code base in which the vaex DataFrame was passed around between functions, so at the point .to_pandas_df()
was invoked there was no easy way to know what was the last operation performed. In .extract()
documentation (https://vaex.readthedocs.io/en/latest/_modules/vaex/dataframe.html#DataFrame.extract) there is a mention The resulting DataFrame may be more efficient to work with when the original DataFrame is heavily filtered which made sense for our code and worked most of the times, just that at some point the filtering was so heavy that the result was empty.
This is difficult to fix, unless we refactor how we handle the schema. It is implicit now, and on demand. I think if we passed down the schema we could fix this, but I'm not sure what that would do. If would say, the workaround for now is
df = df.extract() if len(df) else df
FYI, we will raise an error in the future: https://github.com/vaexio/vaex/pull/2267
We might revisit this sometime down the line, but for now we consider it done. The error message is quite informative I think.
@maartenbreddels @JovanVeljanoski this is a new issue, and is not reproducible in vaex 4.12.. I'm only seeing it in 4.15 in fact.
It seems as though extract on an empty dataframe used to work. It might be worth reopening and seeing what changed?
Did it work in the case of virtual columns (that are not materialized etc.). I don't really see how it could have worked.. since we do not know the dtype before evaluation (if I am not mistaken)
Ok, so, with an extract that dataframe is a bit in limbo, some things can work, depending if you use the virtual column or not. I think #2284 is the better way for now.
Description to_pandas_df throws exception when invoked on empty filter. This happens only when there is a virtual column with a transformation (it will not happen if col1 is not modified see code below)
Software information
import vaex; vaex.__version__)
: {'vaex': '4.12.0', 'vaex-core': '4.12.0', 'vaex-viz': '0.5.3', 'vaex-hdf5': '0.12.3', 'vaex-server': '0.8.1', 'vaex-astro': '0.9.1', 'vaex-jupyter': '0.8.0', 'vaex-ml': '0.18.0'}Additional information Code to reproduce. First two prints work, third one throws exception.
results in: AssertionError for assert filter.sum() == expected_length, in dataset.py, line 1018, in slice