vaexio / vaex

Out-of-Core hybrid Apache Arrow/NumPy DataFrame for Python, ML, visualization and exploration of big tabular data at a billion rows per second 🚀
https://vaex.io
MIT License

[BUG-REPORT] Filtering breaks negative indexing #2153

Closed: karotchykau closed this issue 2 years ago

karotchykau commented 2 years ago

Description

The following code

import pandas as pd
import vaex

p_df = pd.DataFrame({"A": ["abc"] * 100})
df = vaex.from_pandas(p_df)
f_df = df[df["A"] == "abc"]

f_df[99]  # Works fine.
f_df[-1]  # Throws an error (same for any negative number).

throws an AssertionError:

---------------------------------------------------------------------------
AssertionError                            Traceback (most recent call last)
Input In [3], in <cell line: 6>()
      3 f_df = df[df["A"] == "abc"]
      5 f_df[99]  # Works fine.
----> 6 f_df[-1]

File ~/mambaforge/envs/tmp_env/lib/python3.9/site-packages/vaex/dataframe.py:5337, in DataFrame.__getitem__(self, item)
   5335 if isinstance(item, int):
   5336     names = self.get_column_names()
-> 5337     return [self.evaluate(name, item, item+1, array_type='python')[0] for name in names]
   5338 elif isinstance(item, six.string_types):
   5339     if hasattr(self, item) and isinstance(getattr(self, item), Expression):

File ~/mambaforge/envs/tmp_env/lib/python3.9/site-packages/vaex/dataframe.py:5337, in <listcomp>(.0)
   5335 if isinstance(item, int):
   5336     names = self.get_column_names()
-> 5337     return [self.evaluate(name, item, item+1, array_type='python')[0] for name in names]
   5338 elif isinstance(item, six.string_types):
   5339     if hasattr(self, item) and isinstance(getattr(self, item), Expression):

File ~/mambaforge/envs/tmp_env/lib/python3.9/site-packages/vaex/dataframe.py:3090, in DataFrame.evaluate(self, expression, i1, i2, out, selection, filtered, array_type, parallel, chunk_size, progress)
   3088     return self.evaluate_iterator(expression, s1=i1, s2=i2, out=out, selection=selection, filtered=filtered, array_type=array_type, parallel=parallel, chunk_size=chunk_size, progress=progress)
   3089 else:
-> 3090     return self._evaluate_implementation(expression, i1=i1, i2=i2, out=out, selection=selection, filtered=filtered, array_type=array_type, parallel=parallel, chunk_size=chunk_size, progress=progress)

File ~/mambaforge/envs/tmp_env/lib/python3.9/site-packages/vaex/dataframe.py:6362, in DataFrameLocal._evaluate_implementation(self, expression, i1, i2, out, selection, filtered, array_type, parallel, chunk_size, raw, progress)
   6360     mask = self._selection_masks[FILTER_SELECTION_NAME]
   6361     i1, i2 = mask.indices(i1, i2-1)
-> 6362     assert i1 != -1
   6363     i2 += 1
   6364 # TODO: performance: can we collapse the two trims in one?

AssertionError: 

Software information

Additional information

No additional information to add.
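Possible workaround

The traceback shows the integer being passed straight through as `item, item+1`, so on affected versions it may help to translate a negative index into a non-negative one before indexing. Below is a minimal sketch of that idea. `normalize_index` is a hypothetical helper, not part of vaex, and the usage line assumes that `len(f_df)` returns the number of rows after filtering.

```python
def normalize_index(index: int, length: int) -> int:
    """Translate a possibly negative row index into a non-negative one.

    Hypothetical helper (not part of vaex); mirrors how Python sequences
    interpret negative indices.
    """
    if not -length <= index < length:
        raise IndexError(f"index {index} is out of range for {length} rows")
    return index + length if index < 0 else index


# Assumed usage with the filtered DataFrame from the report:
#     row = f_df[normalize_index(-1, len(f_df))]
print(normalize_index(-1, 100))  # prints 99, the index that already works
```

This only sidesteps the assertion on the caller's side; it does not change how vaex itself handles the index.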

JovanVeljanoski commented 2 years ago

Thanks for the report! I hope we can fix it soon.

maartenbreddels commented 2 years ago

For the record, this is somewhat related to https://github.com/vaexio/vaex/issues/2123

maartenbreddels commented 2 years ago

OK, this now works on master, so it was probably fixed in #2123.

maartenbreddels commented 2 years ago

Should be included in 4.11.1

maartenbreddels commented 2 years ago

OK, I was too quick: this is actually fixed in https://github.com/vaexio/vaex/pull/2163 and will be released in the next version!