vaexio / vaex

Out-of-Core hybrid Apache Arrow/NumPy DataFrame for Python, ML, visualization and exploration of big tabular data at a billion rows per second 🚀
https://vaex.io
MIT License

[BUG-REPORT] `Column.describe_categorical` raises when column names start with numbers #2138

Open honno opened 2 years ago

honno commented 2 years ago

Weird bug(?) I stumbled upon when using numeric strings as names for categorical columns, and then trying to use the interchange protocol on them.

>>> import numpy as np
>>> import vaex
>>> df = vaex.from_items(("42", np.asarray([3, 1, 1, 2, 0])))
>>> df = df.categorize("42")
>>> interchange_df = df.__dataframe__()
>>> interchange_col = interchange_df.get_column_by_name("42")
>>> interchange_col.describe_categorical
File .../vaex/dataframe_protocol.py:434, in _VaexColumn.describe_categorical(self)
    416 """
    417 If the dtype is categorical, there are two options:
    418 
   (...)
    431                   None if not a dictionary-style categorical.
    432 """
    433 if not self.dtype[0] == _DtypeKind.CATEGORICAL:
--> 434     raise TypeError("`describe_categorical only works on a column with " "categorical dtype!")
    436 ordered = False
    437 is_dictionary = True
TypeError: `describe_categorical only works on a column with categorical dtype!

This works fine (well, apart from #2113) if, say, the name starts with a letter:

>>> df = vaex.from_items(("a42", np.asarray([3, 1, 1, 2, 0])))
>>> df = df.categorize("a42")
>>> interchange_df = df.__dataframe__()
>>> interchange_col = interchange_df.get_column_by_name("a42")
>>> interchange_col.describe_categorical
(False, True, {0: 0, 1: 1, 2: 2, 3: 3})

Using local build of upstream master

cega-000 commented 1 year ago

We've also encountered this bug with selections: the string passed as the "name" of a selection cannot begin with a digit.

import vaex
df = vaex.example()

df.select(df.x < 0.1, name="1", mode="replace")
df.count(
    "*",
    binby="E",
    shape=1024,
    limits=[0, 10],
    selection="1",
    delay=True,
)
df.execute()

Running the above code gives the following error:

File ~/opt/anaconda3/lib/python3.9/site-packages/vaex/scopes.py:198, in _BlockScope.__getitem__(self, variable)
    197 if variable not in self.values:
--> 198     raise KeyError("Unknown variables or column: %r" % (variable,))
    200 return self.values[variable]

KeyError: "Unknown variables or column: '(x < 0.1)'"

However, putting a letter in front of the name avoids the issue. Here we just added an 'a' before the number, and it works fine:

import vaex
df = vaex.example()

df.select(df.x < 0.1, name="a1", mode="replace")
df.count(
    "*",
    binby="E",
    shape=1024,
    limits=[0, 10],
    selection="a1",
    delay=True,
)
df.execute()
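
For anyone who wants to apply the "put a letter in front" workaround systematically, here is a minimal sketch. It rests on an assumption this thread has not confirmed: that the failing names are the ones that are not valid Python identifiers ("42" and "1" are not, "a42" and "a1" are). The helper `safe_name` and its `c_` prefix are hypothetical and not part of vaex; the sketch uses only the standard library.

```python
import keyword


def safe_name(name: str, prefix: str = "c_") -> str:
    """Return `name` if it is a valid Python identifier, else a prefixed version.

    Hypothetical helper, not part of vaex. It assumes the problematic names
    are exactly those that are not valid identifiers (e.g. starting with a digit).
    """
    # Names like "a42" are already fine and are returned unchanged.
    if name.isidentifier() and not keyword.iskeyword(name):
        return name
    candidate = prefix + name
    # A prefix cannot repair names containing spaces, dashes, etc.
    if not candidate.isidentifier():
        raise ValueError(f"cannot derive a valid identifier from {name!r}")
    return candidate


# The names from the two reports above:
for original in ["42", "a42", "1", "a1"]:
    print(repr(original), "->", repr(safe_name(original)))
```

The result could then be passed wherever the reports above use a name, e.g. `name=safe_name("1")` and `selection=safe_name("1")`, or as the column name handed to `vaex.from_items` and `df.categorize`. This only sidesteps the problem; it does not fix the underlying handling of such names in vaex.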