Quadratic time intersection on Pandas categories

willsthompson commented 1 year ago

Expected Behavior

Using pandas CategoricalDtype columns with OrdinalEncoder should not incur a performance penalty.

Pandas' internal categories are intersected with your computed categories in quadratic time, here

Create a Series with a large number of categories, e.g.

import pandas as pd

categories = [f"Cat{i}" for i in range(10000)]
series = pd.Series(
  categories,
  dtype=pd.CategoricalDtype(categories=categories, ordered=True),  # must be passed by keyword; the second positional argument is the index
)

Subsystem:

This would be a very simple one line change:

categories = list(set(categories).intersection(set(X[col].dtype.categories)))
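
One thing to watch with the set-based one-liner: a Python set does not guarantee iteration order, and order may matter for an ordinal encoding. As a minimal sketch, assuming the goal is to keep the order of the computed categories, only the lookup side can be turned into a set. The helper name `ordered_intersection` and the sample data are hypothetical and not part of the library; swap the arguments if the dtype's order is the one that should be preserved.

```python
from typing import Hashable, Iterable, List


def ordered_intersection(
    computed: Iterable[Hashable],
    dtype_categories: Iterable[Hashable],
) -> List[Hashable]:
    """Return items of `computed` that also occur in `dtype_categories`,
    keeping the order in which they appear in `computed`."""
    # Building the set is linear; each membership test is O(1) on average,
    # so the whole intersection is linear instead of quadratic.
    lookup = set(dtype_categories)
    return [c for c in computed if c in lookup]


# Hypothetical data: two overlapping category lists of 10,000 items each.
computed = [f"Cat{i}" for i in range(10000)]
dtype_categories = [f"Cat{i}" for i in range(5000, 15000)]
result = ordered_intersection(computed, dtype_categories)
```

With pandas, the second argument could be `X[col].dtype.categories`, since a pandas Index is iterable.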

I'd be happy to put a PR together if this looks okay to you.

PaulWestenthanner commented 1 year ago

Hi @willsthompson, thanks for pointing that out. Please go ahead and create a PR.

willsthompson commented 1 year ago