scikit-learn-contrib / category_encoders

A library of sklearn compatible categorical variable encoders
http://contrib.scikit-learn.org/category_encoders/
BSD 3-Clause "New" or "Revised" License

Quadratic time intersection on Pandas categories #407

Closed: willsthompson closed this 1 year ago

willsthompson commented 1 year ago

Expected Behavior

Using Pandas CategoricalDtype columns with OrdinalEncoder does not incur a performance penalty.

Actual Behavior

Pandas' internal categories are intersected with your computed categories in quadratic time, here:

https://github.com/scikit-learn-contrib/category_encoders/blob/45110027b263363b53da8564b912e7ca6d5546e8/category_encoders/ordinal.py#L230-L232
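To illustrate the kind of cost being described, here is a minimal stdlib-only sketch. It is not the code at the linked lines; it assumes the intersection is an order-preserving filter that tests membership against a list, and the function names `intersect_quadratic` and `intersect_linear` are hypothetical. Swapping the list for a set keeps the result identical while making each membership test constant time on average (this assumes the category values are hashable).

```python
import timeit


def intersect_quadratic(dtype_categories, observed):
    # `observed` is a list, so every `in` check scans it linearly:
    # total work is O(len(dtype_categories) * len(observed)).
    return [c for c in dtype_categories if c in observed]


def intersect_linear(dtype_categories, observed):
    # Build a set once; each `in` check is then O(1) on average,
    # so total work is O(len(dtype_categories) + len(observed)).
    observed_set = set(observed)
    return [c for c in dtype_categories if c in observed_set]


# Hypothetical data mirroring the reproduction steps, kept smaller so it runs quickly.
dtype_categories = [f"Cat{i}" for i in range(2000)]
observed = list(reversed(dtype_categories))

slow = intersect_quadratic(dtype_categories, observed)
fast = intersect_linear(dtype_categories, observed)

# Both variants keep the order of `dtype_categories`.
assert slow == fast == dtype_categories

t_list = timeit.timeit(lambda: intersect_quadratic(dtype_categories, observed), number=1)
t_set = timeit.timeit(lambda: intersect_linear(dtype_categories, observed), number=1)
print(f"list membership: {t_list:.4f}s, set membership: {t_set:.4f}s")
```

The absolute timings depend on the machine, but the gap between the two variants should widen as the number of categories grows.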

Steps to Reproduce the Problem

  1. Create a Series with a large number of categories, e.g.
    import pandas as pd

    categories = [f"Cat{i}" for i in range(10000)]
    series = pd.Series(
      categories,
      dtype=pd.CategoricalDtype(categories=categories, ordered=True),
    )
  2. Apply ordinal encoder to the series

Specifications

I'd be happy to get a PR together if this looks okay to you

PaulWestenthanner commented 1 year ago

Hi @willsthompson, thanks for pointing that out. Please go ahead and create a PR.

willsthompson commented 1 year ago

Closing re: https://github.com/scikit-learn-contrib/category_encoders/pull/409#issuecomment-1548610903