Closed willsthompson closed 1 year ago
Expected Behavior
Using pandas CategoricalDtype columns with OrdinalEncoder should not incur a performance penalty.
Actual Behavior
pandas' internal categories are intersected with the categories computed by the encoder in quadratic time here:
https://github.com/scikit-learn-contrib/category_encoders/blob/45110027b263363b53da8564b912e7ca6d5546e8/category_encoders/ordinal.py#L230-L232
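Steps to Reproduce the Problem
    import pandas as pd

    # 10,000 distinct categories, declared up front on an ordered categorical dtype
    categories = [f"Cat{i}" for i in range(10000)]
    series = pd.Series(
        categories,
        dtype=pd.CategoricalDtype(categories=categories, ordered=True),
    )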
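To make the quadratic growth visible without installing the encoder, here is a minimal timing sketch. It assumes the linked lines filter the dtype's categories with a membership test against a plain list of observed values; `intersect_with_list`, `build_series`, and `time_call` are hypothetical helpers written for this illustration, not part of category_encoders.

```python
import time

import pandas as pd


def build_series(n):
    # Same shape as the reproduction above: n distinct ordered categories.
    categories = [f"Cat{i}" for i in range(n)]
    return pd.Series(
        categories,
        dtype=pd.CategoricalDtype(categories=categories, ordered=True),
    )


def intersect_with_list(dtype_categories, observed):
    # Assumed shape of the slow path: `c in observed` scans a list,
    # so the whole filter costs O(len(dtype_categories) * len(observed)).
    return [c for c in dtype_categories if c in observed]


def time_call(func, *args):
    start = time.perf_counter()
    func(*args)
    return time.perf_counter() - start


if __name__ == "__main__":
    for n in (1_000, 2_000, 4_000):
        series = build_series(n)
        observed = list(series.unique())
        elapsed = time_call(intersect_with_list, series.dtype.categories, observed)
        print(f"n={n}: {elapsed:.3f}s")
```

If the assumption holds, doubling `n` should roughly quadruple the reported time.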
Proposed fix
This would be a very simple one-line change:
categories = list(set(categories).intersection(set(X[col].dtype.categories)))
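One caveat with the set-based one-liner: a Python set is unordered, so converting the intersection back to a list does not guarantee the order declared on the dtype. A possible variant, sketched here with a hypothetical helper name (`ordered_intersection`), keeps the linear-time lookup but walks the dtype's categories in their declared order.

```python
import pandas as pd


def ordered_intersection(dtype_categories, observed):
    # Build the lookup set once; each membership test is then O(1) on average.
    observed_set = set(observed)
    # Iterating the dtype's categories preserves their declared order.
    return [c for c in dtype_categories if c in observed_set]


dtype = pd.CategoricalDtype(categories=["low", "mid", "high"], ordered=True)
column = pd.Series(["high", "low", "high"], dtype=dtype)

kept = ordered_intersection(column.dtype.categories, column.unique())
print(kept)  # ['low', 'high']
```

Whether the encoder relies on that order is not established in this thread, so treat this as an option to weigh rather than a required change.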
I'd be happy to get a PR together if this looks okay to you.
Hi @willsthompson, thanks for pointing that out. Please go ahead and create a PR.
Closing re: https://github.com/scikit-learn-contrib/category_encoders/pull/409#issuecomment-1548610903