scikit-learn-contrib / category_encoders

A library of sklearn compatible categorical variable encoders
http://contrib.scikit-learn.org/category_encoders/
BSD 3-Clause "New" or "Revised" License

Possible SummaryEncoder doc error #338

Closed glevv closed 2 years ago

glevv commented 2 years ago

Expected Behavior

SummaryEncoder should return N * cat_features columns, where N is the number of quantiles used to describe each category. At least, this is what section 2.1 of the original paper states:

A generalization of the quantile encoder is to compute several features corresponding to different quantiles per each categorical feature, instead of a single feature
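To make the quoted idea concrete, here is a minimal pure-Python sketch of "several quantile features per categorical feature". It is not the library's implementation: the helper names `quantile` and `summary_encode` are made up for illustration, the original column is kept as a choice of this sketch, and it leaves out any smoothing toward a global quantile that a real encoder might apply.

```python
from collections import defaultdict


def quantile(values, q):
    """Linear-interpolation quantile of a non-empty list of numbers."""
    ordered = sorted(values)
    pos = q * (len(ordered) - 1)
    lo = int(pos)
    hi = min(lo + 1, len(ordered) - 1)
    return ordered[lo] + (ordered[hi] - ordered[lo]) * (pos - lo)


def summary_encode(rows, target, cols, quantiles):
    """Add one feature per (column, quantile) pair to a copy of each row.

    Hypothetical helper: each new feature holds the q-th quantile of the
    target over all rows that share the row's category in that column.
    """
    encoded = [dict(row) for row in rows]
    for col in cols:
        # Collect the target values observed for every category of this column.
        groups = defaultdict(list)
        for row, y in zip(rows, target):
            groups[row[col]].append(y)
        for q in quantiles:
            name = f"{col}_{int(round(q * 100))}"
            for out, row in zip(encoded, rows):
                out[name] = quantile(groups[row[col]], q)
    return encoded


# Toy data (invented for this example): one categorical column, four rows.
rows = [{"city": "a"}, {"city": "a"}, {"city": "a"}, {"city": "b"}]
target = [1, 2, 3, 10]
encoded = summary_encode(rows, target, cols=["city"], quantiles=[0.25, 0.5, 0.75])
```

With three quantiles, the single categorical column yields three new features (`city_25`, `city_50`, `city_75`), which is the N * cat_features behaviour described above.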

Actual Behavior

The docs example shows SummaryEncoder returning 1 * cat_features columns:

from category_encoders import *
import pandas as pd
from sklearn.datasets import load_boston
bunch = load_boston()
y = bunch.target
X = pd.DataFrame(bunch.data, columns=bunch.feature_names)
enc = SummaryEncoder(cols=["CHAS", "RAD"], quantiles=[0.25, 0.5, 0.75]).fit(X, y)
numeric_dataset = enc.transform(X)
print(numeric_dataset.info())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 506 entries, 0 to 505
Data columns (total 13 columns):
CRIM       506 non-null float64
ZN         506 non-null float64
INDUS      506 non-null float64
CHAS       506 non-null float64
NOX        506 non-null float64
RM         506 non-null float64
AGE        506 non-null float64
DIS        506 non-null float64
RAD        506 non-null float64
TAX        506 non-null float64
PTRATIO    506 non-null float64
B          506 non-null float64
LSTAT      506 non-null float64
dtypes: float64(13)
memory usage: 51.5 KB
None

where it should be something like this:

from category_encoders import *
import pandas as pd
from sklearn.datasets import load_boston
bunch = load_boston()
y = bunch.target
X = pd.DataFrame(bunch.data, columns=bunch.feature_names)
enc = SummaryEncoder(cols=["CHAS", "RAD"], quantiles=[0.25, 0.5, 0.75]).fit(X, y)
numeric_dataset = enc.transform(X)
print(numeric_dataset.info())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 506 entries, 0 to 505
Data columns (total 17 columns):
CRIM       506 non-null float64
ZN         506 non-null float64
INDUS      506 non-null float64
CHAS_25    506 non-null float64
CHAS_50    506 non-null float64
CHAS_75    506 non-null float64
NOX        506 non-null float64
RM         506 non-null float64
AGE        506 non-null float64
DIS        506 non-null float64
RAD_25     506 non-null float64
RAD_50     506 non-null float64
RAD_75     506 non-null float64
TAX        506 non-null float64
PTRATIO    506 non-null float64
B          506 non-null float64
LSTAT      506 non-null float64
dtypes: float64(17)
memory usage: 51.5 KB
None

PaulWestenthanner commented 2 years ago

You're right. @cmougan this was probably a copy-paste error?

cmougan commented 2 years ago

Yes, it's a copy-paste issue.

It returns:

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 506 entries, 0 to 505
Data columns (total 19 columns):
 #   Column   Non-Null Count  Dtype  
---  ------   --------------  -----  
 0   CRIM     506 non-null    float64
 1   ZN       506 non-null    float64
 2   INDUS    506 non-null    float64
 3   CHAS     506 non-null    float64
 4   NOX      506 non-null    float64
 5   RM       506 non-null    float64
 6   AGE      506 non-null    float64
 7   DIS      506 non-null    float64
 8   RAD      506 non-null    float64
 9   TAX      506 non-null    float64
 10  PTRATIO  506 non-null    float64
 11  B        506 non-null    float64
 12  LSTAT    506 non-null    float64
 13  CHAS_25  506 non-null    float64
 14  RAD_25   506 non-null    float64
 15  CHAS_50  506 non-null    float64
 16  RAD_50   506 non-null    float64
 17  CHAS_75  506 non-null    float64
 18  RAD_75   506 non-null    float64

Currently you can't use SummaryEncoder or QuantileEncoder from category_encoders because they have not been released yet. Until a new version of the package is published, you can use the implementation we used for the original paper, available via pip install sktools.
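The 19-column result above differs from the 17 columns proposed in the issue: the original CHAS and RAD columns are kept, and the quantile columns are appended at the end, grouped by quantile. The sketch below reproduces only that column layout. The naming rule (`<column>_<quantile * 100>`) and the ordering are inferred from the posted output, not taken from the library source, and `summary_columns` is a hypothetical helper.

```python
def summary_columns(all_columns, encoded_cols, quantiles):
    """Column layout inferred from the output above (hypothetical helper).

    Original columns come first, untouched; then one column per
    (quantile, encoded column) pair, iterating over quantiles first.
    """
    new_columns = [
        f"{col}_{int(round(q * 100))}"
        for q in quantiles
        for col in encoded_cols
    ]
    return list(all_columns) + new_columns


boston_columns = [
    "CRIM", "ZN", "INDUS", "CHAS", "NOX", "RM", "AGE",
    "DIS", "RAD", "TAX", "PTRATIO", "B", "LSTAT",
]
columns = summary_columns(boston_columns, ["CHAS", "RAD"], [0.25, 0.5, 0.75])
print(len(columns))  # 19 = 13 original + 2 encoded columns * 3 quantiles
```

If the encoded columns replaced the originals instead, the count would be 13 - 2 + 6 = 17, which is the layout suggested in the issue.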

cmougan commented 2 years ago

@PaulWestenthanner maybe we could do a package release?

PaulWestenthanner commented 2 years ago

We definitely should release. Unfortunately I do not have the rights to do so...

cmougan commented 2 years ago

@PaulWestenthanner who does?

wdm0006 commented 2 years ago

@PaulWestenthanner you should have rights. If you update the version in `__init__.py` and the changelog, then go to the releases page on GitHub and draft a new release (tag it with the release number), the GitHub action should take care of the rest.
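One thing that can go wrong with these steps is a tag that does not match the version string in the package. A small pre-release check could look like the sketch below. It assumes the version is stored as a `__version__ = "..."` assignment in `__init__.py`, which this thread does not confirm, and the helper names and version numbers are made up for illustration.

```python
import re


def read_version(init_source):
    """Extract the version from the text of an __init__.py file.

    Assumes a line of the form: __version__ = "x.y.z"
    """
    match = re.search(
        r"^__version__\s*=\s*['\"]([^'\"]+)['\"]", init_source, re.MULTILINE
    )
    if match is None:
        raise ValueError("no __version__ assignment found")
    return match.group(1)


def tag_matches_version(tag, init_source):
    """True if the release tag (with or without a leading 'v') equals the version."""
    return tag.lstrip("v") == read_version(init_source)


# Hypothetical file contents and tags, for illustration only.
example_init = '"""Package docstring."""\n__version__ = "1.2.3"\n'
print(tag_matches_version("v1.2.3", example_init))
print(tag_matches_version("1.2.2", example_init))
```

Running a check like this before drafting the release keeps the tag, the changelog entry, and the packaged version in step.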

PaulWestenthanner commented 2 years ago

Ah, I didn't know that. Sorry that I postponed the release for so long. It worked like a charm, though. The new version is visible on PyPI. Thanks a lot @wdm0006!