scikit-learn-contrib / category_encoders

A library of sklearn compatible categorical variable encoders
http://contrib.scikit-learn.org/category_encoders/

feature request: Target Encoding via Mixed Linear Models #242

Closed DSoudis closed 4 years ago

DSoudis commented 4 years ago

Would you consider adding a target encoding functionality via Mixed Linear models?

This functionality exists in R via the embed package; see here for a discussion, particularly under "Empirical Bayesian Methods/Mixed Models".

It is quite similar to the MEstimateEncoder, but there are some advantages I can see:

1) Solid statistical theory behind the technique. Mixed effects models are a mature branch of statistics.

2) No hyperparameters to tune. The amount of shrinkage is automatically determined through the estimation process. A good reference is Gelman & Hill (2007), "Data Analysis Using Regression and Multilevel/Hierarchical Models" (particularly page 253). In short, the fewer observations a category has and/or the more the outcome varies within a category, the stronger the regularization towards the "prior" or "grand mean" (a small sketch of this shrinkage follows the snippet below).

3) The technique is applicable to both continuous outcomes (via mixed linear regression) and two-class outcomes (via mixed logit regression).

4) Statsmodels has a convenient implementation. Basically:

import statsmodels.formula.api as smf
from statsmodels.genmod.bayes_mixed_glm import BinomialBayesMixedGLM as bgmm

# data: a DataFrame holding the outcome and the categorical column named by category_var

# Regression: returns a prediction per observation equal to the (regularized) mean of the observation's category
mdf = smf.mixedlm("outcome ~ 1", data, groups=data[category_var]).fit()
mdf.fittedvalues  # one (regularized) fitted value per row

# Classification: returns (regularized) log odds per category. Needs mapping back to the original vector.
gmd = bgmm.from_formula("outcome ~ 1", {'a': '0 + C(category_var)'}, data).fit_vb()
gmd.random_effects()  # one (regularized) random effect per category
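
To make point 2 concrete, here is a minimal sketch (not the encoder's code) of the partial-pooling estimate described in Gelman & Hill: each category mean is pulled towards the grand mean, and the pull is stronger for small and/or noisy categories. The plug-in variance estimates below are simplifications for illustration; a real mixed model estimates them jointly.

    import numpy as np
    import pandas as pd

    rng = np.random.default_rng(0)

    # toy data: a frequent category "a" and a rare category "b"
    df = pd.DataFrame({
        "cat": ["a"] * 50 + ["b"] * 3,
        "y": np.r_[rng.normal(10, 1, 50), rng.normal(20, 1, 3)],
    })

    grand_mean = df["y"].mean()
    sigma_y2 = df.groupby("cat")["y"].var().mean()   # within-category variance (illustrative plug-in)
    sigma_a2 = df.groupby("cat")["y"].mean().var()   # between-category variance (illustrative plug-in)

    stats = df.groupby("cat")["y"].agg(["mean", "count"])
    # partial pooling: precision-weighted average of the category mean and the grand mean
    shrunk = (stats["count"] / sigma_y2 * stats["mean"] + grand_mean / sigma_a2) \
             / (stats["count"] / sigma_y2 + 1 / sigma_a2)
    print(shrunk)  # "b" (3 rows) is pulled towards the grand mean much more than "a" (50 rows)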
janmotl commented 4 years ago

I just pushed MixedEncoder to the master branch.

Please:

  1. Check it for errors.
  2. Check the documentation.
  3. If you know how to avoid formulas, let me know - the current implementation seems to be fragile.
  4. Advise what to do with convergence warnings like:
    ConvergenceWarning: MixedLM optimization failed, trying a different optimizer may help.
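
For reference, a minimal sketch of one way to handle that warning around the fit (assuming it is statsmodels' ConvergenceWarning and using the dietox example data; not necessarily what the encoder should do):

    import warnings

    import statsmodels.api as sm
    import statsmodels.formula.api as smf
    from statsmodels.tools.sm_exceptions import ConvergenceWarning

    data = sm.datasets.get_rdataset("dietox", "geepack").data

    # silence (or, with "error", escalate) the optimizer warning for this fit only
    with warnings.catch_warnings():
        warnings.simplefilter("ignore", ConvergenceWarning)
        mdf = smf.mixedlm("Weight ~ 1", data, groups=data["Pig"]).fit()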
DSoudis commented 4 years ago

That was fast!!

I did the following:

  1. I reproduced part of my analysis from R and the CV results were very consistent. I also checked the results against the statsmodels module directly and they were fine. I would like to test this more on a few other personal datasets, but do let me know if you need any tests on open datasets for inclusion here.

  2. Docstrings seem fine to me. I suppose in time you will drop unused parameters such as "sigma" and "random state".

  3. There is a way to avoid formulas. It's quite simple for the regression case; it takes a bit more effort for the logit (a sketch for mapping the per-group effects back to the original rows follows this list).

    import numpy as np
    import pandas as pd
    import statsmodels.api as sm
    from statsmodels.genmod.bayes_mixed_glm import BinomialBayesMixedGLM as bgmm

    data = sm.datasets.get_rdataset("dietox", "geepack").data

    # create an intercept to include as the only exog variable
    const = np.ones(data.shape[0])

    # regression is easy
    md = sm.MixedLM(endog=data['Weight'], exog=const, groups=data["Pig"])
    mdf = md.fit()
    print(mdf.summary())

    # for the logit we must also create dummies for the random effects
    # and an ident array that I do not quite grasp...

    # create a binary dummy outcome to fit the model
    data2 = data.assign(Cu=lambda x: np.where(x['Cu'] != 1, 0, x['Cu']))

    # the part below needs to be repeated for every column
    exog_vc = pd.get_dummies(data2['Pig'])

    # ident...
    ident = np.zeros(exog_vc.shape[1], dtype=int)

    gmd = bgmm(endog=data2['Cu'], exog=const, exog_vc=exog_vc, ident=ident).fit_vb()
    print(gmd.summary())

  4. Convergence warnings normally show up when the parameter for the random effects variance is close to zero. Most of the time this is because of the data and not a problem _per se_: if the categories don't vary much among themselves, this parameter tends towards zero. So most of the time it is OK to silence it, as you have done. `embed` seems to be doing the same.

  5. Perhaps add a warning if someone uses a binary outcome and does not specify binary_classification = True? They will still get a result, but it might not be as accurate.
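
As a follow-up to point 3: continuing from that snippet, here is one way to map the per-group log odds back to the original rows (a sketch; it assumes random_effects() returns the posterior means in its first column, ordered like the columns of exog_vc):

    # continuing from the snippet in point 3: map the per-group effects back to the rows
    re_means = np.asarray(gmd.random_effects().iloc[:, 0])  # assumed: column 0 holds the posterior means
    row_log_odds = exog_vc.to_numpy() @ re_means             # one (regularized) log-odds value per row
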
janmotl commented 4 years ago

Thanks. I updated the code. If you find a bug, open a new issue.