scikit-learn-contrib / category_encoders

A library of sklearn compatible categorical variable encoders
http://contrib.scikit-learn.org/category_encoders/

feature request: Target Encoding via Mixed Linear Models #242

Closed DSoudis closed 4 years ago

DSoudis commented 4 years ago

Would you consider adding a target encoding functionality via Mixed Linear models?

This functionality exists in R via the embed package; see here for a discussion, particularly under "Empirical Bayesian Methods/Mixed Models".

It is quite similar to the MEstimateEncoder, but there are some advantages I can see:

1) Solid statistical theory behind the technique. Mixed effects models are a mature branch of statistics.

2) No hyperparameters to tune. The amount of shrinkage is automatically determined through the estimation process. A good reference is Gelman & Hill (2007), "Data Analysis Using Regression and Multilevel/Hierarchical Models" (particularly page 253). In short, the fewer observations a category has and/or the more the outcome varies within a category, the stronger the regularization towards the "prior" or "grand mean" (a small sketch of this shrinkage follows the snippet below).

3) The technique is applicable to both continuous outcomes (via mixed linear regression) and two-class outcomes (via mixed logit regression).

4) Statsmodels has a convenient implementation. Basically:

import statsmodels.formula.api as smf
from statsmodels.genmod.bayes_mixed_glm import BinomialBayesMixedGLM as bgmm

# data: a DataFrame holding the outcome and the categorical column named by category_var

# Regression: returns a prediction per observation equal to the (regularized) mean of the observation's category
mdf = smf.mixedlm("outcome ~ 1", data, groups=data[category_var]).fit()
mdf.fittedvalues  # one (regularized) fitted value per row

# Classification: returns (regularized) log odds per category. Needs mapping back to the original vector.
gmd = bgmm.from_formula("outcome ~ 1", {'a': '0 + C(category_var)'}, data).fit_vb()
gmd.random_effects()  # one (regularized) random effect per category
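
To make point 2 concrete, here is a minimal sketch (not the encoder's code) of the partial-pooling estimate described in Gelman & Hill: each category mean is pulled towards the grand mean, and the pull is stronger for small and/or noisy categories. The plug-in variance estimates below are simplifications for illustration; a real mixed model estimates them jointly.

    import numpy as np
    import pandas as pd

    rng = np.random.default_rng(0)

    # toy data: a frequent category "a" and a rare category "b"
    df = pd.DataFrame({
        "cat": ["a"] * 50 + ["b"] * 3,
        "y": np.r_[rng.normal(10, 1, 50), rng.normal(20, 1, 3)],
    })

    grand_mean = df["y"].mean()
    sigma_y2 = df.groupby("cat")["y"].var().mean()   # within-category variance (illustrative plug-in)
    sigma_a2 = df.groupby("cat")["y"].mean().var()   # between-category variance (illustrative plug-in)

    stats = df.groupby("cat")["y"].agg(["mean", "count"])
    # partial pooling: precision-weighted average of the category mean and the grand mean
    shrunk = (stats["count"] / sigma_y2 * stats["mean"] + grand_mean / sigma_a2) \
             / (stats["count"] / sigma_y2 + 1 / sigma_a2)
    print(shrunk)  # "b" (3 rows) is pulled towards the grand mean much more than "a" (50 rows)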
janmotl commented 4 years ago

I just pushed MixedEncoder to the master branch.

Please:

  1. Check it for errors.
  2. Check the documentation.
  3. If you know how to avoid formulas, let me know - the current implementation seems to be fragile.
  4. Advise what to do with convergence warnings like:
    ConvergenceWarning: MixedLM optimization failed, trying a different optimizer may help.
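
For reference, a minimal sketch of one way to handle that warning around the fit (assuming it is statsmodels' ConvergenceWarning and using the dietox example data; not necessarily what the encoder should do):

    import warnings

    import statsmodels.api as sm
    import statsmodels.formula.api as smf
    from statsmodels.tools.sm_exceptions import ConvergenceWarning

    data = sm.datasets.get_rdataset("dietox", "geepack").data

    # silence (or, with "error", escalate) the optimizer warning for this fit only
    with warnings.catch_warnings():
        warnings.simplefilter("ignore", ConvergenceWarning)
        mdf = smf.mixedlm("Weight ~ 1", data, groups=data["Pig"]).fit()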
DSoudis commented 4 years ago

That was fast!!

I did the following:

  1. I reproduced part of my analysis from R and the CV results were very consistent. I also checked the results against the statsmodels module directly and they were fine. I would like to test this more on a few other personal datasets, but do let me know if you need any tests on open datasets for inclusion here.

  2. Docstrings seem fine to me. I suppose in time you will drop unused parameters such as "sigma" and "random state".

  3. There is a way to avoid formulas. It's quite simple for the regression case; it takes a bit more effort for the logit (a sketch for mapping the per-group effects back to the original rows follows this list).

    import numpy as np
    import pandas as pd
    import statsmodels.api as sm
    from statsmodels.genmod.bayes_mixed_glm import BinomialBayesMixedGLM as bgmm

    data = sm.datasets.get_rdataset("dietox", "geepack").data

    # create an intercept to include as the only exog variable
    const = np.ones(data.shape[0])

    # regression is easy
    md = sm.MixedLM(endog=data['Weight'], exog=const, groups=data["Pig"])
    mdf = md.fit()
    print(mdf.summary())

    # for the logit we must also create dummies for the random effects
    # and an ident array that I do not quite grasp...

    # create a binary dummy outcome to fit the model
    data2 = data.assign(Cu=lambda x: np.where(x['Cu'] != 1, 0, x['Cu']))

    # the part below needs to be repeated for every column
    exog_vc = pd.get_dummies(data2['Pig'])

    # ident...
    ident = np.zeros(exog_vc.shape[1], dtype=int)

    gmd = bgmm(endog=data2['Cu'], exog=const, exog_vc=exog_vc, ident=ident).fit_vb()
    print(gmd.summary())

  4. Convergence warnings normally show up when the parameter for the random effects variance is close to zero. Most of the time this is because of the data and not a problem _per se_: if the categories don't vary much among themselves, this parameter tends towards zero. So most of the time it is OK to silence it, as you have done. `embed` seems to be doing the same.

  5. Perhaps add a warning if someone uses a binary outcome and does not specify binary_classification = True? They will still get a result, but it might not be as accurate.
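
As a follow-up to point 3: continuing from that snippet, here is one way to map the per-group log odds back to the original rows (a sketch; it assumes random_effects() returns the posterior means in its first column, ordered like the columns of exog_vc):

    # continuing from the snippet in point 3: map the per-group effects back to the rows
    re_means = np.asarray(gmd.random_effects().iloc[:, 0])  # assumed: column 0 holds the posterior means
    row_log_odds = exog_vc.to_numpy() @ re_means             # one (regularized) log-odds value per row
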
janmotl commented 4 years ago

Thanks. I updated the code. If you find a bug, open a new issue.