Closed DSoudis closed 4 years ago
I just pushed MixedEncoder to the master branch. Please:
ConvergenceWarning: MixedLM optimization failed, trying a different optimizer may help.
That was fast!!
I did the following:
I reproduced part of my analysis from R and the CV results were very consistent. I also checked the results against statsmodels' module and they were fine. I would like to test this more on a few other personal datasets, but do let me know if you need any tests on open datasets for inclusion here.
Docstrings seem fine to me. I suppose in time you will drop unused parameters such as "sigma" and "random_state".
There is a way to avoid formulas. It's quite simple for the regression case. It takes a bit more effort for the Logit.
import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.genmod.bayes_mixed_glm import BinomialBayesMixedGLM as bgmm
data = sm.datasets.get_rdataset("dietox", "geepack").data
const = np.ones(data.shape[0])
md = sm.MixedLM(endog=data['Weight'], exog=const, groups=data["Pig"])
mdf = md.fit()
print(mdf.summary())
data2 = data.assign(Cu = lambda x: np.where(x['Cu'] != 1, 0, x['Cu']))
exog_vc = pd.get_dummies(data2['Pig'])
ident = np.zeros(exog_vc.shape[1], dtype=int)
gmd = bgmm(endog=data2['Cu'], exog=const, exog_vc=exog_vc, ident=ident).fit_vb()
print(gmd.summary())
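To turn such a logit fit into per-category values, the posterior means of the random effects can be read off the variational result. A sketch on synthetic data (this assumes the `vc_mean` attribute of the `fit_vb` result, which holds the posterior means of the random effects; the data and sizes here are made up for illustration):

```python
import numpy as np
import pandas as pd
from statsmodels.genmod.bayes_mixed_glm import BinomialBayesMixedGLM

# Synthetic binary outcome with a per-category effect on the logit scale.
rng = np.random.default_rng(0)
groups = np.repeat(np.arange(8), 40)
logits = np.linspace(-1.5, 1.5, 8)[groups]
y = rng.binomial(1, 1 / (1 + np.exp(-logits)))

const = np.ones((groups.size, 1))
exog_vc = pd.get_dummies(groups).to_numpy(dtype=float)
ident = np.zeros(exog_vc.shape[1], dtype=int)

res = BinomialBayesMixedGLM(endog=y, exog=const, exog_vc=exog_vc, ident=ident).fit_vb()

# Posterior means of the per-category random effects: the quantities an
# encoder could add to the fixed intercept to form per-category encodings.
category_effects = res.vc_mean
```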
4. Convergence warnings normally show up when the parameter for the random-effects variance is close to zero. Most of the time this is due to the data and not a problem _per se_: if the categories don't vary much among themselves, this parameter tends to zero. So most of the time it is OK to silence it, as you have done. `embed` seems to do the same.
5. Perhaps add a warning if someone uses a binary outcome but does not specify binary_classification = True? They will still get results, but they might not be as accurate.
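Such a check could be as simple as the following sketch. The helper name and the heuristic are hypothetical; `binary_classification` just mirrors the parameter name above, and the actual encoder may detect this differently:

```python
import warnings

import numpy as np

def check_binary_target(y, binary_classification):
    """Warn when a 0/1 target is fit with the linear (Gaussian) mixed model.

    Hypothetical helper: warns if the target takes exactly the two values
    0 and 1 but binary_classification was left False.
    """
    y = np.asarray(y)
    if (not binary_classification
            and np.isin(y, [0, 1]).all()
            and np.unique(y).size == 2):
        warnings.warn(
            "The target looks binary; consider binary_classification=True "
            "so a mixed logit model is used instead of a linear one."
        )

check_binary_target([0, 1, 1, 0], binary_classification=False)   # warns
check_binary_target([0.5, 1.2, 3.4], binary_classification=False)  # silent
```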
Thanks. I updated the code. If you find a bug, open a new issue.
Would you consider adding a target encoding functionality via Mixed Linear models?
This functionality exists in R via the embed package; see here for a discussion, particularly under the "Empirical Bayesian Methods/Mixed Models" section. It is quite similar to the MEstimateEncoder, but there are some advantages I can see:
1) Solid statistical theory behind the technique. Mixed-effects models are a mature branch of statistics.
2) No hyperparameters to tune. The amount of shrinkage is determined automatically through the estimation process. A good reference is Gelman & Hill (2007), "Data Analysis Using Regression and Multilevel/Hierarchical Models" (particularly page 253). In short, the fewer observations a category has and/or the more the outcome varies within a category, the stronger the regularization towards "the prior" or "grand mean".
3) The technique is applicable to both continuous outcomes (via mixed linear regression) and two-class outcomes (via mixed logit regression).
4) Statsmodels has a convenient implementation. Basically: