melff / mclogit

mclogit: Multinomial Logit Models, with or without Random Effects or Overdispersion
http://melff.github.io/mclogit/

Feature request: allow some form of regularisation in multinomial fits? #30

Open tomwenseleers opened 1 year ago

tomwenseleers commented 1 year ago

More of a feature request: do you think it would be possible to allow some form of regularisation in mblogit multinomial fits? When dealing with sparse data, one now often runs into complete separation, which results in diverging coefficients and very large standard errors. Some form of regularisation would resolve this.

I understand that for logistic regression a ridge (L2-norm) penalty on the coefficients can be added just by changing the least-squares step of the IRLS algorithm (https://bwlewis.github.io/GLM/) from an ordinary to a ridge regression. This can be done by row augmentation: append to the covariate matrix X a p x p diagonal matrix with sqrt(lambdas) along the diagonal (p = number of columns of X) and append p zeros to the outcome variable. Here lambdas = lambda * penalty weights is a vector of L2 penalty coefficients, with the penalty weights typically set to 0 for the intercept and 1 for the rest, or, for adaptive ridge regression, to 1/abs(coef) or 1/coef^2, where coef are coefficients obtained e.g. from an earlier ridge fit (see https://stats.stackexchange.com/questions/69205/how-to-derive-the-ridge-regression-solution and https://stats.stackexchange.com/questions/137057/ridge-penalized-glms-using-row-augmentation).

I presume the same recipe carries over to multinomial regression, in which case incorporating such a feature might not be too hard: it could just be a matter of changing line 37 in https://github.com/melff/mclogit/blob/master/pkg/R/mclogit-fit.R to a weighted least-squares ridge regression.
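To make the row-augmentation recipe concrete, here is a minimal sketch for the binomial case (my own illustration based on the links above, not mclogit code; it assumes the IRLS weights stay away from zero):

```r
# Sketch: IRLS for logistic regression with a ridge (L2) penalty via row
# augmentation. 'lambdas' is a vector of per-coefficient penalties
# (0 = unpenalized, e.g. for the intercept).
ridge_logistic_irls <- function(X, y, lambdas, maxit = 25, tol = 1e-8) {
  p <- ncol(X)
  beta <- rep(0, p)
  X_pen <- diag(sqrt(lambdas), nrow = p)  # p pseudo-rows, sqrt(lambda_j) on the diagonal
  for (it in seq_len(maxit)) {
    eta <- drop(X %*% beta)
    mu  <- plogis(eta)                    # fitted probabilities
    w   <- mu * (1 - mu)                  # IRLS weights
    z   <- eta + (y - mu) / w             # working response
    # augmented least squares: the pseudo-rows have response 0, which adds
    # the penalty sum(lambdas * beta^2) to the weighted sum of squares
    Xa <- rbind(X * sqrt(w), X_pen)
    za <- c(sqrt(w) * z, rep(0, p))
    beta_new <- qr.solve(Xa, za)
    if (max(abs(beta_new - beta)) < tol) { beta <- beta_new; break }
    beta <- beta_new
  }
  beta
}
```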

As a concrete use case: in a multinomial model looking at SARS-CoV-2 lineage dynamics through time in different countries & continents, lineage ~ date + date:continent + date:country + country, it would be nice if I could penalize just the date:country logistic slopes and leave all the intercepts & the per-continent logistic slopes unpenalized, in order to shrink the per-country slopes a little towards those of the continent that each country belongs to.
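Constructing the penalty weights for such a model could be as simple as matching coefficient names; a hypothetical illustration (dat stands in for the lineage data, and for a multinomial fit the weights would be replicated across response categories):

```r
# penalize only the date:country slopes; the intercept, country and
# per-continent terms get penalty weight 0
X <- model.matrix(~ date + date:continent + date:country + country, data = dat)
penalty_weights <- as.numeric(grepl("date:country", colnames(X)))
lambda  <- 10                           # overall strength, e.g. via cross-validation
lambdas <- lambda * penalty_weights
```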

Do you think such a feature could be incorporated in the future?

I saw that the MGLM package tried to incorporate LASSO penalties for multinomial models in its function https://www.rdocumentation.org/packages/MGLM/versions/0.2.1/topics/MGLMsparsereg, but I had no luck getting it to work (and it also doesn't provide a vcov method). And glmnet of course allows elastic-net regularised multinomial models, but it doesn't return the variance-covariance matrix and doesn't have a formula interface.

On regularised (L1/LASSO or L2/ridge penalized) multinomial models I also came across https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=1424458, though I believe the variance-covariance matrix only has a closed-form solution in the ridge-penalized case.

melff commented 1 year ago

I was thinking about adapting Firth's penalized likelihood to multinomial logit, since it also seems to help with separation. However, the required third derivatives are quite tedious in the multinomial case. So some sort of ridge or LASSO penalty may be worth looking into. Yet I would have to familiarize myself with these and that may take some time. Any pointers to relevant literature would be helpful.

tomwenseleers commented 1 year ago

For Firth-penalized multinomial regression there is a now-archived R package, https://github.com/cran/pmlr, but it was very slow when I tried it; I also saw this article: https://www.sciencedirect.com/science/article/pii/S1755534518300836 (as I understand it, Firth's correction amounts to adding a Jeffreys prior to stabilize the coefficients). In terms of articles on L1/LASSO or L2/ridge penalized multinomial regression (corresponding to Laplace and Gaussian priors, respectively), this one seemed quite good: https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=1424458. I haven't seen that algorithm ported to R, though there are Matlab, C and C# implementations here:
https://github.com/slavomir-sidor/princeton-mvpa-toolbox/blob/master/core/learn/smlr.m
https://github.com/slavomir-sidor/princeton-mvpa-toolbox/blob/master/core/learn/smlr_mex.c
https://github.com/inzxx/adastra/blob/master/src/Accord24/Accord.Statistics/Models/Regression/Nonlinear/Fitting/LowerBoundNewtonRaphson.cs

That code seems to implement only the LASSO-penalized variant (which gives a sparse solution, with the downside that there is no closed-form solution for the vcov matrix; it could still be obtained numerically), but the article also describes how to put in ridge penalties, in which case the vcov matrix does have a closed-form solution.

To calculate the variance-covariance matrix of multinomial regressions I wrote a fast Rcpp function a while ago, which uses Kronecker products to make things go fast (there for a model without any ridge penalty, but that would be easy to add, I believe): https://stackoverflow.com/questions/73811835/faster-way-to-calculate-the-hessian-fisher-information-matrix-of-a-nnetmulti
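The structure of that matrix is simple enough to sketch in plain R (a block-wise version rather than the Kronecker-product one; the function name and interface are illustrative):

```r
# Sketch: vcov of a baseline-category multinomial logit as the inverse Fisher
# information. P = n x (K-1) matrix of fitted non-reference probabilities,
# X = n x p model matrix, wt = optional counts/weights.
multinom_fisher_vcov <- function(X, P, wt = rep(1, nrow(X))) {
  K1 <- ncol(P)
  p  <- ncol(X)
  H  <- matrix(0, K1 * p, K1 * p)
  for (j in seq_len(K1)) {
    for (k in seq_len(K1)) {
      # (j,k) block is X' W_jk X with W_jk = diag(wt * p_j * (1[j == k] - p_k))
      w <- wt * P[, j] * ((j == k) - P[, k])
      idx_j <- (j - 1) * p + seq_len(p)
      idx_k <- (k - 1) * p + seq_len(p)
      H[idx_j, idx_k] <- crossprod(X, X * w)
    }
  }
  # for a ridge fit one would add the penalty's curvature here,
  # e.g. H + 2 * diag(rep(lambdas, K1)), before inverting
  solve(H)
}
```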

I personally think that a multinomial regression with just ridge penalties added (a lambda plus adaptive penalty weights, i.e. using a vector lambdas = lambda * penalty weights) would already be really nice. To get that, changing line 37 in https://github.com/melff/mclogit/blob/master/pkg/R/mclogit-fit.R from a regular weighted least-squares regression to a weighted least-squares ridge regression might be enough (taking care to leave all intercept terms unpenalized), plus some small changes to the code used to calculate the vcov matrix; see the sketch below. This would not result in a sparse model, but the vcov matrix would have a closed-form solution. A LASSO-regularised fit would be sparse, but then there is no closed-form solution for the vcov matrix, so that would have to be obtained numerically.
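A minimal sketch of what that penalized weighted-least-squares step could look like (my own illustration, not mclogit internals):

```r
# Sketch: one penalized WLS step inside an IRLS loop, solving the ridge
# normal equations (X'WX + diag(lambdas)) beta = X'W z, with
# lambdas = lambda * penalty_weights and weight 0 for intercept columns.
wls_ridge_step <- function(X, z, w, lambdas) {
  XtW <- t(X * w)                         # X'W, with w the vector of IRLS weights
  drop(solve(XtW %*% X + diag(lambdas, nrow = ncol(X)), XtW %*% z))
}
```

The closed-form ridge vcov mentioned above would then be, e.g., the usual penalized-likelihood sandwich A^{-1} (X'WX) A^{-1} with A = X'WX + diag(lambdas), or the simpler approximation A^{-1}.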

In some articles an adaptive ridge penalty is iterated to shrink some variables all the way to zero and thereby approach L0-penalized GLMs (i.e. with an appropriately chosen lambda, e.g. picked via cross-validation, this performs best-subset selection); the iteration is sketched below. See:
https://www.sciencedirect.com/science/article/abs/pii/S037837582030135X
https://www.tandfonline.com/doi/full/10.1080/03610926.2022.2028841
https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0148620
https://www.sciencedirect.com/science/article/pii/S0047259X17305067
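The iteration itself is short (illustrative only; fit_ridge could be any penalized fitter, e.g. the ridge_logistic_irls sketch above, and eps follows the usual adaptive-ridge approximation of the L0 penalty):

```r
# Sketch: adaptive ridge iterated towards an approximate L0 penalty, by
# refitting with per-coefficient penalty weights 1 / (beta^2 + eps^2)
adaptive_ridge <- function(X, y, lambda, fit_ridge, eps = 1e-4, maxit = 20) {
  p    <- ncol(X)
  pw   <- rep(1, p)                       # start from a plain ridge fit
  beta <- rep(Inf, p)
  for (it in seq_len(maxit)) {
    beta_new <- fit_ridge(X, y, lambda * pw)
    if (max(abs(beta_new - beta)) < 1e-8) { beta <- beta_new; break }
    beta <- beta_new
    pw <- 1 / (beta^2 + eps^2)            # small coefficients get huge penalties
  }
  list(coef = beta, penalty_weights = pw)
}
```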

In terms of available code or other attempts to implement regularised multinomial models, these may also be relevant (though I didn't have much luck getting them to work on my problems):
https://github.com/villardon/MultBiplotR/blob/master/R/RidgeMultinomialLogisticRegression.R
https://cran.r-project.org/web/packages/MGLM/index.html
https://github.com/zeemkr/ncpen