diegodebrito opened 3 years ago
Adding a small L2 penalization stabilizes it in some cases, but the parameters still jump suddenly to zero, and alpha=0 does not produce correct results.
It could be that the Tweedie loglike is not accurate enough. The default fit does not need to use loglike for the optimization.
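For reference, a sketch of the objective that fit_regularized minimizes, as described in the fit_elasticnet docstring (the exact normalization may differ across statsmodels versions):

-loglike(params) / nobs + alpha * ((1 - L1_wt) * |params|_2^2 / 2 + L1_wt * |params|_1)

so L1_wt=0.99 below mixes a small ridge term into an almost pure lasso penalty.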
I added a constant to the model (x = sm.add_constant(x)), but that doesn't change the behavior.
print("fit", model2.fit().params)
for w in np.linspace(0, 0.08, 9):
result_ = model2.fit_regularized(L1_wt=0.99, alpha=w, maxiter=200,
cnvrg_tol=0.0001)
print(w, result_.params)
fit [-0.01422211 1.00826417 -1.00142452 0.00863553 0.50976335]
0.0 [-89.70510123 0. 0. 0. 0. ]
0.01 [ 0. 1.00207192 -0.9959421 0.00504013 0.50551591]
0.02 [ 0. 1.00211793 -0.99571097 0.00502683 0.50552728]
0.03 [ 0. 1.00236479 -0.99468019 0. 0.50592973]
0.04 [ 0. 1.00248133 -0.99396627 0. 0.50592665]
0.05 [ 0. 1.00258026 -0.99325759 0. 0.50545311]
0.06 [ 0. 1.00260533 -0.99279867 0.00382331 0.50470115]
0.07 [ 0. 1.00277812 -0.9918399 0. 0.5045055 ]
0.08 [0. 0. 0. 0. 0.]
Aside: the warning emitted when the fit has not converged is a plain UserWarning, which by default warns only once per location. It should use our ConvergenceWarning, which warns always (AFAIR). GLM.fit with IRLS does not warn at all; the scipy optimizers print a convergence statement by default via the disp option.
statsmodels\genmod\generalized_linear_model.py:1323: UserWarning: Elastic net fitting did not converge
  warnings.warn("Elastic net fitting did not converge")

The corresponding source is:

if not result.converged:
    warnings.warn("Elastic net fitting did not converge")
Hi @josef-pkt, thanks for your answers and all the work you've done porting this sort of tool to Python!
Doing some more investigation, I noticed that the score is used in the optimization. From my understanding this makes sense, since you are optimizing the log-likelihood plus the penalty term.
Looking at some references, I see that the dispersion is usually not present in the penalized likelihood. A few examples:
https://cran.r-project.org/web/packages/glmnet/vignettes/glmnetFamily.pdf (first equation)
https://web.stanford.edu/~hastie/THESES/meeyoung_park.pdf (equation 3.3)
https://scholarworks.rit.edu/cgi/viewcontent.cgi?referer=&httpsredir=1&article=2822&context=article (equation 5)
I'm wondering if the sensitive results are caused by the estimate of dispersion that is implicit in the score calculation. I know that statsmodels uses the Pearson chi-squared estimator, and it can be quite high sometimes.
I did a few tests here changing the scale parameter manually, and I noticed that the coefficients are more stable if I use a smaller scale value. Something like this:
result2 = model2.fit_regularized(L1_wt=1, alpha=0.5, maxiter=600, cnvrg_tol=0.001,
                                 score_kwds={"scale": scale})
Would this explain some of the behavior? Could a large scale factor be reducing the weight of the likelihood term relative to the penalty?
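A minimal check of that intuition (a synthetic demo, not the issue's original data): for a GLM the score is proportional to 1/scale, so a large estimated dispersion shrinks the log-likelihood gradient relative to the fixed penalty gradient.

import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
x = sm.add_constant(rng.normal(size=(500, 3)))
y = rng.poisson(np.exp(x @ np.r_[0.0, 1.0, -1.0, 0.5]))

model = sm.GLM(y, x, family=sm.families.Tweedie(var_power=1.5))
params = model.fit().params
g1 = model.score(params, scale=1.0)
g5 = model.score(params, scale=5.0)
print(np.allclose(g5, g1 / 5.0))  # True: score(scale=s) == score(scale=1) / s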
I would really appreciate any references on this (particularly on solving this problem for non-logistic glm models, to see something more complicated).
Thanks again for your work and for taking a look at this!
Another comment that might be relevant: glmnet (R) seems to be using the working response and working weights. I'm still learning all this, but it seems like a slightly different approach than using the score? Very similar, but not including the dispersion parameter.
https://github.com/cran/glmnet/blob/f4fc95ab49efaad9b6e1728a7c840bc6159501dc/R/glmnetFlex.R#L585
I guess my question could be summarized as: do we really need to estimate the dispersion to compute fit_regularized?
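For context, a sketch of the IRLS quantities that kind of scheme typically works with (standard GLM notation, not glmnet's actual code):

z_i = eta_i + (y_i - mu_i) * g'(mu_i)    # working response
w_i = 1 / (V(mu_i) * g'(mu_i)^2)         # working weight

The dispersion phi would enter only as a common factor 1/phi on all the w_i; the glmnet formulation drops it, which matches the observation that this approach does not involve the dispersion.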
It is possible that there is a problem or an inconsistency across methods in the scale handling in fit_regularized. Scale handling in GLM is tricky because we want to include it but don't really need it during optimization. We had problems with the scale handling in fit, but now _fit_gradient explicitly sets the scale temporarily to 1, so it is ignored during optimization and then correctly estimated and used for inference. fit_regularized doesn't do anything explicitly with scale, so scale could possibly differ between loglike and score.
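A user-level sketch of that "fix scale during optimization" idea (a workaround pattern, not the internal _fit_gradient code):

# pin the scale to 1 for the penalized fit, then restore the original setting
scaletype_orig = getattr(model2, "scaletype", None)
model2.scaletype = 1.0
try:
    result_pen = model2.fit_regularized(L1_wt=1, alpha=0.05, maxiter=200)
finally:
    model2.scaletype = scaletype_orig  # restore for later unpenalized fits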
Trying it out in my notebook with approximately your example: fixing the scale doesn't seem to help; it doesn't change the result of full penalization to 0 if L1_wt=1. I fixed the scale by changing the attribute model2.scaletype = 1 before calling fit_regularized.
Our fit_elasticnet in statsmodels.base is generic and uses scipy's brent for the single-parameter optimization in the coordinate descent. We don't have an IRLS equivalent as an elastic net optimizer for GLM.
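A toy sketch of that scheme (heavily simplified relative to statsmodels.base.elastic_net.fit_elasticnet, and ignoring its soft-thresholding details): cycle over the coefficients, minimizing the penalized objective one coordinate at a time with scipy's Brent optimizer.

import numpy as np
from scipy.optimize import brent

def coord_descent(objective, params, n_cycles=50):
    """Minimize `objective` by cyclic one-dimensional Brent searches."""
    params = np.asarray(params, dtype=float).copy()
    for _ in range(n_cycles):
        for j in range(len(params)):
            def f1d(bj, j=j):
                trial = params.copy()
                trial[j] = bj
                return objective(trial)
            # bracket around the current value; brent refines the minimum
            params[j] = brent(f1d, brack=(params[j] - 1.0, params[j] + 1.0))
    return params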
My guess is that the optimizer goes too fast to a small active set and then never includes those parameters again. But I never looked really closely at that code.
For checking: using the unpenalized params as start_params for the L1 penalization works, even with much larger alpha.
With fixed scale:
print("fit", model2.fit().params)
model2.scaletype = 40.
for w in np.linspace(0, 0.8, 9):
result_ = model2.fit_regularized(L1_wt=1, alpha=w, maxiter=200,
cnvrg_tol=0.0001, start_params=result1.params)
print(w, result_.params)
fit [-0.01422211 1.00826417 -1.00142452 0.00863553 0.50976335]
0.0 [-0.01422211 1.00826417 -1.00142452 0.00863553 0.50976335]
0.1 [-0.08731369 1.02946015 -1.01962233 0.00920581 0.51814452]
0.2 [-0.07553214 1.02149855 -1.00972095 0. 0.51153246]
0.30000000000000004 [ 0. 0.98957951 -0.98197254 0. 0.50820342]
0.4 [ 0. 0.98468465 -0.97712582 0. 0.50904197]
0.5 [ 0. 0.98521377 -0.96753847 0. 0.48486038]
0.6000000000000001 [ 0. 0.98130645 -0.9612959 0. 0.47991791]
0.7000000000000001 [ 0. 0.97731195 -0.95471168 0. 0.47474747]
0.8 [ 0. 0.97313611 -1.00464121 0. 0.46737922]
And without fixed scale:
print("fit", model2.fit().params)
#model2.scaletype = 40.
for w in np.linspace(0, 0.8, 9):
result_ = model2.fit_regularized(L1_wt=1, alpha=w, maxiter=200,
cnvrg_tol=0.0001, start_params=result1.params)
print(w, result_.params)
fit [-0.01422211 1.00826417 -1.00142452 0.00863553 0.50976335]
0.0 [-0.01422211 1.00826417 -1.00142452 0.00863553 0.50976335]
0.1 [-0.08731369 1.02946015 -1.01962233 0.00920581 0.51814452]
0.2 [-0.07553214 1.02149855 -1.00972095 0. 0.51153246]
0.30000000000000004 [ 0. 0.98957951 -0.98197254 0. 0.50820342]
0.4 [ 0. 0.98468465 -0.97712582 0. 0.50904197]
0.5 [ 0. 0.98521377 -0.96753847 0. 0.48486038]
0.6000000000000001 [ 0. 0.98130645 -0.9612959 0. 0.47991791]
0.7000000000000001 [ 0. 0.97731195 -0.95471168 0. 0.47474747]
0.8 [ 0. 0.97313611 -1.00464121 0. 0.46737922]
Parameter estimates drop to zero when alpha is around 1.5.
BTW: R glmnet is GPL which is not compatible with our BSD-3 license. So, we are not allowed to look at or translate their code.
@kshedden
Warm starting while increasing the penalty weight also works, up to around 1.5:
print("fit", model2.fit().params)
#model2.scaletype = 40.
start_params = result1.params
for w in np.linspace(0, 1.5, 9):
result_ = model2.fit_regularized(L1_wt=1, alpha=w, maxiter=200,
cnvrg_tol=0.0001, start_params=start_params) #result1.params)
start_params = result_.params
print(w, result_.params)
fit [-0.01422211 1.00826417 -1.00142452 0.00863553 0.50976335]
0.0 [-0.01422211 1.00826417 -1.00142452 0.00863553 0.50976335]
0.1875 [-0.07669628 1.02233972 -1.0108411 0. 0.51238216]
0.375 [ 0. 0.98591468 -0.97835096 0. 0.50881631]
0.5625 [ 0. 1.01207304 -0.95702432 0. 0.4793386 ]
0.75 [ 0. 0.97689093 -0.95305929 0. 0.47034834]
0.9375 [ 0. 0.95596966 -0.94703857 0. 0.45759647]
1.125 [ 0. 1.02562342 -0.90532962 0. 0.43927843]
1.3125 [ 0. 0.9226101 -0.91569423 0. 0.41404356]
1.5 [0. 0. 0. 0. 0.]
In this example the parameters at index 1, 2, 4 are highly significant in the unpenalized regression; it's not easy to pull them smoothly towards zero.
I think the issue is that our score and hessian computations use identities that only hold for likelihoods derived from proper probability distributions. The Tweedie EQL (extended quasi-likelihood) is not such a thing. The following code suggests that the Tweedie gradient and Hessian are wrong (switch the family to Poisson and everything checks out). We may need to implement special-case score and hessian computations for Tweedie.
import numdifftools as nd

# y, x and the Tweedie power p are from the example above
fam = sm.families.Tweedie(var_power=p, eql=True)
model = sm.GLM(y, x, family=fam)
result = model.fit()

# compare numerical and analytic derivatives of the loglike at the MLE
ngrad = nd.Gradient(lambda b: model.loglike(b))(result.params)
agrad = model.score(result.params)
nhess = nd.Hessian(lambda b: model.loglike(b))(result.params)
ahess = model.hessian(result.params)
print(np.max(np.abs(ngrad - agrad)), np.max(np.abs(nhess - ahess)))
The above also fails for other models when the scale is not fixed, e.g. for Gamma.
It looks like after fixing the scale the Hessian agrees, but the score does not. (I'm evaluating away from the MLE parameters, so the score is not just zero plus noise.)
import numdifftools as nd

fam = sm.families.Tweedie(var_power=p, eql=True)
#fam = sm.families.Gamma(link=sm.families.links.log())
model = sm.GLM(y, x, family=fam)
model.scaletype = 1.
result = model.fit()
model.scaletype = 1.  # fit resets scaletype, so fix it again (see aside below)
a = 1.1  # evaluate away from the MLE
ngrad = nd.Gradient(lambda b: model.loglike(b))(result.params * a)
agrad = model.score(result.params * a)
nhess = nd.Hessian(lambda b: model.loglike(b))(result.params * a)
ahess = model.hessian(result.params * a)
np.max(np.abs(ngrad - agrad)), np.max(np.abs(nhess - ahess))
(375.00000000000125, 6.068603397579864e-10)
Aside: fit resets the scaletype, independent of what it already is.
Something strange with the Tweedie loglike: even if it is an extended quasi-likelihood, it should still reduce to estimating the MLE for fixed scale. But using an optimizer that relies only on the loglike does not reproduce the MLE (the MLE estimate based only on derivatives, IRLS and Newton):
# result_n is a Newton-based fit from earlier in the session (not shown)
result_g = model.fit(method="nm", scale=1, start_params=result_n.params,
                     maxiter=5000, maxfun=5000)
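As a quick consistency check (assuming result_n holds the Newton-based fit): if the loglike agreed with the derivatives, the derivative-free fit should stay at the Newton solution up to tolerance.

print(np.max(np.abs(result_g.params - result_n.params)))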
@josef-pkt I found something that could hopefully help explain the issue.
Inside the elastic_net.py module, there is a step where a function called npscore is defined (https://github.com/statsmodels/statsmodels/blob/1a579185724949cf671705ae4f743b449977cd5f/statsmodels/base/elastic_net.py#L47).
Notice how it passes new arguments to the score method, so maybe it is not carrying the attribute model2.scaletype that you set above. That is why I passed a dictionary with score_kwds in my examples (because I noticed that that dictionary is used when defining npscore).
If you run something like this for example:
for alpha in np.linspace(0, 0.1, 11):
    result = model2.fit_regularized(L1_wt=1, alpha=alpha, maxiter=200,
                                    cnvrg_tol=0.01, score_kwds={"scale": 1})
    print(f"alpha = {alpha}")
    print(result.params)
You will get results like these:
alpha = 0.0
[ 1.09322585 -0.9727067 0.02457221 0.44775905]
alpha = 0.01
[ 1.08966576 -0.97402498 0.01705443 0.44316493]
alpha = 0.02
[ 1.08577813 -0.97541566 0.00955056 0.43861726]
alpha = 0.03
[ 1.0816298 -0.97712266 0.00638619 0.43399861]
alpha = 0.04
[ 1.07736718 -0.97912454 0.00786077 0.42918138]
alpha = 0.05
[ 1.06405853 -0.98352536 0.00743217 0.42897913]
alpha = 0.06
[ 1.05937738 -0.9849781 0. 0.42479757]
alpha = 0.07
[ 1.05456426 -0.97926265 0. 0.42003413]
alpha = 0.08
[ 1.04138218 -0.98317384 0. 0.41902822]
alpha = 0.09
[ 1.04432254 -0.98180734 0. 0.40996421]
alpha = 0.1
[ 1.03897094 -0.98311209 0. 0.40497876]
If you just set model2.scaletype = 1 before the loop instead, it produces the same unstable results as before:
model2.scaletype = 1.0
for alpha in np.linspace(0, 0.1, 11):
    result = model2.fit_regularized(L1_wt=1, alpha=alpha, maxiter=200, cnvrg_tol=0.01)
    print(f"alpha = {alpha}")
    print(result.params)
alpha = 0.0
[ 1.00392896 -0.99552302 0.00816112 0.50514782]
alpha = 0.01
[ 1.00333492 -0.99568883 0.00747765 0.50524213]
alpha = 0.02
[ 1.00299068 -0.99575727 0.00677621 0.50532022]
alpha = 0.03
[ 1.00332497 -0.99493846 0.00608081 0.50539896]
alpha = 0.04
[ 1.00236527 -0.99430337 0.00495188 0.50555515]
alpha = 0.05
[ 1.00200547 -0.99376493 0.00497748 0.50569831]
alpha = 0.06
[ 1.00199814 -0.99606201 0.0040994 0.50558269]
alpha = 0.07
[ 1.00128377 -0.99591126 0. 0.50431653]
alpha = 0.08
[0. 0. 0. 0.]
alpha = 0.09
[0. 0. 0. 0.]
alpha = 0.1
[0. 0. 0. 0.]
It is hard to see what is happening to the scale parameter... my point from the comment above: it seems like the score calculated during the elastic_net optimization was not using scale=1 when you set scaletype=1. Please correct me if I'm wrong!
OK, another piece of the puzzle: inside fit_elasticnet, univariate models are defined. Since these are defined from scratch, they have scaletype=None, so whenever score is called for these models the scale parameter is estimated (instead of being fixed at 1, as we want in some of the examples above).
Using score_kwds works because this dictionary is passed to model.score inside the npscore (gradient) function.
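A small demo of that mechanism (synthetic data, not from the issue; the one-column submodel is only meant to mimic what fit_elasticnet builds internally):

import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
x = sm.add_constant(rng.normal(size=(200, 2)))
y = rng.gamma(shape=2.0, scale=np.exp(x @ np.r_[0.5, 1.0, -1.0]) / 2.0)

model = sm.GLM(y, x, family=sm.families.Gamma(link=sm.families.links.log()))
model.scaletype = 1.0  # fixed on the full model ...

# ... but a univariate model rebuilt from scratch starts without it,
# so its score() falls back to estimating the scale
submodel = sm.GLM(y, x[:, :1], family=model.family)
print(model.scaletype, getattr(submodel, "scaletype", None))  # 1.0 None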
Thanks for investigating this. I think you are right about the scale part. We will have to add those keyword options inside GLM.fit_regularized.
Did you check fit_regularized for any of the other families with scale set to 1: Gaussian, Gamma, ...? The problem with scale handling should show up in the same or a similar way with those families too.
Tweedie still has the additional problem that the log-likelihood function is just an approximation and not the true loglike.
My pleasure, @josef-pkt!
I tested with gamma and the behavior is similar. Some code comparing both:
import statsmodels.api as sm
import numpy as np
from statsmodels.datasets import cpunish
from sklearn.preprocessing import MinMaxScaler

# loading and scaling the data
data = cpunish.load_pandas()
scaler = MinMaxScaler()
X = sm.add_constant(scaler.fit_transform(data.exog))

# gamma distribution
family = sm.families.Gamma(link=sm.families.links.log())
model = sm.GLM(endog=data.endog, exog=X, family=family)

for alpha in np.linspace(0, 0.1, 11):
    result = model.fit_regularized(L1_wt=1, alpha=alpha)
    print(f"alpha = {alpha}")
    print(result.params.values)
alpha = 0.0
[ 1.00939923 3.06580906 -0.00383133 -2.02574066 -0.32918837 1.58922577
-2.63288559]
alpha = 0.01
[ 0.93920064 2.54766162 0.49041371 0. -1.15889221 0.78228651
-2.6785279 ]
alpha = 0.02
[ 0.90852312 3.12739928 0. -1.87698014 -0.24372091 1.55441484
-2.63880967]
alpha = 0.03
[0.6404057 0. 0. 0. 0. 1.71336537
0. ]
alpha = 0.04
[0. 0. 0. 0. 0. 0. 0.]
alpha = 0.05
[0. 0. 0. 0. 0. 0. 0.]
alpha = 0.06
[0. 0. 0. 0. 0. 0. 0.]
alpha = 0.07
[0. 0. 0. 0. 0. 0. 0.]
alpha = 0.08
[0. 0. 0. 0. 0. 0. 0.]
alpha = 0.09
[0. 0. 0. 0. 0. 0. 0.]
alpha = 0.1
[0. 0. 0. 0. 0. 0. 0.]
for alpha in np.linspace(0, 0.1, 11):
    result = model.fit_regularized(L1_wt=1, alpha=alpha,
                                   score_kwds={"scale": 1})
    print(f"alpha = {alpha}")
    print(result.params.values)
alpha = 0.0
[ 1.27473353 2.62040181 -0.21150152 -1.97858454 -0.38716299 1.59247365
-2.47523349]
alpha = 0.01
[ 1.37574642 2.01986334 -0.13379665 -1.42900167 -0.52673751 1.34598802
-2.23650554]
alpha = 0.02
[ 1.44243789 1.36962666 -0.10412645 -1.00786172 -0.61426267 1.34079782
-1.91517913]
alpha = 0.03
[ 0.99424751 2.78925445 0. -1.13073303 -0.47419459 1.23462334
-2.59195276]
alpha = 0.04
[ 1.44220862 0. 0. -1.07839655 -0.31967648 1.57910856
-0.74003859]
alpha = 0.05
[ 1.40896323 0. 0. 0. -0.78595435 1.3477018
-0.94750249]
alpha = 0.06
[ 1.29324222 0. 0. 0. -0.65807816 1.40103612
-0.82548472]
alpha = 0.07
[ 1.21346773 0. 0. 0. -0.49987488 1.42137742
-0.78088003]
alpha = 0.08
[ 1.05090512 0. 0. 0. -0.36170904 1.49860479
-0.57473044]
alpha = 0.09
[ 0.90041326 0. 0. 0. -0.22721808 1.5815201
-0.62570787]
alpha = 0.1
[ 0.75297452 0. 0. 0. 0. 1.61722715
-0.25903994]
This should be fixed by #7571. I will merge in a few days if no objections.
Describe the bug
The estimated coefficients are extremely sensitive to variations in the alpha parameter when using the Tweedie distribution. Not only is it very hard to find alpha values that generate non-null coefficients, but tiny variations in those alpha values drastically change the estimated coefficients.
Code Sample, a copy-pastable example if possible
If the issue has not been resolved, please file it in the issue tracker.
Expected Output
There isn't a fixed expected output for this, as it depends on the data, model configuration, etc. In any case, this kind of variation seems unreasonable, and it doesn't match the behavior of other packages (H2O, for example).
Output of
import statsmodels.api as sm; sm.show_versions()