
Statsmodels: statistical modeling and econometrics in Python
http://www.statsmodels.org/devel/
BSD 3-Clause "New" or "Revised" License

fit_regularized method is unstable/too sensitive to alpha when using Tweedie distribution #7476

Open · diegodebrito opened this issue 3 years ago

diegodebrito commented 3 years ago

Describe the bug

The estimated coefficients are extremely sensitive to variations in the alpha parameter when using the Tweedie distribution. Not only is it very hard to find alpha values that produce non-zero coefficients, but tiny variations in those alpha values will drastically change the estimated coefficients.

Code Sample, a copy-pastable example if possible

# based on test case from https://github.com/statsmodels/statsmodels/blob/main/statsmodels/genmod/tests/test_glm.py
# line 1998

import statsmodels.api as sm
import numpy as np

np.random.seed(3242)
n = 500
p = 1.5 # Tweedie variance power
x = np.random.normal(size=(n, 4))
lpr = np.dot(x, np.r_[1, -1, 0, 0.5])
mu = np.exp(lpr)
lam = 10 * mu**(2 - p) / (2 - p)
alp = (2 - p) / (p - 1)
bet = 10 * mu**(1 - p) / (p - 1)

y = np.empty(n)
N = np.random.poisson(lam)
for i in range(n):
    y[i] = np.random.gamma(alp, 1 / bet[i], N[i]).sum()

fam = sm.families.Tweedie(var_power=p, eql=True)
model2 = sm.GLM(y, x, family=fam)
result2 = model2.fit_regularized(L1_wt=1, alpha=0.07, maxiter=200,
               cnvrg_tol=0.01)
result2.params

# all coefficients will be zero in this case
result3 = model2.fit_regularized(L1_wt=1, alpha=0.08, maxiter=200,
               cnvrg_tol=0.01)
result3.params

Expected Output

There isn't a fixed expected output for this, as it will depend on the data, model configuration, etc. In any case, this kind of variation seems unreasonable, and it doesn't match the behavior of other packages (H2O, for example).

Output of import statsmodels.api as sm; sm.show_versions()

INSTALLED VERSIONS
------------------
Python: 3.9.5.final.0

statsmodels
===========
Installed: 0.12.2 (C:\Users\alias\repos\feature-selection\.venv\lib\site-packages\statsmodels)

Required Dependencies
=====================
cython: Not installed
numpy: 1.20.2 (C:\Users\alias\repos\feature-selection\.venv\lib\site-packages\numpy)
scipy: 1.6.2 (C:\Users\alias\repos\feature-selection\.venv\lib\site-packages\scipy)
pandas: 1.2.3 (C:\Users\alias\repos\feature-selection\.venv\lib\site-packages\pandas)
dateutil: 2.8.1 (C:\Users\alias\repos\feature-selection\.venv\lib\site-packages\dateutil)
patsy: 0.5.1 (C:\Users\alias\repos\feature-selection\.venv\lib\site-packages\patsy)

Optional Dependencies
=====================
matplotlib: 3.4.2 (C:\Users\alias\repos\feature-selection\.venv\lib\site-packages\matplotlib)
backend: module://ipykernel.pylab.backend_inline
cvxopt: Not installed
joblib: 1.0.1 (C:\Users\alias\repos\feature-selection\.venv\lib\site-packages\joblib)

Developer Tools
================
IPython: 7.23.1 (C:\Users\alias\repos\feature-selection\.venv\lib\site-packages\IPython)
jinja2: 3.0.1 (C:\Users\alias\repos\feature-selection\.venv\lib\site-packages\jinja2)
sphinx: 3.5.4 (C:\Users\alias\repos\feature-selection\.venv\lib\site-packages\sphinx)
pygments: 2.9.0 (C:\Users\alias\repos\feature-selection\.venv\lib\site-packages\pygments)
pytest: 6.2.4 (C:\Users\alias\repos\feature-selection\.venv\lib\site-packages\pytest)
virtualenv: 20.4.7 (C:\Users\alias\repos\feature-selection\.venv\lib\site-packages\virtualenv)
josef-pkt commented 3 years ago

Adding a small L2 penalization stabilizes it in some cases, but the params still jump suddenly to zero, and alpha=0 doesn't produce correct results.

It could be that the Tweedie loglike is not accurate enough. The default fit does not need the loglike for its optimization.

I added a constant to the model (x = sm.add_constant(x)), but that doesn't change the behavior.

print("fit", model2.fit().params)
for w in np.linspace(0, 0.08, 9):
    result_ = model2.fit_regularized(L1_wt=0.99, alpha=w, maxiter=200,
                   cnvrg_tol=0.0001)
    print(w, result_.params)
fit [-0.01422211  1.00826417 -1.00142452  0.00863553  0.50976335]
0.0 [-89.70510123   0.           0.           0.           0.        ]
0.01 [ 0.          1.00207192 -0.9959421   0.00504013  0.50551591]
0.02 [ 0.          1.00211793 -0.99571097  0.00502683  0.50552728]
0.03 [ 0.          1.00236479 -0.99468019  0.          0.50592973]
0.04 [ 0.          1.00248133 -0.99396627  0.          0.50592665]
0.05 [ 0.          1.00258026 -0.99325759  0.          0.50545311]
0.06 [ 0.          1.00260533 -0.99279867  0.00382331  0.50470115]
0.07 [ 0.          1.00277812 -0.9918399   0.          0.5045055 ]
0.08 [0. 0. 0. 0. 0.]
josef-pkt commented 3 years ago

Aside: the warning emitted when the fit does not converge is a plain UserWarning, which warns only once.

It should use our ConvergenceWarning, which (AFAIR) always warns. GLM.fit with IRLS does not warn at all; the scipy optimizers print a convergence statement by default via the disp option.

statsmodels\genmod\generalized_linear_model.py:1323: UserWarning: Elastic net fitting did not converge
  warnings.warn("Elastic net fitting did not converge")

        if not result.converged:
            warnings.warn("Elastic net fitting did not converge")
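A minimal sketch of the suggested change, assuming the existing ConvergenceWarning class in statsmodels.tools.sm_exceptions:

import warnings
from statsmodels.tools.sm_exceptions import ConvergenceWarning

        if not result.converged:
            # ConvergenceWarning is configured to be shown every time
            warnings.warn("Elastic net fitting did not converge",
                          ConvergenceWarning)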
diegodebrito commented 3 years ago

Hi @josef-pkt, thanks for your answers and all the work you've done porting this sort of tool to Python!

Doing some more investigation, I noticed that the score is used in the optimization. From my understanding this makes sense, since you are optimizing the log-likelihood plus the penalty term.

Looking at some references, I see that the dispersion is usually not present in the penalized likelihood. A few examples:

https://cran.r-project.org/web/packages/glmnet/vignettes/glmnetFamily.pdf (first equation)
https://web.stanford.edu/~hastie/THESES/meeyoung_park.pdf (equation 3.3)
https://scholarworks.rit.edu/cgi/viewcontent.cgi?referer=&httpsredir=1&article=2822&context=article (equation 5)
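For reference, the penalized objective in the glmnet vignette (first link) has the form

    min over (b0, b):  -(1/N) * sum_i loglik(y_i, b0 + x_i' b) + lambda * [ (1 - a)/2 * ||b||_2^2 + a * ||b||_1 ]

i.e. the log-likelihood term is not divided by a dispersion parameter.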

I'm wondering if the unstable results are caused by the estimate of dispersion that is implicit in the score calculation. I know that statsmodels uses the Pearson X2 estimator, and it can be quite large sometimes.
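For context, the Pearson X2 scale estimate is

    scale_hat = (1/(n - k)) * sum_i (y_i - mu_hat_i)^2 / V(mu_hat_i)

with V the family's variance function (V(mu) = mu^p for Tweedie) and k the number of parameters.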

I did a few tests here changing the scale parameter manually, and I noticed that the coefficients are more stable if I use a smaller scale value. Something like this:

# scale here is a manually chosen value (smaller than the Pearson X2 estimate)
result2 = model2.fit_regularized(L1_wt=1, alpha=0.5, maxiter=600,
                                 cnvrg_tol=0.001, score_kwds={"scale": scale})

Would this explain some of the behavior? Could a large scale factor be reducing the weight of the likelihood term relative to the penalty?

I would really appreciate any references on this (particularly on solving this problem for non-logistic GLMs, to see something more complicated).

Thanks again for your work and for taking a look at this!

diegodebrito commented 3 years ago

Another comment that might be relevant: glmnet (R) seems to be using the working response and working weights. I'm still learning all this, but it seems like a slightly different approach than using the score? Very similar, but without the dispersion parameter.

https://github.com/cran/glmnet/blob/f4fc95ab49efaad9b6e1728a7c840bc6159501dc/R/glmnetFlex.R#L585
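To make the working response/weights idea concrete, here is a minimal sketch of the quantities one penalized IRLS step uses, assuming a log link and the Tweedie variance function V(mu) = mu**p (my notation, not glmnet's actual code):

import numpy as np

def working_quantities(X, y, beta, p=1.5):
    # log link: eta = X @ beta, mu = exp(eta), d(mu)/d(eta) = mu
    eta = X @ beta
    mu = np.exp(eta)
    dmu_deta = mu
    z = eta + (y - mu) / dmu_deta   # working response
    w = dmu_deta**2 / mu**p         # working weights: no dispersion factor
    return z, w

Each pass then solves a weighted penalized least-squares problem in (z, w), so the dispersion never enters the fit.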

I guess my question can be summarized as: do we really need to estimate dispersion to compute fit_regularized?

josef-pkt commented 3 years ago

It is possible that there is a problem or an inconsistency across methods in the scale handling in fit_regularized.

scale handling in GLM is tricky because we want to include it but don't really need it during optimization.

We had problems with the scale handling in fit, but now _fit_gradient explicitly sets the scale temporarily to 1, so it is ignored during optimization but then correctly estimated and used for inference.

fit_regularized doesn't do anything explicitly with scale, so scale could possibly differ between loglike and score.

Trying this out in my notebook with approximately your example: fixing the scale doesn't seem to help; it doesn't change the all-zero result under full L1 penalization (L1_wt=1).

I fixed the scale by setting the attribute model2.scaletype = 1 before calling fit_regularized.
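Spelled out as a snippet (the result name is mine):

model2.scaletype = 1   # fix the dispersion/scale at 1
result_fixed = model2.fit_regularized(L1_wt=1, alpha=0.07, maxiter=200,
                                      cnvrg_tol=0.01)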

josef-pkt commented 3 years ago

Our fit_elasticnet in statsmodels.base is generic and uses scipy's brent for the single-parameter optimization inside coordinate descent. We don't have an IRLS equivalent as an optimizer for elastic net in GLM.

My guess is that the optimizer moves too quickly to a small active set and then never brings those parameters back in. But I have never looked really closely at that code.
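Schematically, the generic coordinate descent looks like this toy version (an ordinary least-squares stand-in for the GLM loglike; not the actual statsmodels source):

import numpy as np
from scipy.optimize import brent

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))
y = X @ np.r_[1.0, -1.0, 0.0, 0.5] + rng.normal(size=100)

alpha = 0.1
beta = np.zeros(4)
for _ in range(50):                    # outer coordinate-descent sweeps
    for j in range(4):
        def obj(bj, j=j):
            # penalized objective as a function of coordinate j only
            b = beta.copy()
            b[j] = bj
            return ((y - X @ b) ** 2).sum() / (2 * len(y)) + alpha * np.abs(b).sum()
        beta[j] = brent(obj, brack=(beta[j] - 1.0, beta[j] + 1.0))
print(beta)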

For checking: using the unpenalized params as start_params for L1 penalization works, even with much larger alpha.

with fixed scale (scaletype set to a fixed numeric value; result1 is the unpenalized fit from my notebook):
print("fit", model2.fit().params)
model2.scaletype = 40.
for w in np.linspace(0, 0.8, 9):
    result_ = model2.fit_regularized(L1_wt=1, alpha=w, maxiter=200,
                   cnvrg_tol=0.0001, start_params=result1.params)
    print(w, result_.params)
fit [-0.01422211  1.00826417 -1.00142452  0.00863553  0.50976335]
0.0 [-0.01422211  1.00826417 -1.00142452  0.00863553  0.50976335]
0.1 [-0.08731369  1.02946015 -1.01962233  0.00920581  0.51814452]
0.2 [-0.07553214  1.02149855 -1.00972095  0.          0.51153246]
0.30000000000000004 [ 0.          0.98957951 -0.98197254  0.          0.50820342]
0.4 [ 0.          0.98468465 -0.97712582  0.          0.50904197]
0.5 [ 0.          0.98521377 -0.96753847  0.          0.48486038]
0.6000000000000001 [ 0.          0.98130645 -0.9612959   0.          0.47991791]
0.7000000000000001 [ 0.          0.97731195 -0.95471168  0.          0.47474747]
0.8 [ 0.          0.97313611 -1.00464121  0.          0.46737922]

and without fixed scale

print("fit", model2.fit().params)
#model2.scaletype = 40.
for w in np.linspace(0, 0.8, 9):
    result_ = model2.fit_regularized(L1_wt=1, alpha=w, maxiter=200,
                   cnvrg_tol=0.0001, start_params=result1.params)
    print(w, result_.params)
fit [-0.01422211  1.00826417 -1.00142452  0.00863553  0.50976335]
0.0 [-0.01422211  1.00826417 -1.00142452  0.00863553  0.50976335]
0.1 [-0.08731369  1.02946015 -1.01962233  0.00920581  0.51814452]
0.2 [-0.07553214  1.02149855 -1.00972095  0.          0.51153246]
0.30000000000000004 [ 0.          0.98957951 -0.98197254  0.          0.50820342]
0.4 [ 0.          0.98468465 -0.97712582  0.          0.50904197]
0.5 [ 0.          0.98521377 -0.96753847  0.          0.48486038]
0.6000000000000001 [ 0.          0.98130645 -0.9612959   0.          0.47991791]
0.7000000000000001 [ 0.          0.97731195 -0.95471168  0.          0.47474747]
0.8 [ 0.          0.97313611 -1.00464121  0.          0.46737922]

parameter estimates drop to zero when alpha is around 1.5

josef-pkt commented 3 years ago

BTW: R's glmnet is GPL, which is not compatible with our BSD-3 license, so we are not allowed to look at or translate their code.

josef-pkt commented 3 years ago

@kshedden

josef-pkt commented 3 years ago

Warm starting while increasing the penalty weight also works, up to around alpha = 1.5:

print("fit", model2.fit().params)
#model2.scaletype = 40.
start_params = result1.params
for w in np.linspace(0, 1.5, 9):
    result_ = model2.fit_regularized(L1_wt=1, alpha=w, maxiter=200,
                   cnvrg_tol=0.0001, start_params=start_params) #result1.params)
    start_params = result_.params
    print(w, result_.params)
fit [-0.01422211  1.00826417 -1.00142452  0.00863553  0.50976335]
0.0 [-0.01422211  1.00826417 -1.00142452  0.00863553  0.50976335]
0.1875 [-0.07669628  1.02233972 -1.0108411   0.          0.51238216]
0.375 [ 0.          0.98591468 -0.97835096  0.          0.50881631]
0.5625 [ 0.          1.01207304 -0.95702432  0.          0.4793386 ]
0.75 [ 0.          0.97689093 -0.95305929  0.          0.47034834]
0.9375 [ 0.          0.95596966 -0.94703857  0.          0.45759647]
1.125 [ 0.          1.02562342 -0.90532962  0.          0.43927843]
1.3125 [ 0.          0.9226101  -0.91569423  0.          0.41404356]
1.5 [0. 0. 0. 0. 0.]

In this example the parameters at indices 1, 2, and 4 are highly significant in the unpenalized regression; it's not easy to pull them smoothly towards zero.

kshedden commented 3 years ago

I think the issue is that our score and Hessian computations use identities that only hold for likelihoods derived from proper probability distributions. The Tweedie EQL is not such a thing. The following code suggests that the Tweedie gradient and Hessian are wrong (switch the family to Poisson and everything checks out). We may need to implement special-case score and Hessian computations for Tweedie.

import numdifftools as nd

fam = sm.families.Tweedie(var_power=p, eql=True)
model = sm.GLM(y, x, family=fam)
result = model.fit()

# numerical vs. analytic derivatives of the loglike at the fitted params
ngrad = nd.Gradient(model.loglike)(result.params)
agrad = model.score(result.params)
nhess = nd.Hessian(model.loglike)(result.params)
ahess = model.hessian(result.params)
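Comparing the numerical and analytic derivatives (this check is mine, not part of the original snippet; the next comment reports the same comparison):

print(np.max(np.abs(ngrad - agrad)), np.max(np.abs(nhess - ahess)))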
josef-pkt commented 3 years ago

The above also fails for other models when the scale is not fixed, e.g. for Gamma.

It looks like after fixing the scale the Hessian agrees, but the score function does not. I'm evaluating away from the MLE parameters, so the score is not just zero plus noise.

import numdifftools as nd

fam = sm.families.Tweedie(var_power=p, eql=True)
# fam = sm.families.Gamma(link=sm.families.links.log())
model = sm.GLM(y, x, family=fam)
model.scaletype = 1.
result = model.fit()
model.scaletype = 1.  # fit resets scaletype, so fix it again (see aside below)

# evaluate away from the MLE so the score is not just zero plus noise
a = 1.1
ngrad = nd.Gradient(model.loglike)(result.params * a)
agrad = model.score(result.params * a)
nhess = nd.Hessian(model.loglike)(result.params * a)
ahess = model.hessian(result.params * a)

np.max(np.abs(ngrad - agrad)), np.max(np.abs(nhess - ahess))
(375.00000000000125, 6.068603397579864e-10)

aside: fit resets the scaletype, independent of what it already is.

josef-pkt commented 3 years ago

Something strange with the Tweedie loglike:

Even if it is an extended quasi-likelihood, it should still reduce to estimating the MLE for fixed scale.
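For reference, the extended quasi-likelihood (Nelder and Pregibon) that eql=True refers to has, if I recall the constants correctly, the form

    Q+(y; mu, phi) = -(1/2) * sum_i [ d(y_i, mu_i)/phi + log(2*pi*phi*V(y_i)) ]

with d the unit deviance and V the variance function. For fixed phi, maximizing Q+ over the regression parameters is equivalent to minimizing the deviance, so it should agree with the derivative-based estimates.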

Using an optimizer that uses only the loglike does not reproduce the MLE (the MLE estimate being based only on derivatives: IRLS and Newton):

result_g = model.fit(method="nm", scale=1, start_params=result_n.params,
                     maxiter=5000, maxfun=5000)

diegodebrito commented 3 years ago

> I fixed the scale by setting the attribute model2.scaletype = 1 before calling fit_regularized.

@josef-pkt I found something that could hopefully help explain the issue.

Inside the elastic_net.py module, there is a step where a function called npscore is defined (https://github.com/statsmodels/statsmodels/blob/1a579185724949cf671705ae4f743b449977cd5f/statsmodels/base/elastic_net.py#L47).

Notice how it passes extra arguments to the score method, so I wonder whether it is really carrying the attribute model2.scaletype that you set above! That is why I passed a dictionary with score_kwds in my examples (I noticed that dictionary is used when defining npscore).

If you run something like this for example:

for alpha in np.linspace(0, 0.1, 11):
    result = model2.fit_regularized(L1_wt=1, alpha=alpha, maxiter=200, cnvrg_tol=0.01, score_kwds={"scale": 1})
    print(f"alpha = {alpha}")
    print(result.params)

You will get results like these:
alpha = 0.0
[ 1.09322585 -0.9727067   0.02457221  0.44775905]
alpha = 0.01
[ 1.08966576 -0.97402498  0.01705443  0.44316493]
alpha = 0.02
[ 1.08577813 -0.97541566  0.00955056  0.43861726]
alpha = 0.03
[ 1.0816298  -0.97712266  0.00638619  0.43399861]
alpha = 0.04
[ 1.07736718 -0.97912454  0.00786077  0.42918138]
alpha = 0.05
[ 1.06405853 -0.98352536  0.00743217  0.42897913]
alpha = 0.06
[ 1.05937738 -0.9849781   0.          0.42479757]
alpha = 0.07
[ 1.05456426 -0.97926265  0.          0.42003413]
alpha = 0.08
[ 1.04138218 -0.98317384  0.          0.41902822]
alpha = 0.09
[ 1.04432254 -0.98180734  0.          0.40996421]
alpha = 0.1
[ 1.03897094 -0.98311209  0.          0.40497876]

If you just set model2.scaletype = 1 before the loop (without score_kwds), it produces the same unstable results as before:

model2.scaletype = 1.0
for alpha in np.linspace(0, 0.1, 11):
    result = model2.fit_regularized(L1_wt=1, alpha=alpha, maxiter=200, cnvrg_tol=0.01)
    print(f"alpha = {alpha}")
    print(result.params)

alpha = 0.0
[ 1.00392896 -0.99552302  0.00816112  0.50514782]
alpha = 0.01
[ 1.00333492 -0.99568883  0.00747765  0.50524213]
alpha = 0.02
[ 1.00299068 -0.99575727  0.00677621  0.50532022]
alpha = 0.03
[ 1.00332497 -0.99493846  0.00608081  0.50539896]
alpha = 0.04
[ 1.00236527 -0.99430337  0.00495188  0.50555515]
alpha = 0.05
[ 1.00200547 -0.99376493  0.00497748  0.50569831]
alpha = 0.06
[ 1.00199814 -0.99606201  0.0040994   0.50558269]
alpha = 0.07
[ 1.00128377 -0.99591126  0.          0.50431653]
alpha = 0.08
[0. 0. 0. 0.]
alpha = 0.09
[0. 0. 0. 0.]
alpha = 0.1
[0. 0. 0. 0.]
diegodebrito commented 3 years ago

It is hard to see what is happening to the scale parameter. My point from the comment above: it seems like the score calculated during the elastic_net optimization was not using scale=1, even when you set scaletype=1 as you did. Please correct me if I'm wrong!

diegodebrito commented 3 years ago

OK, another piece of the puzzle: inside fit_elasticnet, univariate sub-models are defined. Since these are built from scratch, they have scaletype=None, so whenever score is called on these models the scale parameter is estimated (instead of being fixed at 1, as we want in some of the examples above).

https://github.com/statsmodels/statsmodels/blob/1a579185724949cf671705ae4f743b449977cd5f/statsmodels/base/elastic_net.py#L192

Using score_kwds works because that dictionary is passed to model.score inside the npscore (gradient) function.
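Schematic of the mechanism (not the actual elastic_net.py source):

# fit_elasticnet builds, per coordinate, a fresh one-parameter sub-model;
# that sub-model has scaletype=None, so its score() re-estimates the scale.
# only score_kwds is threaded through to the gradient, roughly:
#
#     def npscore(params):
#         return -model_1d.score(params, **score_kwds) / nobs
#
# hence score_kwds={"scale": 1} reaches score(), while model2.scaletype = 1
# set on the original model does not propagate to the sub-models.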

josef-pkt commented 3 years ago

Thanks for investigating this. I think you are right about the scale part. We will have to add those keyword options inside GLM.fit_regularized.

Did you check fit_regularized for any of the other families with the scale set to 1 (Gaussian, Gamma, ...)? The problem with scale handling should show up in the same or a similar way for those families as well.

Tweedie still has the additional problem that its log-likelihood function is just an approximation, not the true loglike.

diegodebrito commented 3 years ago

My pleasure, @josef-pkt!

I tested with Gamma and the behavior is similar. Some code comparing both:

import statsmodels.api as sm
import numpy as np
from statsmodels.datasets import cpunish
from sklearn.preprocessing import MinMaxScaler

# loading and scaling the data
data = cpunish.load_pandas()
scaler = MinMaxScaler()
X = sm.add_constant(scaler.fit_transform(data.exog))

# gamma distribution
family = sm.families.Gamma(link=sm.families.links.log())
model = sm.GLM(endog=data.endog, exog=X, family=family)

# without fixing the scale
for alpha in np.linspace(0, 0.1, 11):
    result = model.fit_regularized(L1_wt=1, alpha=alpha)
    print(f"alpha = {alpha}")
    print(result.params.values)

alpha = 0.0
[ 1.00939923  3.06580906 -0.00383133 -2.02574066 -0.32918837  1.58922577
 -2.63288559]
alpha = 0.01
[ 0.93920064  2.54766162  0.49041371  0.         -1.15889221  0.78228651
 -2.6785279 ]
alpha = 0.02
[ 0.90852312  3.12739928  0.         -1.87698014 -0.24372091  1.55441484
 -2.63880967]
alpha = 0.03
[0.6404057  0.         0.         0.         0.         1.71336537
 0.        ]
alpha = 0.04
[0. 0. 0. 0. 0. 0. 0.]
alpha = 0.05
[0. 0. 0. 0. 0. 0. 0.]
alpha = 0.06
[0. 0. 0. 0. 0. 0. 0.]
alpha = 0.07
[0. 0. 0. 0. 0. 0. 0.]
alpha = 0.08
[0. 0. 0. 0. 0. 0. 0.]
alpha = 0.09
[0. 0. 0. 0. 0. 0. 0.]
alpha = 0.1
[0. 0. 0. 0. 0. 0. 0.]

# with the scale fixed at 1 via score_kwds
for alpha in np.linspace(0, 0.1, 11):
    result = model.fit_regularized(L1_wt=1, alpha=alpha,
                                   score_kwds={"scale": 1})
    print(f"alpha = {alpha}")
    print(result.params.values)

alpha = 0.0
[ 1.27473353  2.62040181 -0.21150152 -1.97858454 -0.38716299  1.59247365
 -2.47523349]
alpha = 0.01
[ 1.37574642  2.01986334 -0.13379665 -1.42900167 -0.52673751  1.34598802
 -2.23650554]
alpha = 0.02
[ 1.44243789  1.36962666 -0.10412645 -1.00786172 -0.61426267  1.34079782
 -1.91517913]
alpha = 0.03
[ 0.99424751  2.78925445  0.         -1.13073303 -0.47419459  1.23462334
 -2.59195276]
alpha = 0.04
[ 1.44220862  0.          0.         -1.07839655 -0.31967648  1.57910856
 -0.74003859]
alpha = 0.05
[ 1.40896323  0.          0.          0.         -0.78595435  1.3477018
 -0.94750249]
alpha = 0.06
[ 1.29324222  0.          0.          0.         -0.65807816  1.40103612
 -0.82548472]
alpha = 0.07
[ 1.21346773  0.          0.          0.         -0.49987488  1.42137742
 -0.78088003]
alpha = 0.08
[ 1.05090512  0.          0.          0.         -0.36170904  1.49860479
 -0.57473044]
alpha = 0.09
[ 0.90041326  0.          0.          0.         -0.22721808  1.5815201
 -0.62570787]
alpha = 0.1
[ 0.75297452  0.          0.          0.          0.          1.61722715
 -0.25903994]
kshedden commented 3 years ago

This should be fixed by #7571. I will merge in a few days if no objections.