glm-tools / pyglmnet

Python implementation of elastic-net regularized generalized linear models
http://glm-tools.github.io/pyglmnet/
MIT License

GLMCV results do not match GLM with same parameters and optimal lambda #377

Closed: whoisnnamdi closed this issue 4 years ago

whoisnnamdi commented 4 years ago

Problem: The optimal penalization parameter (lambda) found via GLMCV does not yield similar results when plugged into GLM with otherwise the same parameters.

Example: The script below mostly follows the group lasso example code in the docs, except modified for regular lasso instead of group lasso.

from pyglmnet import GLMCV
from pyglmnet import GLM
from pyglmnet.datasets import fetch_group_lasso_datasets
from sklearn.model_selection import train_test_split

df, group_idxs = fetch_group_lasso_datasets()

X = df[df.columns.difference(["Label"])].values
y = df.loc[:, "Label"].values

Xtrain, Xtest, ytrain, ytest = \
    train_test_split(X, y, test_size=0.2, random_state=42)

# Setup lasso cv model
gl_glm = GLMCV(distr="binomial", tol=1e-3,
               score_metric="pseudo_R2",
               alpha=1.0, learning_rate=1, max_iter=100, cv=3, verbose=True)

# Fit the model
gl_glm.fit(Xtrain, ytrain)

# Save the optimal lambda based on highest score
opt_lambda = gl_glm.reg_lambda[gl_glm.scores_.index(max(gl_glm.scores_))]
print(opt_lambda)    # 0.010000000000000007

# Setup lasso model using optimal lambda found earlier, all other relevant parameters kept the same
glm = GLM(distr="binomial", tol=1e-3, reg_lambda=opt_lambda,
            score_metric="pseudo_R2",
            alpha=1.0, learning_rate=1, max_iter=100, verbose=True)

# Fit the model
glm.fit(Xtrain, ytrain)

# Compare beta coefficients
print(gl_glm.beta_ - glm.beta_)

Results: You can tweak the learning rate and iterations of the second model, but the results will never match those of GLMCV, even with many iterations and a low learning rate.

I understand there may be some inherent instability given the way convergence is reached, but this feels like too much. In particular, since an important purpose of lasso is feature selection, having certain variables come out with non-zero coefficients in one model but not the other somewhat defeats that purpose.
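For concreteness, here is a quick way to check the disagreement in selected features (a sketch; the 1e-8 cutoff for "numerically zero" is my own assumption, and gl_glm / glm are the fitted models from the script above):

import numpy as np

# indices of features with non-zero coefficients in each model
sel_cv = np.flatnonzero(np.abs(gl_glm.beta_) > 1e-8)   # 1e-8: assumed cutoff
sel_glm = np.flatnonzero(np.abs(glm.beta_) > 1e-8)

# features selected by one model but not the other
print(np.setxor1d(sel_cv, sel_glm))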

Thank you for looking into this.

jasmainak commented 4 years ago

The differences are really tiny if you reduce tol and increase max_iter. See below:

from pyglmnet import GLMCV
from pyglmnet import GLM
from pyglmnet.datasets import fetch_group_lasso_datasets
from sklearn.model_selection import train_test_split

import matplotlib.pyplot as plt

df, group_idxs = fetch_group_lasso_datasets()

X = df[df.columns.difference(["Label"])].values
y = df.loc[:, "Label"].values

Xtrain, Xtest, ytrain, ytest = \
    train_test_split(X, y, test_size=0.2, random_state=42)

# Setup lasso cv model
gl_glm = GLMCV(distr="binomial", tol=1e-7,
               score_metric="pseudo_R2",
               alpha=1.0, learning_rate=1, max_iter=200, cv=3, verbose=True)

# Fit the model
gl_glm.fit(Xtrain, ytrain)

# Save the optimal lambda based on highest score
opt_lambda = gl_glm.reg_lambda[gl_glm.scores_.index(max(gl_glm.scores_))]
print(opt_lambda)    # 0.010000000000000007

# Setup lasso model using optimal lambda found earlier, all other relevant parameters kept the same
glm = GLM(distr="binomial", tol=1e-7, reg_lambda=opt_lambda,
            score_metric="pseudo_R2",
            alpha=1.0, learning_rate=1., max_iter=200, verbose=True)

# Fit the model
glm.fit(Xtrain, ytrain)

# Compare beta coefficients
print(gl_glm.beta_ - glm.beta_)

plt.plot(gl_glm.beta_)
plt.plot(glm.beta_, 'r')
plt.show()

I see this: [figure: gl_glm.beta_ and glm.beta_ (red) plotted together; the two coefficient profiles are close but not identical]

This is especially the case for the second model, which is not helped by a warm start: GLMCV fits the whole reg_lambda path, warm-starting each fit from the previous solution, whereas the standalone GLM starts from scratch at a single lambda.

whoisnnamdi commented 4 years ago

Thanks, I noticed this too: changing the tol and max_iter parameters gets the coefficient values closer.

That said, there are a few instances in your example where GLMCV assigns a non-zero value while GLM does not, or vice versa, which matters if one is using this for variable selection. Is this effectively unavoidable?

jasmainak commented 4 years ago

Well, you need to push the convergence even further. You'll see this warning currently:

/Users/mainak/Documents/github_repos/pyglmnet/pyglmnet/pyglmnet.py:900: UserWarning: Reached max number of iterations without convergence.
  "Reached max number of iterations without convergence.")

Use max_iter=500 and you'll see there is almost no difference between the two.

[figure: with max_iter=500, the two coefficient profiles are nearly indistinguishable]

I agree it's a bit hard to debug this. I wouldn't be opposed to adding a plot_convergence method to the GLM object if that helps you.
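Until something like that exists, here is a rough stand-in using only the public API from the scripts above (a sketch: the iteration budgets are arbitrary, and opt_lambda, Xtrain, ytrain come from the earlier code):

import numpy as np
import matplotlib.pyplot as plt
from pyglmnet import GLM

# refit with growing iteration budgets and track how much the
# coefficients still move from one budget to the next
budgets = [100, 200, 500, 1000]
betas = []
for n in budgets:
    glm_n = GLM(distr="binomial", tol=1e-7, reg_lambda=opt_lambda,
                alpha=1.0, learning_rate=1., max_iter=n)
    glm_n.fit(Xtrain, ytrain)
    betas.append(glm_n.beta_.copy())

# successive coefficient changes; a flat tail suggests convergence
deltas = [np.linalg.norm(b - a) for a, b in zip(betas, betas[1:])]
plt.semilogy(budgets[1:], deltas, 'o-')
plt.xlabel("max_iter")
plt.ylabel("coefficient change vs. previous budget")
plt.show()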

whoisnnamdi commented 4 years ago

Thanks for the help. Yes, a plot_convergence method would be super helpful if it's not too difficult to implement!