glm-tools / pyglmnet

Python implementation of elastic-net regularized generalized linear models
http://glm-tools.github.io/pyglmnet/
MIT License

failing comparison of elastic net with scikit-learn #275

Closed · lorentzenchr closed this issue 5 years ago

lorentzenchr commented 5 years ago

I tried to compare pyglmnet.GLM with ElasticNet from scikit-learn and could not get it to work. Code to reproduce:

import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import ElasticNet
from pyglmnet import GLM

def rmse(a, b):
    return np.sqrt(np.mean((a - b) ** 2))

X, Y, coef_ = make_regression(
    n_samples=1000, n_features=1000,
    noise=0.1, n_informative=10, coef=True,
    random_state=42)

alpha = 0.1
l1_ratio = 0.5

sk = ElasticNet(alpha=alpha, l1_ratio=l1_ratio, tol=1e-5).fit(X, Y)
pg = GLM(distr='gaussian', alpha=l1_ratio, reg_lambda=alpha,
         solver='cdfast', tol=1e-5).fit(X, Y)
print('in-sample rmse sklearn = {}, pyglmnet = {}'.format(
    rmse(Y, sk.predict(X)), rmse(Y, pg.predict(X))))

Result: in-sample rmse sklearn = 12.756054997216442, pyglmnet = 161.03496460055504

jasmainak commented 5 years ago

Thanks for the bug report, and for finding the right tab ;)

I modified your code so that it didn't fail on an import. I can confirm we have a problem. It seems the issue is with the stopping criterion. @pavanramkumar do you have time to take a look, since you implemented this last? Let me try to help you:

import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import ElasticNet

from pyglmnet import GLM
from pyglmnet import _loss

import matplotlib.pyplot as plt

def rmse(a, b):
    return np.sqrt(np.mean((a - b) ** 2))

X, Y, coef_ = make_regression(
    n_samples=1000, n_features=1000,
    noise=0.1, n_informative=10, coef=True,
    random_state=42, tail_strength=0)

loss_trace = list()

alpha = 0.1
l1_ratio = 0.5

def callback(beta):
    # record the penalized loss at the current iterate; note that
    # pyglmnet's "alpha" is this script's l1_ratio, and its
    # "reg_lambda" is this script's alpha (see the GLM call below)
    Tau = None
    eta = 2.0
    group = None

    loss_trace.append(
        _loss('gaussian', l1_ratio, Tau, alpha,
              X, Y, eta, group, beta))

sk = ElasticNet(alpha=alpha, l1_ratio=l1_ratio,
                tol=1e-5).fit(X, Y)
pg = GLM(distr='gaussian', alpha=l1_ratio, reg_lambda=alpha,
         solver='cdfast', tol=1e-5, learning_rate=1e-2,
         callback=callback, max_iter=1000).fit(X, Y)
y_sk = sk.predict(X)
y_pg = pg.predict(X)

plt.plot(loss_trace)
plt.show()

print('in-sample rmse sklearn = {}, pyglmnet = {}'.format(
    rmse(Y, y_sk), rmse(Y, y_pg)))

If you set max_iter=7, you get pretty much what sklearn gives.

jasmainak commented 5 years ago

@pavanramkumar did you get a chance to get back to this?

ghost commented 5 years ago

Within GLM.fit() it looks like the z cache gets out-of-sync with beta, which violates assumptions self._cdfast() makes about z.

Recalculating z = beta[0] + np.dot(X, beta[1:]) immediately before the call to self._cdfast() seems to prevent divergence. I suspect that it's because beta[1:] is later modified by self._prox() but the z cache isn't updated to reflect those changes.
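
For anyone following along, here is a self-contained sketch of the stale-cache pattern being described. The gradient and prox steps below are illustrative stand-ins, not pyglmnet's actual internals; the point is only the invariant that z must equal beta[0] + X @ beta[1:] whenever the descent step runs:

import numpy as np

rng = np.random.RandomState(0)
X = rng.randn(50, 5)
y = X @ rng.randn(5) + 0.1 * rng.randn(50)
beta = np.zeros(6)  # beta[0] is the intercept

def soft_threshold(b, thresh):
    # proximal operator of the L1 penalty (elementwise soft-thresholding)
    return np.sign(b) * np.maximum(np.abs(b) - thresh, 0.0)

z = beta[0] + X @ beta[1:]  # cached linear predictor
for t in range(100):
    # descent step that trusts the cache: only valid if z is in sync
    grad = X.T @ (z - y) / len(y)
    beta[1:] -= 0.1 * grad
    z = beta[0] + X @ beta[1:]
    # the prox step modifies beta[1:] but not z -- this is where the
    # cache goes stale inside GLM.fit()
    beta[1:] = soft_threshold(beta[1:], 0.01)
    # the fix: refresh z so the next iteration's gradient is computed
    # against the residual of the current beta
    z = beta[0] + X @ beta[1:]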

jasmainak commented 5 years ago

hmm ... it's weird that this isn't caught by any of the tests. We already have tests that check that the loss goes down. Could you make a PR with an updated test? Thanks.

ghost commented 5 years ago

I'm wrapping up a branch/PR with an updated test that makes the loss check kick in by using a nonzero reg_lambda, and I've confirmed that skipping the caching of z resolves the divergence. I'll push everything up shortly.
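
For reference, a rough sketch of what such a test could look like, reusing the callback/_loss pattern from the script earlier in this thread. The test name and tolerance here are hypothetical; the actual PR may differ:

import numpy as np
from sklearn.datasets import make_regression
from pyglmnet import GLM, _loss

def test_loss_decreases_with_nonzero_reg_lambda():
    X, y = make_regression(n_samples=100, n_features=20, n_informative=5,
                           noise=0.1, random_state=42)
    reg_lambda, alpha = 0.1, 0.5
    loss_trace = []

    def callback(beta):
        # same argument order as the _loss call earlier in this thread
        loss_trace.append(
            _loss('gaussian', alpha, None, reg_lambda,
                  X, y, 2.0, None, beta))

    GLM(distr='gaussian', alpha=alpha, reg_lambda=reg_lambda,
        solver='cdfast', max_iter=100, callback=callback).fit(X, y)
    # with a nonzero reg_lambda, the penalized loss should not increase
    assert np.all(np.diff(loss_trace) <= 1e-7)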