glm-tools / pyglmnet

Python implementation of elastic-net regularized generalized linear models
http://glm-tools.github.io/pyglmnet/
MIT License
279 stars 83 forks source link

[MRG] deprecations, dependencies, convergence, examples #263

Closed pavanramkumar closed 5 years ago

pavanramkumar commented 6 years ago

We only test for scikit-learn 0.18, scipy 0.17, numpy 1.11 (see .travis.yml)

I think for the scope of our package, that is ok, given it's been quite a long time since those versions have come out.

We shouldn't strive for backward compatibility when it makes support and testing painful.

Closes #69 and #239

jasmainak commented 6 years ago

I think we should just support >=0.18 for scikit-learn. So, you can change everything to use modelselection module. Don't merge this yet. I will test it tomorrow. I think it's a little less trivial than this ...

codecov-io commented 6 years ago

Codecov Report

Merging #263 into master will not change coverage. The diff coverage is n/a.

Impacted file tree graph

@@           Coverage Diff           @@
##           master     #263   +/-   ##
=======================================
  Coverage   72.97%   72.97%           
=======================================
  Files           4        4           
  Lines         618      618           
  Branches      125      125           
=======================================
  Hits          451      451           
- Misses        127      128    +1     
+ Partials       40       39    -1
Impacted Files Coverage Δ
pyglmnet/pyglmnet.py 78.68% <0%> (ø) :arrow_up:

Continue to review full report at Codecov.

Legend - Click here to learn more Δ = absolute <relative> (impact), ø = not affected, ? = missing data Powered by Codecov. Last update a7f3f2c...b03b979. Read the comment docs.

pavanramkumar commented 6 years ago

@jasmainak sure!

right now, we have a check_version function in our base.py and decide import based on that. we can significantly simplify this if we decide to support >=0.18. i'll wait for you to sort out those issues before merging.

jasmainak commented 6 years ago

@pavanramkumar so I had a look. You also need to update the examples so that they use modern sklearn API

pavanramkumar commented 6 years ago

@jasmainak pushed example changes. so no change needed to check_version in base.py and its usage in pyglmnet.py?

jasmainak commented 6 years ago

@pavanramkumar the CircleCI build was not triggered for some reason ...

jasmainak commented 6 years ago

so no change needed to check_version in base.py and its usage in pyglmnet.py?

I think we can leave it as such

jasmainak commented 6 years ago

@pavanramkumar I get the following error when I build the example:

/home/mainak/Desktop/projects/github_repos/pyglmnet/pyglmnet/pyglmnet.py:575: RuntimeWarning: invalid value encountered in greater                
  (np.abs(beta) > thresh)
WARNING: /home/mainak/Desktop/projects/github_repos/pyglmnet/examples/plot_community_crime.py failed to execute correctly:Traceback (most recent call last):
  File "/home/mainak/anaconda2/lib/python2.7/site-packages/sphinx_gallery/gen_rst.py", line 450, in execute_code_block
    exec(code_block, example_globals)
  File "<string>", line 14, in <module>
  File "/home/mainak/anaconda2/lib/python2.7/site-packages/sklearn/model_selection/_search.py", line 639, in fit
    cv.split(X, y, groups)))
  File "/home/mainak/anaconda2/lib/python2.7/site-packages/sklearn/model_selection/_split.py", line 332, in split
    for train, test in super(_BaseKFold, self).split(X, y, groups):
  File "/home/mainak/anaconda2/lib/python2.7/site-packages/sklearn/model_selection/_split.py", line 95, in split
    for test_index in self._iter_test_masks(X, y, groups):
  File "/home/mainak/anaconda2/lib/python2.7/site-packages/sklearn/model_selection/_split.py", line 634, in _iter_test_masks
    test_folds = self._make_test_folds(X, y)
  File "/home/mainak/anaconda2/lib/python2.7/site-packages/sklearn/model_selection/_split.py", line 589, in _make_test_folds
    allowed_target_types, type_of_target_y))
ValueError: Supported target types are: ('binary', 'multiclass'). Got 'continuous' instead.
pavanramkumar commented 5 years ago

@jasmainak where did you build this example? locally or on circleCI?

jasmainak commented 5 years ago

@pavanramkumar locally because as you can see CircleCI was not triggered for this pull request ...

pavanramkumar commented 5 years ago

closes #267 and some circle CI issues. expect circle CI to break here because of the community crime dataset issue

pavanramkumar commented 5 years ago

Appendix 0 here gives some insights: https://web.stanford.edu/~hastie/Papers/Glmnet_Vignette.pdf

TLDR is: set convergence threshold on change in predicted target variable

jasmainak commented 5 years ago

@pavanramkumar I can't test things on this branch because examples are broken. Which example do you suggest I try? It seems the community crime data has a broken URL. Should we upload the data on dropbox or something? I don't know about license ...

jasmainak commented 5 years ago

@pavanramkumar I ran the group lass example with a slight modification:

from pyglmnet import GLM
from pyglmnet.datasets import fetch_group_lasso_datasets
import matplotlib.pyplot as plt

df, group_idxs = fetch_group_lasso_datasets()
print(df.head())

from sklearn.model_selection import train_test_split # noqa

X = df[df.columns.difference(["Label"])].values
y = df.loc[:, "Label"].values

Xtrain, Xtest, ytrain, ytest = \
    train_test_split(X, y, test_size=0.2, random_state=42)

from pyglmnet import _loss
import numpy as np
loss_trace = list()
distr = 'binomial'
alpha = 1.0
group = group_idxs
reg_lambda = 1e-3

def callback(beta):
    Tau = None
    eta = 2.0

    loss_trace.append(
        _loss(distr, alpha, Tau, reg_lambda,
              Xtrain, ytrain, eta, np.array(group), beta))

# set up the group lasso GLM model
gl_glm = GLM(distr=distr, tol=1e-7, max_iter=1000,
             group=group_idxs, score_metric="pseudo_R2",
             alpha=alpha, reg_lambda=reg_lambda,
             callback=callback)
gl_glm.fit(Xtrain, ytrain)
plt.plot(loss_trace)
plt.show()

I got this plot:

loss

Looking at this, I would say it should not stop before 600 iterations. But when I run it with the default tol=1e-3 it stops at 40 iterations (!) That is too early ...

pavanramkumar commented 5 years ago

@jasmainak let's break up this PR into 3 PRs:

jasmainak commented 5 years ago

Appendix 0 here gives some insights: https://web.stanford.edu/~hastie/Papers/Glmnet_Vignette.pdf

This looks like a great resource! And the documentation is pretty neat

jasmainak commented 5 years ago

@pavanramkumar that sounds like a plan! Do you want to remove that last commit by interactive rebase? (Make sure to make a backup of the branch so that you can cherry-pick the commit later) Or if you need my help for git kung-fu let me know :)

codecov-io commented 5 years ago

Codecov Report

Merging #263 into master will decrease coverage by 0.76%. The diff coverage is n/a.

Impacted file tree graph

@@            Coverage Diff             @@
##           master     #263      +/-   ##
==========================================
- Coverage   73.74%   72.97%   -0.77%     
==========================================
  Files           4        4              
  Lines         617      618       +1     
  Branches      124      125       +1     
==========================================
- Hits          455      451       -4     
- Misses        123      128       +5     
  Partials       39       39
Impacted Files Coverage Δ
pyglmnet/pyglmnet.py 78.68% <0%> (-0.99%) :arrow_down:

Continue to review full report at Codecov.

Legend - Click here to learn more Δ = absolute <relative> (impact), ø = not affected, ? = missing data Powered by Codecov. Last update 5c97535...ca45b24. Read the comment docs.

codecov-io commented 5 years ago

Codecov Report

Merging #263 into master will decrease coverage by 0.76%. The diff coverage is n/a.

Impacted file tree graph

@@            Coverage Diff             @@
##           master     #263      +/-   ##
==========================================
- Coverage   73.74%   72.97%   -0.77%     
==========================================
  Files           4        4              
  Lines         617      618       +1     
  Branches      124      125       +1     
==========================================
- Hits          455      451       -4     
- Misses        123      128       +5     
  Partials       39       39
Impacted Files Coverage Δ
pyglmnet/pyglmnet.py 78.68% <0%> (-0.99%) :arrow_down:

Continue to review full report at Codecov.

Legend - Click here to learn more Δ = absolute <relative> (impact), ø = not affected, ? = missing data Powered by Codecov. Last update 5c97535...fce952e. Read the comment docs.

jasmainak commented 5 years ago

@pavanramkumar rebasing done. Let's wait for the CIs to be happy.

I pushed a backup of your branch here: https://github.com/jasmainak/pyglmnet/tree/deprecations_backup

jasmainak commented 5 years ago

okay PR is good to go from my end @pavanramkumar . Merging so we can get some dopamine in our brains and then work on the rest in other PRs :)