glm-tools / pyglmnet

Python implementation of elastic-net regularized generalized linear models
http://glm-tools.github.io/pyglmnet/
MIT License
280 stars 84 forks source link

Fix cdfast convergence #278

Closed ghost closed 5 years ago

ghost commented 5 years ago

fixes #275

changes test_glmnet to include tests with with nonzero lambda. Changes GLM(solver='cdfast').fit() execution path to recalculate z each time since it is actually invalidated on each iteration.

To speed up my testing I also added a manifest to make the package directly pip-installable, which is likely a useful addition even though it's not strictly related to the fix.

ghost commented 5 years ago

re. flake8 errors, I addressed everything that's touched by this PR (using git diff main/master -U0 | flake8 --diff -). The other flake8 issues are leftovers from master.

Edit: this was actually easy to resolve since the remaining issues are warnings that conflict with the package style (line break before/after binary operator, and latex math in Sphinx documentation)

codecov-io commented 5 years ago

Codecov Report

Merging #278 into master will decrease coverage by 0.32%. The diff coverage is 100%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master     #278      +/-   ##
==========================================
- Coverage   74.47%   74.15%   -0.33%     
==========================================
  Files           4        4              
  Lines         623      623              
  Branches      127      127              
==========================================
- Hits          464      462       -2     
- Misses        123      125       +2     
  Partials       36       36
Impacted Files Coverage Δ
pyglmnet/pyglmnet.py 80.28% <100%> (-0.41%) :arrow_down:

Continue to review full report at Codecov.

Legend - Click here to learn more Δ = absolute <relative> (impact), ø = not affected, ? = missing data Powered by Codecov. Last update c962c84...7dfe345. Read the comment docs.

jasmainak commented 5 years ago

@peterfoley do you mind if I push a commit to conform to the coding style so we can merge this?

ghost commented 5 years ago

@jasmainak I just enabled "allow edits from maintainers" so you can push commits.

ghost commented 5 years ago

@jasmainak I massively pared down this PR since the changes other than cdfast caching were resolved by other PRs over the past few months. Could you re-review?

As a reminder for why z from _cdfast shouldn't be cached outside _cdfast itself -- within fit(), immediately after _cdfast is called and returns beta and z, beta is updated by self._prox, which invalidates the cached z

jasmainak commented 5 years ago

@peterfoley thanks for getting back to this. I pushed a test. There is one problem though. On master, if I do:

$ py.test --cov=pyglmnet tests/ --verbose --durations=10

I get:

========================== slowest 10 test durations ===========================
9.38s call     tests/test_pyglmnet.py::test_random_state_consistency
9.19s call     tests/test_pyglmnet.py::test_cv
8.62s call     tests/test_pyglmnet.py::test_glmcv[probit]
2.77s call     tests/test_pyglmnet.py::test_glmcv[poisson]
2.12s call     tests/test_pyglmnet.py::test_glmcv[softplus]
1.85s call     tests/test_pyglmnet.py::test_glmcv[binomial]
1.78s call     tests/test_pyglmnet.py::test_glmcv[gamma]
1.66s call     tests/test_pyglmnet.py::test_glmcv[gaussian]
0.37s call     tests/test_pyglmnet.py::test_fetch_datasets
0.31s call     tests/test_metrics.py::test_deviance

On this branch, I get:

=========================== slowest 10 test durations ============================
65.17s call     tests/test_pyglmnet.py::test_glmcv[probit]
18.17s call     tests/test_pyglmnet.py::test_glmcv[poisson]
9.96s call     tests/test_pyglmnet.py::test_glmcv[softplus]
9.26s call     tests/test_pyglmnet.py::test_random_state_consistency
9.00s call     tests/test_pyglmnet.py::test_glmcv[binomial]
8.48s call     tests/test_pyglmnet.py::test_cv
7.54s call     tests/test_pyglmnet.py::test_glmcv[gaussian]
3.67s call     tests/test_pyglmnet.py::test_fetch_datasets
1.91s call     tests/test_pyglmnet.py::test_glmcv[gamma]
1.78s call     tests/test_pyglmnet.py::test_compare_sklearn

It's not a proper benchmark but it seems a big slowdown, no?

jasmainak commented 5 years ago

By the way, have you looked at this: https://github.com/glm-tools/pyglmnet/issues/289 ?

jasmainak commented 5 years ago

Regardless of my comment above about the slowdown, I think your changes are technically correct @peterfoley605 . I think the slowdown might be related to the stopping criteria?

If you can update the whats_new.rst with a clear explanation of what you did, I would be happy to merge this PR now.

jasmainak commented 5 years ago

Also, just to be sure I understand correctly, there is still caching within one cycle ... just that there is no caching from one cycle of Newton updates to the next. That seems to me to be the correct thing to do.

ghost commented 5 years ago

The large pytest slowdown is surprising given that it adds only one relatively simple linear algebra operation per iteration (you're correct that it still caches within _cdfast).

I replicated the pytest slowdown on my machine, but I haven't been able to replicate it outside pytest by running the contents of test_glmnet in a standalone script. It seems like a pytest-specific thing, perhaps related to GC or memory tracing, since z's memory is now allocated with each iteration?

I'll go ahead with the whats_new.rst update and we can get this merged.

jasmainak commented 5 years ago

I think it's a question of how fast the algorithm converges rather than the number of flops. We'd need to plot convergence curves to nail it down.

But anyway, I merged it because it fixes an important problem. Thanks for the contribution!

jasmainak commented 5 years ago

running the contents of test_glmnet in a standalone script

could you share this script on a gist or something?

ghost commented 5 years ago

@jasmainak I moved the performance reality-check into a single ipython notebook that creates fresh conda environments and tests two commits: https://gist.github.com/peterfoley605/53d6faa0c491ffbc8a682ff147d1d6e0 or if that's not rendering: https://nbviewer.jupyter.org/gist/peterfoley605/53d6faa0c491ffbc8a682ff147d1d6e0

jasmainak commented 5 years ago

@peterfoley sorry in the delay getting back here. I could not open your notebook locally on my computer. And the nbviewer link doesn't seem to work either ...