Closed ghost closed 5 years ago
re: flake8 errors, I addressed everything that's touched by this PR (using `git diff main/master -U0 | flake8 --diff -`). The other flake8 issues are leftovers from master.
Edit: this was actually easy to resolve, since the remaining issues are warnings that conflict with the package style (line breaks before/after binary operators, and LaTeX math in the Sphinx documentation).
Merging #278 into master will decrease coverage by 0.32%. The diff coverage is 100%.
```
@@            Coverage Diff             @@
##           master     #278      +/-   ##
==========================================
- Coverage   74.47%   74.15%   -0.33%
==========================================
  Files           4        4
  Lines         623      623
  Branches      127      127
==========================================
- Hits          464      462       -2
- Misses        123      125       +2
  Partials       36       36
```
| Impacted Files | Coverage Δ | |
|---|---|---|
| pyglmnet/pyglmnet.py | 80.28% <100%> (-0.41%) | :arrow_down: |
Continue to review the full report at Codecov.
Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update c962c84...7dfe345. Read the comment docs.
@peterfoley do you mind if I push a commit to conform to the coding style so we can merge this?
@jasmainak I just enabled "allow edits from maintainers" so you can push commits.
@jasmainak I massively pared down this PR since the changes other than cdfast caching were resolved by other PRs over the past few months. Could you re-review?
As a reminder for why `z` from `_cdfast` shouldn't be cached outside `_cdfast` itself: within `fit()`, immediately after `_cdfast` is called and returns `beta` and `z`, `beta` is updated by `self._prox`, which invalidates the cached `z`.
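The invalidation is easy to demonstrate with a toy sketch (plain NumPy, not pyglmnet's actual code; `soft_threshold` here stands in for `self._prox`):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((20, 5))
beta = rng.standard_normal(5)

# Suppose the solver returns z = X @ beta alongside beta and we cache it.
z_cached = X @ beta

def soft_threshold(b, thresh):
    """Toy L1 proximal operator: shrink coefficients toward zero."""
    return np.sign(b) * np.maximum(np.abs(b) - thresh, 0.0)

# Back in fit(), the proximal step modifies beta...
beta = soft_threshold(beta, 0.5)

# ...so the cached z no longer matches the current coefficients.
print(np.allclose(z_cached, X @ beta))  # False: the cache is stale
```

Any consumer of the cached `z` after the prox step would silently use values computed from the pre-prox `beta`.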
@peterfoley thanks for getting back to this. I pushed a test. There is one problem though. On master, if I do:
```
$ py.test --cov=pyglmnet tests/ --verbose --durations=10
```
I get:
```
========================== slowest 10 test durations ===========================
9.38s call tests/test_pyglmnet.py::test_random_state_consistency
9.19s call tests/test_pyglmnet.py::test_cv
8.62s call tests/test_pyglmnet.py::test_glmcv[probit]
2.77s call tests/test_pyglmnet.py::test_glmcv[poisson]
2.12s call tests/test_pyglmnet.py::test_glmcv[softplus]
1.85s call tests/test_pyglmnet.py::test_glmcv[binomial]
1.78s call tests/test_pyglmnet.py::test_glmcv[gamma]
1.66s call tests/test_pyglmnet.py::test_glmcv[gaussian]
0.37s call tests/test_pyglmnet.py::test_fetch_datasets
0.31s call tests/test_metrics.py::test_deviance
```
On this branch, I get:
```
=========================== slowest 10 test durations ============================
65.17s call tests/test_pyglmnet.py::test_glmcv[probit]
18.17s call tests/test_pyglmnet.py::test_glmcv[poisson]
9.96s call tests/test_pyglmnet.py::test_glmcv[softplus]
9.26s call tests/test_pyglmnet.py::test_random_state_consistency
9.00s call tests/test_pyglmnet.py::test_glmcv[binomial]
8.48s call tests/test_pyglmnet.py::test_cv
7.54s call tests/test_pyglmnet.py::test_glmcv[gaussian]
3.67s call tests/test_pyglmnet.py::test_fetch_datasets
1.91s call tests/test_pyglmnet.py::test_glmcv[gamma]
1.78s call tests/test_pyglmnet.py::test_compare_sklearn
```
It's not a proper benchmark, but it seems like a big slowdown, no?
By the way, have you looked at this: https://github.com/glm-tools/pyglmnet/issues/289 ?
Regardless of my comment above about the slowdown, I think your changes are technically correct @peterfoley605. I think the slowdown might be related to the stopping criterion?
If you can update `whats_new.rst` with a clear explanation of what you did, I would be happy to merge this PR now.
Also, just to be sure I understand correctly, there is still caching within one cycle ... just that there is no caching from one cycle of Newton updates to the next. That seems to me to be the correct thing to do.
The large pytest slowdown is surprising given that it adds only one relatively simple linear algebra operation per iteration (you're correct that it still caches within `_cdfast`). I replicated the pytest slowdown on my machine, but I haven't been able to replicate it outside pytest by running the contents of `test_glmnet` in a standalone script. It seems like a pytest-specific thing, perhaps related to GC or memory tracing, since `z`'s memory is now allocated on each iteration?
I'll go ahead with the `whats_new.rst` update and we can get this merged.
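To make the "caching within one cycle, no caching across cycles" point concrete, here is a hedged sketch using a toy least-squares coordinate descent (not pyglmnet's actual `_cdfast`): `z` is rebuilt from the current `beta` at the start of each outer cycle, while cheap rank-one updates still cache it within a sweep.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.standard_normal((50, 10))
y = X @ rng.standard_normal(10) + 0.1 * rng.standard_normal(50)
beta = np.zeros(10)

for _ in range(100):  # outer prox/Newton cycles
    # The fix: recompute z from the current beta once per cycle,
    # instead of carrying a cached z across the prox step.
    z = X @ beta
    for j in range(beta.size):  # coordinate sweep
        step = (X[:, j] @ (z - y)) / (X[:, j] @ X[:, j])
        beta[j] -= step
        z -= step * X[:, j]  # incremental update: caching *within* the cycle
    # prox step that invalidates z until the next recompute
    beta = np.sign(beta) * np.maximum(np.abs(beta) - 0.01, 0.0)

# beta ends up close to a lightly penalized least-squares fit
```

The per-cycle recompute costs one matrix-vector product, which is why the pytest slowdown is surprising relative to the flop count.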
I think it's a question of how fast the algorithm converges rather than the number of flops. We'd need to plot convergence curves to nail it down.
But anyway, I merged it because it fixes an important problem. Thanks for the contribution!
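For the convergence-curve comparison, a toy experiment along these lines could work (plain NumPy, not pyglmnet; the `stale_cache` flag is a hypothetical stand-in for the pre-fix behaviour):

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.standard_normal((100, 20))
y = X @ rng.standard_normal(20)

def cd_losses(n_cycles, stale_cache=False):
    """Record the loss after each outer cycle of toy coordinate descent.

    stale_cache=True carries z across cycles without refreshing it after
    the prox step, mimicking the pre-fix caching behaviour.
    """
    beta = np.zeros(20)
    z = X @ beta
    losses = []
    for _ in range(n_cycles):
        if not stale_cache:
            z = X @ beta  # refresh the cache after the prox step (the fix)
        for j in range(beta.size):
            step = (X[:, j] @ (z - y)) / (X[:, j] @ X[:, j])
            beta[j] -= step
            z -= step * X[:, j]
        beta = np.sign(beta) * np.maximum(np.abs(beta) - 0.05, 0.0)
        losses.append(0.5 * np.sum((X @ beta - y) ** 2))
    return losses

fixed = cd_losses(30)
stale = cd_losses(30, stale_cache=True)
# Plotting losses per cycle for both runs (with any plotting library) would
# show whether the fix changes the convergence rate, not just per-cycle cost.
```

If the slowdown is about convergence rather than flops, the two curves should reach a given tolerance after visibly different numbers of cycles.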
> running the contents of test_glmnet in a standalone script

Could you share this script in a gist or something?
@jasmainak I moved the performance reality-check into a single ipython notebook that creates fresh conda environments and tests two commits: https://gist.github.com/peterfoley605/53d6faa0c491ffbc8a682ff147d1d6e0 or if that's not rendering: https://nbviewer.jupyter.org/gist/peterfoley605/53d6faa0c491ffbc8a682ff147d1d6e0
@peterfoley sorry for the delay getting back here. I could not open your notebook locally on my computer, and the nbviewer link doesn't seem to work either ...
fixes #275
Changes `test_glmnet` to include tests with nonzero lambda. Changes the `GLM(solver='cdfast').fit()` execution path to recalculate `z` each time, since it is actually invalidated on each iteration. To speed up my testing, I also added a manifest to make the package directly pip-installable, which is likely a useful addition even though it's not strictly related to the fix.