glm-tools / pyglmnet

Python implementation of elastic-net regularized generalized linear models
http://glm-tools.github.io/pyglmnet/
MIT License

[MRG] Attempt to make `GLM` compatible with scikit-learn `check_estimator` #364

Closed titipata closed 4 years ago

titipata commented 4 years ago

So far, there are three main parts of the `GLM` class that need fixing to be compatible with scikit-learn's `check_estimator`:

1) initial attributes
2) the `fit_predict` function
3) scipy's `expit`

closes https://github.com/glm-tools/pyglmnet/issues/363
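For reference, this is roughly the kind of check the PR aims to pass (a minimal sketch only; the actual test added in the diff may look different):

    # Minimal sketch of the compatibility target; the actual test in the diff
    # may differ (newer sklearn versions expect an instance, older ones a class).
    from sklearn.utils.estimator_checks import check_estimator

    from pyglmnet import GLM


    def test_glm_estimator():
        # Runs scikit-learn's battery of API checks: get_params/set_params
        # round-trips, fit/predict contracts, input validation, etc.
        check_estimator(GLM())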

jasmainak commented 4 years ago

how are you testing? can you add the check_estimator line somewhere in the diff?

titipata commented 4 years ago

@jasmainak I did add a test script; you can check by running `py.test --cov=pyglmnet tests/`. Note that this will fail a bunch of the original tests since I commented out `fit_predict` from `GLM`.

jasmainak commented 4 years ago

Cool, FYI, I use this command:

$ pytest ./tests/test_pyglmnet.py::test_glm_estimator --pdb

that way you drop right into the console when there is an error :)

jasmainak commented 4 years ago

For whatever reason, X is an array of complex numbers; that's why it is failing.

titipata commented 4 years ago

@jasmainak Thanks for the handy test script. Oh yeah, I see that: z is an ndarray of shape (10,) with a complex number in it...

jasmainak commented 4 years ago

I don't think it's easy to make all the checks happy without adding an sklearn dependency. Basically you have to follow what is in: https://github.com/scikit-learn-contrib/project-template/blob/master/skltemplate/_template.py. I'd like to believe it can be done though :)
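Paraphrasing the linked template, the pattern looks roughly like this (sketch only, not the actual pyglmnet code):

    # Paraphrased from scikit-learn-contrib's project-template; sketch only,
    # not the actual pyglmnet code.
    import numpy as np
    from sklearn.base import BaseEstimator
    from sklearn.utils.validation import check_X_y, check_array, check_is_fitted


    class TemplateEstimator(BaseEstimator):
        def __init__(self, demo_param='demo'):
            # __init__ only stores hyperparameters; no validation happens here
            self.demo_param = demo_param

        def fit(self, X, y):
            # Validate inputs, then set fitted attributes with a trailing underscore
            X, y = check_X_y(X, y, accept_sparse=True)
            self.is_fitted_ = True
            return self

        def predict(self, X):
            # Refuse to predict before fit, then validate X
            X = check_array(X, accept_sparse=True)
            check_is_fitted(self, 'is_fitted_')
            return np.ones(X.shape[0], dtype=np.int64)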

titipata commented 4 years ago

@jasmainak Alright, I tried adding the sklearn dependency so that it follows their template style. Not sure if we also want `BaseEstimator` to come from sklearn?

jasmainak commented 4 years ago

Let's try to first make it work. Use sklearn or whatever is necessary. Then we can make it nice and see if the dependency can be dropped easily :)

titipata commented 4 years ago

@jasmainak Now the error is in `predict_proba`; I'm not sure how to fix it.

titipata commented 4 years ago

@jasmainak Alright, the current commits should work with `check_estimator` now! The part I cannot keep track of is `_allow_refit`. Do the scikit-learn utilities keep track of this fitting/re-fitting?

jasmainak commented 4 years ago

You can remove the refit logic for now. Do all tests pass if you do that? I’ll see what to do with the refit when I get a chance in a couple of hours

titipata commented 4 years ago

@jasmainak With the current commits there are 2 failed tests (test_random_state_consistency and test_api_input), 56 passed, and 32 warnings. Both failures are of the form `Failed: DID NOT RAISE <class 'ValueError'>`.

titipata commented 4 years ago

The test no longer raises an error because of the additional `check_X_y`: it returns an ndarray even when you pass a list in. So the check below in `test_api_input` (in test_pyglmnet.py) no longer raises:

    with pytest.raises(ValueError):
        glm.fit(X, list(y))
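
For illustration (not part of the PR), `check_X_y` hands back an ndarray even when y is a list:

    # Illustrative only, not part of the PR: check_X_y accepts a list for y
    # and returns an ndarray, so no ValueError is raised downstream.
    import numpy as np
    from sklearn.utils.validation import check_X_y

    X = np.random.rand(10, 2)
    y = list(np.random.rand(10))
    X_checked, y_checked = check_X_y(X, y)
    print(type(y_checked))  # <class 'numpy.ndarray'>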

And the failure in test_random_state_consistency comes from commenting out `self._allow_refit = False` after the fit method. Specifically, it comes from:

    match = "This glm object has already been fit"
    with pytest.raises(ValueError, match=match):
        ypred_c = glm_b.fit_predict(Xtrain, ytrain)

titipata commented 4 years ago

Alright @jasmainak, I think this push should be the one using the sklearn dependency. If you can help a bit with the `_allow_refit` logic, that should be it! I will leave it to you for now.

codecov-io commented 4 years ago

Codecov Report

Merging #364 into master will decrease coverage by 0.88%. The diff coverage is 90.9%.

@@            Coverage Diff             @@
##           master     #364      +/-   ##
==========================================
- Coverage   75.66%   74.78%   -0.89%     
==========================================
  Files           4        4              
  Lines         678      686       +8     
  Branches      149      148       -1     
==========================================
  Hits          513      513              
- Misses        128      133       +5     
- Partials       37       40       +3
Impacted Files         Coverage Δ
pyglmnet/pyglmnet.py   79.71% <90.9%>  (-1.54%) :arrow_down:
pyglmnet/base.py       48.38% <0%>     (+3.22%) :arrow_up:

Continue to review the full report at Codecov.

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data. Last update 1a4b9c7...44e0658.

jasmainak commented 4 years ago

sorry I am slow with responding. It has been crazy busy here.

This is impressive progress! Can we try to do away with the sklearn dependency now that all the tests pass? Basically copy the file that is needed into an externals folder and try to make it work? I think if we copy this file it might just work. And we already have BaseEstimator in base.py.

jasmainak commented 4 years ago

I'm fine with ditching the allow refit logic if it helps in making our software more sklearn compatible.

titipata commented 4 years ago

@jasmainak I'll follow up by tomorrow!

I think using `BaseEstimator` from scikit-learn gives you some functionality, e.g. `get_params` and `set_params`, which is needed for use in a `Pipeline` and which the current base.py does not provide.
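
For example, with sklearn-style `get_params`/`set_params` something like this becomes possible (illustrative usage; the parameter values are made up):

    # Illustrative usage; the parameter values here are made up.
    from sklearn.pipeline import Pipeline
    from sklearn.preprocessing import StandardScaler
    from sklearn.model_selection import GridSearchCV

    from pyglmnet import GLM

    pipe = Pipeline([('scale', StandardScaler()),
                     ('glm', GLM(distr='poisson'))])
    # GridSearchCV tunes the nested parameter 'glm__reg_lambda' via set_params
    search = GridSearchCV(pipe, param_grid={'glm__reg_lambda': [0.01, 0.1]},
                          cv=3)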

I'm not quite sure about copying the file that you mentioned. Can you explain a bit more?

jasmainak commented 4 years ago

Basically instead of doing:

    from sklearn.base import BaseEstimator
    from sklearn.utils import check_random_state
    from sklearn.utils.validation import check_X_y, check_array, check_is_fitted

you would do:

    from .externals.sklearn.base import BaseEstimator
    from .externals.sklearn.utils import check_random_state
    from .externals.sklearn.utils.validation import check_X_y, check_array, check_is_fitted

Hopefully it doesn't require you to copy too many files. As long as it's fewer than 5 files, I think we're fine.

jasmainak commented 4 years ago

Also, you can add the get_params and set_params methods from sklearn to our current base.py. I think I had copied what we have from MNE, which had a whittled-down version of what sklearn provides.
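
Something along these lines would do as a whittled-down version (sketch only; it assumes hyperparameters are exactly the explicit keyword arguments of `__init__`, as sklearn does):

    # Sketch of a whittled-down get_params/set_params for base.py; assumes
    # hyperparameters are the explicit keyword arguments of __init__.
    import inspect


    class BaseEstimator(object):

        @classmethod
        def _get_param_names(cls):
            init_signature = inspect.signature(cls.__init__)
            return sorted(p.name for p in init_signature.parameters.values()
                          if p.name != 'self' and p.kind != p.VAR_KEYWORD)

        def get_params(self, deep=True):
            # deep is accepted for API compatibility; nesting is not handled
            return {name: getattr(self, name)
                    for name in self._get_param_names()}

        def set_params(self, **params):
            valid_params = self._get_param_names()
            for name, value in params.items():
                if name not in valid_params:
                    raise ValueError('Invalid parameter %s for estimator %s'
                                     % (name, self.__class__.__name__))
                setattr(self, name, value)
            return self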

titipata commented 4 years ago

@jasmainak, I did add the sklearn externals and also switched the base back to the one from base.py. Not sure if this is what you were expecting. Can you take over from here?

jasmainak commented 4 years ago

This is great!! awesome work @titipata :-) Happy to take over from here. Can you update whats_new.rst so we can acknowledge this in the next release?

jasmainak commented 4 years ago

@titipata you are on fire. I don't think I need to do a pass really. If you just address my last couple of comments (remove the commented-out test), move the line in whats_new.rst, and update the title from WIP to MRG, I am happy to merge this PR once both CIs become green. Thank you so much for the efforts!

titipata commented 4 years ago

Awesome!! That sounds great to me! We might have to write some more test scripts later on :)

jasmainak commented 4 years ago

Yes I am looking forward to more contributions from you in the future!

titipata commented 4 years ago

@jasmainak, so I added the workaround for the `predict_proba` error when using `check_estimator`. It is similar to the logic in sklearn's LogisticRegression.
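
Roughly, the idea is a guard along these lines (sketch only; the attribute names such as `beta_`/`beta0_`, the supported distributions, and the exact fix in the diff may differ):

    # Sketch only, not the actual diff: attribute names and the exact check
    # in the PR may differ from the real pyglmnet code.
    from scipy.special import expit
    from sklearn.utils.validation import check_array, check_is_fitted


    class GLM(object):
        # ... __init__, fit, predict as in pyglmnet ...

        def predict_proba(self, X):
            # Refuse to predict before fit, then validate X
            check_is_fitted(self, 'beta_')
            X = check_array(X)
            if self.distr != 'binomial':
                # Class probabilities only make sense for a binary target
                raise ValueError('predict_proba is only supported for '
                                 'distr="binomial", got %r' % self.distr)
            # Same spirit as LogisticRegression: expit of the linear predictor
            return expit(self.beta0_ + X.dot(self.beta_))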

jasmainak commented 4 years ago

Good to go from your end @titipata ? Everything resolved and works? Please set the PR title to MRG if so!

titipata commented 4 years ago

@jasmainak yes, I guess you can help polish a bit and that should be good to go!

jasmainak commented 4 years ago

I'll merge @titipata when the CIs are green. We're good to go!

titipata commented 4 years ago

Perfect 👍 @jasmainak. We can also poke the JOSS editor right after. This PR should take care of their concerns!

jasmainak commented 4 years ago

yep that's the plan! :)

jasmainak commented 4 years ago

All good, thanks a ton @titipata !