scikit-learn / scikit-learn

scikit-learn: machine learning in Python
https://scikit-learn.org
BSD 3-Clause "New" or "Revised" License
59.44k stars 25.26k forks source link

Bayesian Logistic Regression / Logistic Regression with ARD #6259

Closed AlexYurasov closed 5 months ago

AlexYurasov commented 8 years ago

There are Bayesian Linear Regression and ARD regression in scikit, are there any plans to include Bayesian / ARD Logistic Regression? ARD version will be really helpful for identifying relevant features.

agramfort commented 8 years ago

there is no plan AFAIK. But an implementation with a sklearn API that passes the check_estimator

http://scikit-learn.org/stable/modules/generated/sklearn.utils.estimator_checks.check_estimator.html

that one can find on github would be nice to have.

AmazaspShumik commented 8 years ago

@agramfort

Hi, I have implemented ARD Logistic Regression with sklearn API. I also made small ipython notebook comparing perfomance of ARD, L1, L2 logistic regressions.

code: https://github.com/AmazaspShumik/Bayesian-Regression-Methods/blob/master/Fast%20Relevance%20Vector%20Machine/fast_scikit_rvm.py

ipython notebook: https://github.com/AmazaspShumik/Bayesian-Regression-Methods/blob/master/Fast%20Relevance%20Vector%20Machine/ard_classification_demo.ipynb

agramfort commented 8 years ago

you should make a proper package with tests and example checking sklearn API.

then we could add it to the list of related projects

AmazaspShumik commented 8 years ago

@agramfort

Hi , sorry for delay I was working on tests & ipython examples. Here is link to package: https://github.com/AmazaspShumik/sklearn_bayes .

amueller commented 7 years ago

@agramfort why not get this into scikit-learn?

amueller commented 7 years ago

@AmazaspShumik do you want to get this into scikit-learn-contrib? https://github.com/scikit-learn-contrib/scikit-learn-contrib/blob/master/README.md

amueller commented 7 years ago

Also see #1513

AmazaspShumik commented 7 years ago

@amueller @agramfort @AlexYurasov

Hi,

I will be more than happy to contribute. I am on vacation now so if it is ok I can start working on it on Monday.

agramfort commented 7 years ago

no objection to move this into sklearn but it's easier to start with an outside package.

amueller commented 7 years ago

There is a package, though it doesn't fulfil our testing and doc standard. I haven't check api compatibility. @AmazaspShumik can you include check_estimator in your test-suite? You don't have continuous integration set up, do you?

I kind of want all these models but maybe @agramfort is right about going to scikit-learn-contrib first. Maybe @GaelVaroquaux and @jnothman have opinions?

AmazaspShumik commented 7 years ago

Hi @amueller @agramfort @GaelVaroquaux

1) I do not have continuous integration (but I plan to include it, I actually tried to do it before with Travis CI )

2) I will update test suite. However, I used check_estimator and all regression and classification
algorithms (with the exception of RVR ) that are currently included in sklearn-bayes package pass check_estimator.

In case you are interested here is my plan for package updates in next 2-3 weeks:

New Algorithms I plan to include:

GaelVaroquaux commented 7 years ago

Maybe @GaelVaroquaux and @jnothman have opinions?

It seemed to me that these types of models has mostly lost steam lately.

amueller commented 7 years ago

@GaelVaroquaux in academia but not in industry, I think. I haven't seen an extensive comparison but people really like linear models.

GaelVaroquaux commented 7 years ago

I really like linear models :). I was more thinking of the Bayesian thing. Is it still used in the industry? I thought that it never had been used.

I do know some cases where it has an edge (sparse recovery with a noise with known covariance, and the covariance is injected in the model), but AFAIK they are rare.

amueller commented 7 years ago

I think ARD is helpful in particular with unknown covariance. I talk to people and they tell me they use a thing because... you know it works? I think people rarely take the time to do full comparisons. My boss at Amazon was really into EP ;) And I think it was a hedge fund that recently asked me if we want their implementation for EP.

AmazaspShumik commented 7 years ago

@amueller

Hi, really sorry it took me so long! It is really difficult to make some parts of my package (skbayes) to pass check_estimator(), for instance, Bernoulli and Poisson Mixture models accept only specific type of inputs and fail to pass tests. Other types of models, like Restricted Boltzmann Machines and Latent Dirichlet Allocation, are already included in sklearn .

So I just wanted to ask will it be ok if I will create a separate package that includes only ARD / Variational ARD / Linear models? I significantly improved all of these models over last month, all of them pass check_estimator and are python 3 compatible, so it should be relatively easy to create such package to satisfy criteria for scikit-learn-contrib.

Thanks!

jnothman commented 7 years ago

I don't think we can make check_estimator a hard criterion, for the reasons you state: it's not flexible enough (yet). So don't worry too much if you can't make it entirely pass, but please do comment, perhaps at #6715 on the ways in which check_estimator needs to be more flexible to admit your estimators...

On 16 November 2016 at 04:37, Amazasp notifications@github.com wrote:

@amueller https://github.com/amueller

Hi, really sorry it took me so long! It is really difficult to make some parts of my package (skbayes) to pass check_estimator(), for instance, Bernoulli and Poisson Mixture models accept only specific type of inputs and fail to pass tests. Other types of models, like Restricted Boltzmann Machines and Latent Dirichlet Allocation, are already included in sklearn .

So I just wanted to ask will it be ok if I will create a separate package that includes only ARD / Variational ARD / Linear models? I significantly improved all of these models over last month, all of them pass check_estimator and are python 3 compatible, so it should be relatively easy to create such package to satisfy criteria for scikit-learn-contrib.

Thanks!

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/scikit-learn/scikit-learn/issues/6259#issuecomment-260711056, or mute the thread https://github.com/notifications/unsubscribe-auth/AAEz62zE8wLyEORgczKICpotyzG1jWFmks5q-e33gaJpZM4HQPyX .

amueller commented 7 years ago

I totally agree with @jnothman. If you'd like to move your whole project, we can do that for now even if it doesn't entirely pass check_estimator. If you'd rather move out parts as a separate project, that's fine, too.

I'm also a bit curious about how your ARD compares to our ARD, and similar for your BayesianRidge. Our "legacy" implementations are probably not that great. @agramfort suggested taking the one from MNEPython in #2542 though I'm not sure that's still relevant? If it is, we should compare all three and make sure that the "best" is in sklearn.

AmazaspShumik commented 7 years ago

@amueller Sorry for my late reply.

1) Moving Project

2) Comparison with "legacy" implementation.

3) Comparison with MNEPython implementation

jnothman commented 7 years ago

Apart from check_estimator, have you taken a look at the workflow at https://github.com/scikit-learn-contrib/scikit-learn-contrib/blob/master/workflow.md? A request for inclusion there would provide a dedicated space for discussion of any issues, but I expect there'll be few.

On 15 January 2017 at 16:00, Amazasp notifications@github.com wrote:

@amueller https://github.com/amueller Sorry for my late reply.

1.

Moving Project @amueller https://github.com/amueller @jnothman https://github.com/jnothman I would like to move the whole project now if it is an option. Currently, I have little time to work on it, so maybe other people will contribute! I will have more time in February to improve package and work on appropriate tests. 2.

Comparison with "legacy" implementation.

  • ARD: In cases, when n_samples <= n_features, both algorithms have similar MSE, but "legacy" implementation is significantly slower. In cases, when n_samples >> n_features both algorithms obtain bad results. (Note that my implementation of ARD is based on RVM article by Tipping, where input is kernel matrix i.e. it has size of (n_samples,n_samples ) )
  • Also while "legacy" algorithms do not provide an opportunity to define prior, it can be done in some algorithms included in skbayes (for instance VBRegressionARD and VBLinearRegression) .

    1. Comparison with MNEPython implementation

As far as I understand MNEPython implementation is based on the paper by Wipf, if that is the case, then it should have much better accuracy for cases when n_features >> n_samples since it is less likely to be trapped in local minimum for high dimensional data. But I am not sure how fast it is.

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/scikit-learn/scikit-learn/issues/6259#issuecomment-272674007, or mute the thread https://github.com/notifications/unsubscribe-auth/AAEz6yjnC5nOaJD29X-fh5d35IGjjAEoks5rSafvgaJpZM4HQPyX .

AmazaspShumik commented 7 years ago

@jnothman Thanks for quick answer!

I just had a look, is it ok to make a request for inclusion if there are still some problems, and then work & discuss them?

I still have some issues with Travis CI (it fails every time I include c extensions for GibbsLDA and Hidden Markov Models, I am still trying to figure out why) Three algorithms in the package are not Python 3 compatible (this should be easy to resolve, it is mainly because of "print"). I would probably need some more tests for unsupervised learning algorithms in package, which do not pass check_estimator (all supervised learning algorithms pass check_estimator)

jnothman commented 7 years ago

Best to sort out basic things like Py2/3 in before submitting, but otherwise, issues can be discussed.

On 16 January 2017 at 07:40, Amazasp notifications@github.com wrote:

@jnothman https://github.com/jnothman Thanks for quick answer!

I just had a look, is it ok to make a request for inclusion if there are still some problems, and then work & discuss them?

I still have some issues with Travis CI (it fails every time I include c extensions for GibbsLDA and Hidden Markov Models, I am still trying to figure out why) Three algorithms in the package are not Python 3 compatible (this should be easy to resolve, it is mainly because of "print"). I would probably need some more tests for unsupervised learning algorithms in package, which do not pass check_estimator (all supervised learning algorithms pass check_estimator)

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/scikit-learn/scikit-learn/issues/6259#issuecomment-272722073, or mute the thread https://github.com/notifications/unsubscribe-auth/AAEz645MLOtTtOHrjQVe8mVDoMY0Ttdfks5rSoQhgaJpZM4HQPyX .

AmazaspShumik commented 7 years ago

@jnothman

I fixed issues with Python 2/3 and extension modules, build passes in Python 2.7 and Python 3.4 / 3.5 (using Travis CI).

However, as I mentioned before I have little tests for unsupervised learning algorithms. I am busy currently, so I will need a little time writing them. Is it ok if I will come back to you at the end of this week?

Thanks)

jnothman commented 7 years ago

There's no rush!

I should note that one of my main concerns about sklearn-bayes is the lack of unit tests. (This is my concern more as someone who would like to be able to promote sklearn-bayes to my colleagues doing applied bayesian modelling.) I appreciate that much of the functionality is tested in ipynb, and that perhaps basic reasonable behaviour is tested by scikit-learn's check_estimator. (Perhaps the IPython notebooks could be included in a continuous integration test harness, but there are challenges in that.) It would be great if there were tests that the models were reasonable (at PPD expectation and in distribution) on toy / known datasets (with one iteration, if possible; and at convergence), that methods predict and predict_dist agree with each other, that model attributes are present and have the expected shape, that they emit appropriate warnings if convergence demonstrably fails, etc.

If you don't have the time to do this, but might have time to review it, I wouldn't mind if you put out a call for contributors to the scikit-learn mailing list.

On 17 January 2017 at 08:22, Amazasp notifications@github.com wrote:

@jnothman https://github.com/jnothman

I fixed issues with Python 2/3 and extension modules, build passes in Python 2.7 and Python 3.4 / 3.5 (using Travis CI).

However, as I mentioned before I have little tests for unsupervised learning algorithms. I am busy currently, so I will need a little time writing them. Is it ok if I will come back to you at the end of this week?

Thanks)

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/scikit-learn/scikit-learn/issues/6259#issuecomment-272964929, or mute the thread https://github.com/notifications/unsubscribe-auth/AAEz673eNDjIPtEeaFTYfTdRQxwRVZwsks5rS9-cgaJpZM4HQPyX .

AmazaspShumik commented 7 years ago

@jnothman Thank you very much for the advice I appreciate it.

I think I can add more robust testing for all supervised learning algorithms (they constitute about half of the package) in the span of two weeks. As for the rest of the package, it may be more difficult, because there is a substantial amount of code refactoring to do so that I might need help with that.

The problem is that some algorithms in my package already exist in scikit-learn (like RBM and LDA) or other great packages (like Bayesian HMM) and there is no point to spend time on them. So I am not sure it will make any sense to work on the whole package, maybe it will be reasonable to separate supervised learning and mixture models into a separate package which fully satisfies conditions for inclusion into scikit-learn contrib? ( I can do that in one month probably). Then new package can gradually grow and include more and more Bayesian algorithms.

What do you think @jnothman?

jnothman commented 7 years ago

Yes, I think that's a reasonable conclusion.

adrinjalali commented 5 months ago

Seems like this is not something we'd be easily adding to scikit-learn since I don't see much recurring request for it, and having it somewhere else might be ideal. So closing this one.