scikit-learn / scikit-learn

scikit-learn: machine learning in Python
https://scikit-learn.org
BSD 3-Clause "New" or "Revised" License

Add predictive_proba/sparse support to documentation #1107

Closed buma closed 12 years ago

buma commented 12 years ago

I have been using scikit-learn, and it bothered me that the documentation does not mention that, for some classifiers, predict_proba is only available for binary classification problems.

It also bothered me that the documentation does not consistently state whether a classifier supports sparse input.

Can I add this or is someone already working on it?

ogrisel commented 12 years ago

Please feel free to send pull requests on the specific parts of the documentation that are incomplete.

amueller commented 12 years ago

Usually the docstring specifies the input either as "sparse matrix, ndarray" or just "ndarray".
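As an illustration of that convention, the sketch below scans the docstring of an estimator's fit method for the word "sparse". Both classes and the helper `documents_sparse_support` are hypothetical stand-ins written for this example, not real scikit-learn code, and a plain text match is only a heuristic that depends on the docstrings being kept up to date.

```python
import inspect


class DenseOnlyClassifier:
    """Hypothetical estimator (not part of scikit-learn)."""

    def fit(self, X, y):
        """Fit the model.

        Parameters
        ----------
        X : ndarray, shape (n_samples, n_features)
            Training data.
        y : ndarray, shape (n_samples,)
            Target values.
        """
        return self


class SparseFriendlyClassifier:
    """Hypothetical estimator (not part of scikit-learn)."""

    def fit(self, X, y):
        """Fit the model.

        Parameters
        ----------
        X : {ndarray, sparse matrix}, shape (n_samples, n_features)
            Training data.
        y : ndarray, shape (n_samples,)
            Target values.
        """
        return self


def documents_sparse_support(estimator_cls):
    """Return True if the fit docstring mentions sparse input (heuristic)."""
    doc = inspect.getdoc(estimator_cls.fit) or ""
    return "sparse" in doc.lower()
```

A check like this only reports what the docstring claims; it says nothing about whether the estimator actually accepts sparse data.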

buma commented 12 years ago

Thanks, I see now. It seems I hadn't noticed before that sparse support is already specified.

amueller commented 12 years ago

Well, it is a bit hidden. If you have an idea where to document it so that it is more obvious to new users, any suggestions are welcome.

Which was the estimator that only supported predict_proba in the binary case?

amueller commented 12 years ago

Maybe we can add somewhere how sparse support is documented? Like for each estimator, the docstring says ... as I said above. But then this has to be a place that new users definitely read. In the introduction maybe? In the API section? (pretty sure they don't read that).

buma commented 12 years ago

One idea would be a page listing all classifiers that support sparse input.

It was Perceptron and SGDClassifier.

And I just found something weird about SGDClassifier http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.SGDClassifier.html :

This implementation works with data represented as dense numpy arrays of floating point values for the features.

But the docstring of the fit method says that it supports sparse arrays. I used it with sparse arrays, because dense arrays were too big for my memory.


GaelVaroquaux commented 12 years ago

> One idea would be a page listing all classifiers that support sparse input.

The danger is that it would quickly fall out of sync.

> It was Perceptron and SGDClassifier.
>
> And I just found something weird about SGDClassifier http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.SGDClassifier.html :
>
> This implementation works with data represented as dense numpy arrays of floating point values for the features.
>
> But the docstring of the fit method says that it supports sparse arrays. I used it with sparse arrays, because dense arrays were too big for my memory.

Well, maybe you found a place where it fell out of sync :(.

I would suggest generating that part of the documentation, using a sphinx extension similar to the automatic tests set up by Andreas.

amueller commented 12 years ago

Hm ok, I remember the thing about Perceptron and SGDClassifier. If you could provide a pull request that adds this to the docstring of predict_proba, that would be very helpful. I think that would be the right place to have this comment.

About the page for sparse support: There is this PR here for an estimator overview, but I think it is still a bit controversial.

amueller commented 12 years ago

@GaelVaroquaux haha I think auto-generating might actually be feasible. It will not be possible to capture something like "only supports predict_proba in the binary case", but it could give a good overview.
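A minimal sketch of what such an auto-generated overview might look like, assuming it is built by introspecting estimator classes. The two classifiers, the `METHODS` tuple, and the helper functions are hypothetical names invented for this example; a real version would need to discover scikit-learn's estimators and run inside a Sphinx extension, which is not shown here.

```python
# Methods whose presence the overview reports (an illustrative choice).
METHODS = ("predict", "predict_proba", "decision_function")


class BinaryProbaClassifier:
    """Hypothetical classifier that exposes predict_proba."""

    def fit(self, X, y):
        return self

    def predict(self, X):
        return [0 for _ in X]

    def predict_proba(self, X):
        return [[1.0, 0.0] for _ in X]


class NoProbaClassifier:
    """Hypothetical classifier without predict_proba."""

    def fit(self, X, y):
        return self

    def predict(self, X):
        return [0 for _ in X]


def capability_row(estimator_cls):
    """Record which of METHODS the class defines."""
    row = {"estimator": estimator_cls.__name__}
    for name in METHODS:
        row[name] = callable(getattr(estimator_cls, name, None))
    return row


def format_overview(estimator_classes):
    """Render a plain-text table, one line per estimator, sorted by name."""
    lines = [" | ".join(("estimator",) + METHODS)]
    for cls in sorted(estimator_classes, key=lambda c: c.__name__):
        row = capability_row(cls)
        cells = [row["estimator"]]
        cells += ["yes" if row[name] else "no" for name in METHODS]
        lines.append(" | ".join(cells))
    return "\n".join(lines)


print(format_overview([NoProbaClassifier, BinaryProbaClassifier]))
```

Because the table only records whether a method exists, it cannot express restrictions such as "binary case only"; those would still have to live in the docstring.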

ogrisel commented 12 years ago

The SGDClassifier docstring discrepancies probably stem from a docstring that was not updated when we merged the dense and sparse codebases into a single code base.

fmailhot commented 12 years ago

Sorry about the blank message previously, too quick on the "Send".

On 3 September 2012 09:40, Andreas Mueller notifications@github.com wrote:

> Hm ok, I remember the thing about Perceptron and SGDClassifier.

Just wanted to point out that there is an existing discussion about predict_proba in the multiclass case:

http://comments.gmane.org/gmane.comp.python.scikit-learn/3562
http://comments.gmane.org/gmane.comp.python.scikit-learn/3381

and a WIP (that seems to have stalled a couple of months ago):

https://github.com/scikit-learn/scikit-learn/pull/849

Cheers, Fred.

amueller commented 12 years ago

Yeah, the WIP by Peter is actually quite important to me... but this is another issue ;)

buma commented 12 years ago

Interesting idea about the classifier page. It should probably be generated.

buma commented 12 years ago

First try at a pull request. I didn't find the code for the Perceptron. Will the documentation also be updated for Perceptron, because the class is derived from SGDClassifier?

Sphinx version: 1.1.3
Python version: 2.7.3
Docutils version: 0.9.1 release
Jinja2 version: 2.6

I tried to view the documentation with make html, but I get an error from Sphinx:

writing output... [ 31%] datasets/index
Exception occurred:
  File "/usr/lib/python2.7/site-packages/docutils/writers/html4css1/__init__.py", line 1026, in visit_image
    and self.settings.file_insertion_enabled):
AttributeError: Values instance has no attribute 'file_insertion_enabled'

amueller commented 12 years ago

Yes, the predict_proba function is the same. The code for the perceptron is in linear_model/perceptron.py. I don't know about the sphinx error. Btw, you can usually just make. make html builds the pictures and is pretty slow.

amueller commented 12 years ago

@buma Thanks for the correction in the docs. The classifier summary is already an open issue, so I guess we can close this one, ok?

buma commented 12 years ago

Thanks for merging. Yes, I think it should be closed.