mblondel opened this issue 11 years ago
Are you familiar enough with Weka (or anything else) to write idiomatic, equivalent code? We have to make sure the comparison is fair.
I like the idea, however I am not sure Weka would lead to a fair comparison. Most users use Weka through the GUI: there is no code to write! You can of course import Weka packages and write Java programs yourself, but I don't think most users do that.
I think R would make a better comparison. It is quite popular, has tons of machine learning packages and compares with Python as a scientific ecosystem. The user interface is also the same as Python (either with scripts or with a shell). One issue (that we may want to highlight) is that R algorithms do not share any common interface: it might be quite easy to come up with an unfair example because one of those algorithms has an odd interface. The same goes for Matlab...
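The interface point could be made concrete: every scikit-learn estimator exposes the same `fit`/`predict`/`score` methods, so models are interchangeable in a way R packages generally are not. A minimal sketch (illustrative only, not taken from the paper):

```python
# Illustrative sketch of scikit-learn's uniform estimator interface:
# two unrelated models are trained and scored through identical calls.
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

scores = {}
for model in (LogisticRegression(max_iter=1000), DecisionTreeClassifier()):
    model.fit(X, y)                      # same call for every estimator
    scores[type(model).__name__] = model.score(X, y)
```

Swapping in any other classifier requires changing only the constructor line, which is exactly the kind of consistency an unfair R example could obscure.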
We're almost at the 15 pages limit. Is this something we might want to drop?
> I like the idea, however I am not sure Weka would lead to a fair comparison. Most users use Weka through the GUI: there is no code to write! You can of course import Weka packages and write Java programs yourself, but I don't think most users do that.
I know a few people who do their research in Java and use the Weka API...
> I think R would make a better comparison.
Comparing scikit-learn to R doesn't make sense to me. R is a language and statistical environment...
> We're almost at the 15 pages limit. Is this something we might want to drop?
If we have enough space, I think it's important to compare with what has been done before (just like for any paper)...
We could compare to (or at least mention) Torch7
They have a NIPS workshop paper to cite.
Another candidate for comparison is milk. This would illustrate the model / estimator separation. The downside is that they don't appear to have a paper to cite.
After some thought, maybe comparing with Weka is the right thing to do after all. It is one of the most popular machine learning packages, while Torch7, milk and others are clearly less used.
I am also wondering if the comparison with Gensim shouldn't be shortened. This is relevant, but I admit that I had never heard of it before.
Emphasizing scikit-learn's integration in the scientific Python ecosystem (e.g., with NLTK) is also important in my opinion. (As such, we may need to change the section title depending on what we decide to include.)
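That integration could be illustrated along these lines (a generic sketch, not the paper's example; it shows the NumPy/SciPy interop rather than NLTK specifically):

```python
# Sketch of ecosystem integration: scikit-learn consumes the NumPy
# arrays and SciPy sparse matrices used throughout scientific Python,
# so tools compose without conversion layers. (Illustrative only.)
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

docs = ["machine learning in python",
        "statistics in r",
        "python machine learning tools"]
labels = np.array([1, 0, 1])

X = CountVectorizer().fit_transform(docs)    # SciPy sparse matrix
clf = MultinomialNB().fit(X, labels)         # accepts sparse input directly
predictions = clf.predict(X)                 # plain NumPy array out
```

The same hand-off works in the other direction: anything that produces NumPy arrays (NLTK feature extractors included, via wrappers) can feed a scikit-learn estimator.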
If we don't want to make a code comparison, we can also rename the section to "Related software".
Feel free to trim the Gensim comparison as needed, or remove it if it doesn't fit in anymore.
I find the Gensim part relevant, although it could be trimmed to make some room for other projects like R.
For R, the main competing packages are, according to my subjective reading of the Kaggle forums:
There are probably others, but I am no R user myself.
> Comparing scikit-learn to R doesn't make sense to me. R is a language and statistical environment...
R is a language but also a very cohesive developer community. The CRAN package system makes it very easy to install and combine several machine learning projects.
To me R + most commonly used CRAN packages is the main competitor for scikit-learn.
I'm afraid we need an R (ex-)user...
(I never got used to R myself. I did a t-test in it once, made a plot, and decided I hate the language.)
@pprett might be able to help here.
Can we get away with renaming the section to "Related software"? (i.e., without a code comparison) The deadline is on Friday...
I guess so. We can move the note on Weka that is in section 2.1 to this section.
I have time to help in the coming days. However, I have no knowledge of R and too little of Weka.
I improved the related software section quite a bit.
Regarding the contributions in scikit-learn that made it to SciPy, I don't think the related software section is the right place for it. I guess we could add a paragraph dedicated to that to the conclusion. @jakevdp do you want to take a stab at it?
Without reading all of the above, I think a comparison with caret would make sense, though Weka is probably easier and also sensible ;) Oh, and don't forget shogun!
> oh and don't forget shogun!
We could state that scikit-learn has a competitive advantage over machine learning libraries written in statically typed languages, such as Shogun and Torch, because of the possibility of interactive development in the console, dynamic typing, and overall reduced development time (fewer lines of code, no or fewer segmentation faults).
+1. I think a more detailed comment would be good but I guess there is too little time
If you don't mind, I'll just copyedit the comparison section tomorrow and let the reviewers decide. I hardly know enough about other packages; I only used LBJ, looked at the Weka APIs and toyed with VW and some CRF command line tools.
Sorry - I've been busy with the Scipy conference and just saw this thread. It sounds like we're going to keep things as-is right now?
If you want, I think you can still make modifications during the review period.
No more major changes, please.
I was thinking we should take one of the code examples from the paper (say, the logistic regression one) and reimplement it using another toolkit (say, Weka, since it is one of the most famous).
WDYT?
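For reference, the scikit-learn side of such a comparison might look like the following sketch (hypothetical; the actual example in the paper may differ):

```python
# Hypothetical sketch of a logistic regression example for a toolkit
# comparison: the full load/split/train/evaluate cycle in a few lines.
# (Not the paper's actual snippet.)
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
accuracy = clf.score(X_test, y_test)
```

The equivalent Weka version would need a Java class, ARFF loading, and explicit evaluation boilerplate, which is presumably the contrast the comparison would draw.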