qiime2 / q2-feature-classifier

QIIME 2 plugin supporting taxonomic classification
BSD 3-Clause "New" or "Revised" License

possible to remove depth parameter? #2

Closed gregcaporaso closed 8 years ago

gregcaporaso commented 8 years ago

Usually these classifiers just make assignments that are as specific as they can.

BenKaehler commented 8 years ago

Check RDP and figure out what they do.

BenKaehler commented 8 years ago

@gregcaporaso @GavinHuttley, could you please look at this when you get a moment? I'm making some design decisions about the classifier.

My reading of Wang [1] is that they do everything on the genus level and use only classifications to genus level to train the classifier. Bootstrap confidence levels are calculated for higher orders by summing up the scores from lower orders.
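A rough sketch of that summing step, under the reading of [1] given above: each bootstrap trial votes for one genus, and the confidence of a higher rank is the share of trials whose genus falls under it. The function name, the lineage table, and the vote counts are all made up for illustration; this is not RDP's code.

```python
from collections import Counter


def rank_confidences(bootstrap_genera, lineage_of):
    """Sum genus-level bootstrap votes up each genus's lineage.

    bootstrap_genera: genus chosen in each bootstrap trial (hypothetical input).
    lineage_of: maps a genus to its lineage tuple, highest rank first.
    """
    n_trials = len(bootstrap_genera)
    votes = Counter()
    for genus in bootstrap_genera:
        lineage = lineage_of[genus]
        # Credit every ancestor of the genus, and the genus itself.
        for depth in range(1, len(lineage) + 1):
            votes[lineage[:depth]] += 1
    return {taxon: count / n_trials for taxon, count in votes.items()}


# Hypothetical three-rank lineages: (domain, phylum, genus).
lineage_of = {
    "Bacillus": ("Bacteria", "Firmicutes", "Bacillus"),
    "Clostridium": ("Bacteria", "Firmicutes", "Clostridium"),
    "Bacteroides": ("Bacteria", "Bacteroidetes", "Bacteroides"),
}
# Ten made-up bootstrap trials.
bootstrap_genera = ["Bacillus"] * 6 + ["Clostridium"] * 3 + ["Bacteroides"]

conf = rank_confidences(bootstrap_genera, lineage_of)
for taxon, value in sorted(conf.items()):
    print("; ".join(taxon), value)
```

With these votes the genus call is fairly uncertain, but the phylum above it collects the votes of both of its genera, so confidence can only increase toward the root.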

So we have the problem that most of our classifications don't go to genus level. Most stop somewhere above there. People have thought about how to do this properly [2,3], but who has the time?

In scikit-learn documentation terminology, what we have is a multioutput-multiclass classification problem (http://scikit-learn.org/stable/modules/multiclass.html). Some learners (Decision Trees, Random Forests, Nearest Neighbors) are inherently able to handle this class of problem. We could perhaps employ an algorithm like the one in [3] with any of the multiclass classifiers (Naive Bayes, LDA and QDA, Decision Trees, Random Forests, Nearest Neighbors, Multinomial Logistic Regression, SVC), but without the fancy penalty function. So we would run the labels through the MultiLabelBinarizer documented here (http://scikit-learn.org/stable/modules/multiclass.html) before training the classifier.
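One way the binarizer step might look, as a minimal sketch: each reference taxonomy is expanded into the set of its lineage prefixes, so a reference that stops at phylum simply switches on fewer columns than one that reaches class. The `lineage_prefixes` helper and the Greengenes-style strings are assumptions for illustration; only `MultiLabelBinarizer` is scikit-learn's.

```python
from sklearn.preprocessing import MultiLabelBinarizer


def lineage_prefixes(taxonomy):
    """Expand 'a; b; c' into ['a', 'a; b', 'a; b; c'] (hypothetical helper)."""
    ranks = [r.strip() for r in taxonomy.split(";") if r.strip()]
    return ["; ".join(ranks[:i]) for i in range(1, len(ranks) + 1)]


# Made-up reference labels of different depths.
taxonomies = [
    "k__Bacteria; p__Firmicutes; c__Bacilli",
    "k__Bacteria; p__Firmicutes",  # stops at phylum
    "k__Bacteria; p__Bacteroidetes; c__Bacteroidia",
]

mlb = MultiLabelBinarizer()
Y = mlb.fit_transform([lineage_prefixes(t) for t in taxonomies])

# One column per distinct lineage prefix, one row per reference sequence.
for name, column in zip(mlb.classes_, Y.T):
    print(column, name)
```

The resulting indicator matrix `Y` is the kind of target a multilabel-capable estimator could be trained on; whether that actually performs well for taxonomy is exactly the open question in this thread.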

So my current intention is to support the inherently multioutput-multiclass classifiers first (Decision Trees, Random Forests, Nearest Neighbors), then implement a hierarchical classifier that will work with any of the multiclass classifiers (Naive Bayes, LDA and QDA, Decision Trees, Random Forests, Nearest Neighbors, Multinomial Logistic Regression, SVC).

[1] Q. Wang, G. M. Garrity, J. M. Tiedje, and J. R. Cole. Naive Bayesian classifier for rapid assignment of rRNA sequences into the new bacterial taxonomy. Applied and Environmental Microbiology, 73(16):5261–5267, 2007.

[2] N. Nguyen. Improving hierarchical classification with partial labels. In ECAI, pages 315–320, 2010.

[3] N. Cesa-Bianchi, C. Gentile, and L. Zaniboni. Incremental algorithms for hierarchical classification. Journal of Machine Learning Research, 7(Jan):31–54, 2006.

gregcaporaso commented 8 years ago

Hey Ben, I think it's really important to support Naive Bayes and SVM in the first release of this. Could you take the RDP approach, but classify to the species level (rather than genus level) and treat the species-level taxonomy as the full taxonomy string (e.g., `k__bacteria; p__bacteroidetes; ... s__`)? That's going to be really important for comparison to other methods, and you're basically there already. Once you have that, we should work with you to get a release up on conda. Then, I think adding some of the multioutput-multiclass classifiers and assessing their accuracy could be really good. No one is using those for taxonomic classification as far as I know, so that'd be novel for a publication.
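A minimal sketch of that suggestion, assuming k-mer counts as features: the whole lineage string is used as a single flat class label, so any ordinary multiclass learner applies. The sequences, the lineage strings, and the choice of 8-mers are illustrative assumptions, not what q2-feature-classifier ships.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Toy reference sequences (made up) with full lineage strings as labels.
LINEAGE_A = "k__Bacteria; p__Firmicutes; c__Bacilli; g__Bacillus; s__"
LINEAGE_B = "k__Bacteria; p__Bacteroidetes; c__Bacteroidia; g__Bacteroides; s__"
reference_seqs = [
    "ACGTACGTACGTACGTACGT",
    "ACGTACGTACGTACGAACGT",
    "TTGGCCAATTGGCCAATTGG",
    "TTGGCCAATTGGCCTATTGG",
]
reference_taxa = [LINEAGE_A, LINEAGE_A, LINEAGE_B, LINEAGE_B]

# Count overlapping 8-mers, then fit a multinomial Naive Bayes model.
classifier = make_pipeline(
    CountVectorizer(analyzer="char", ngram_range=(8, 8), lowercase=False),
    MultinomialNB(),
)
classifier.fit(reference_seqs, reference_taxa)

query = ["ACGTACGTACGTACGTACGA"]
predicted = classifier.predict(query)[0]
print(predicted)
```

Swapping `MultinomialNB()` for an SVM such as `sklearn.svm.SVC` would leave the rest of the pipeline unchanged, which is part of the appeal of the flat-label framing.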

On Sun, Aug 7, 2016 at 11:45 PM, BenKaehler notifications@github.com wrote:

In scikit-learn documentation terminology, what we have is a multioutput-multiclass classification problem (http://scikit-learn.org/stable/modules/multiclass.html). Some learners (Decision Trees, Random Forests, Nearest Neighbors) handle this class of problem inherently. However, it is not immediately obvious how the other classifiers (Naive Bayes included) could be extended to this type of problem in the standard scikit-learn framework. We could perhaps employ an algorithm like the one in [3] with any of the multiclass classifiers (Naive Bayes, LDA and QDA, Decision Trees, Random Forests, Nearest Neighbors, Multinomial Logistic Regression, SVC). That would mean we classify the kingdom, then each kingdom has a classifier for the phylum, each phylum has a classifier for the class, and so on. This approach would have some problems with phylogenetically misclassified reference taxa.

BenKaehler commented 8 years ago

Thanks Greg, ok, I'll concentrate on getting the conda release done then revisit this issue. If I can get default parameters working easily I'll make it a default parameter (set to seven). Otherwise I'll just remove the code.

gregcaporaso commented 8 years ago

@BenKaehler, one other note: you're working with the Greengenes reference right now, which has seven levels, but that's not always the case (for example, some taxonomies will include sub-levels, like sub-class, sub-phylum, etc.). What if you just defaulted to always doing the deepest assignment possible with a given reference taxonomy? Then we could have a pipeline that filters the reference taxonomy if you want less specific assignments.
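The filtering idea could be sketched roughly as below: the classifier always assigns as deep as the reference allows, and a separate step truncates the reference lineages beforehand when shallower assignments are wanted. `truncate_taxonomy` and the example identifiers are hypothetical, not an existing QIIME 2 action.

```python
def truncate_taxonomy(reference, max_depth):
    """Keep at most max_depth ranks of each lineage (hypothetical helper).

    reference: maps a sequence ID to a ';'-separated lineage string.
    Lineages that are already shallower than max_depth are left unchanged.
    """
    truncated = {}
    for seq_id, lineage in reference.items():
        ranks = [r.strip() for r in lineage.split(";") if r.strip()]
        truncated[seq_id] = "; ".join(ranks[:max_depth])
    return truncated


# Made-up reference entries of different depths.
reference = {
    "seq1": "k__Bacteria; p__Firmicutes; c__Bacilli; o__Bacillales",
    "seq2": "k__Bacteria; p__Bacteroidetes",
}

phylum_level = truncate_taxonomy(reference, max_depth=2)
print(phylum_level)
```

Because the depth lives in the reference rather than in the classifier, the same training code works for taxonomies with seven levels or with extra sub-levels.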

BenKaehler commented 8 years ago

Ok, I'll just take it out.

BenKaehler commented 8 years ago

Done in commit e15278d5be25dfd9d2c641334de4a5d4320c3143