Two level classification: Species & Genus

peterjc commented 5 years ago

The optimal methods for classification at species or genus level are likely different. Intuitively we might want to allow one or two edits between a DB entry and a sample when calling species, but a far larger edit distance when calling at genus level.

Issues #101 and #102 look at adding sequences to the database where the exact species is untrusted, but the genus is fine. In particular, the motivation here is to include some sister genus level entries in the DB. Here we expect the existing classifier methods to label more of the previously unknown sequences as genus level (and occasionally downgrade a species specific prediction to genus level only).

It is my expectation that the strict classifiers (identity and onebp) will only classify a minority of the previously unknown sequences (this depends heavily on the DB coverage vs our environment sample diversity), while the fuzzy classifiers (blast, swarmid, swarm) will make many more genus level calls.

It would therefore seem practical to enhance the tool to support a two level classification, one set of methods aimed at species level, and another set aimed at genus level. These could be run in series (e.g. if no species is called, try for a genus), or perhaps in parallel (effectively as implemented now but with a results synthesis step).
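As a rough illustration of the "in series" option, here is a minimal sketch. The `Classifier` type, the function name `classify_two_level`, and the toy methods are all hypothetical and not part of the tool; it only assumes that each method returns a taxon name or nothing.

```python
from typing import Callable, Optional, Sequence, Tuple

# Hypothetical interface: a classifier maps a sequence to a taxon name,
# or returns None when it makes no call. This is NOT the tool's real API.
Classifier = Callable[[str], Optional[str]]


def classify_two_level(
    seq: str,
    species_methods: Sequence[Classifier],
    genus_methods: Sequence[Classifier],
) -> Tuple[str, str]:
    """Try the species level methods first, then fall back to genus level."""
    for method in species_methods:
        species = method(seq)
        if species:
            return ("species", species)
    # No species was called, so try the (fuzzier) genus level methods.
    for method in genus_methods:
        genus = method(seq)
        if genus:
            return ("genus", genus)
    return ("unknown", "")


# Toy stand-ins for a strict and a fuzzy method (made up for this example).
def strict(seq: str) -> Optional[str]:
    return {"ACGT": "Phytophthora alni"}.get(seq)


def fuzzy(seq: str) -> Optional[str]:
    return "Phytophthora" if seq.startswith("AC") else None


result_species = classify_two_level("ACGT", [strict], [fuzzy])
result_genus = classify_two_level("ACGA", [strict], [fuzzy])
result_unknown = classify_two_level("TTTT", [strict], [fuzzy])
```

The parallel option would instead run every method and merge the calls afterwards, which needs a rule for conflicts between levels.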

peterjc commented 5 years ago

This might be needed for the existing classifiers (except identity) if the DB contains Phytophthora species plus almost identical entries labelled as Phytophthora (genus only).

In this case, where the fuzzy matching might have assigned a species (e.g. matches Phytophthora alni with one base pair change), this could be demoted to a genus level match (e.g. matches a Phytophthora genus only entry perfectly).

Should be able to use our single isolate control plate to gauge how often this happens.
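The demotion described above could be sketched as follows. This assumes each DB hit is reduced to a hypothetical `(edit_distance, genus, species)` tuple, with an empty species for genus only entries; the name `call_with_demotion` is invented for illustration and the real classifiers may resolve ties differently.

```python
from typing import List, Tuple

# Hypothetical hit record: (edit distance, genus, species).
# species == "" marks a genus only DB entry.
Hit = Tuple[int, str, str]


def call_with_demotion(hits: List[Hit]) -> Tuple[str, str]:
    """Call from the closest hits only; a genus only hit demotes the call."""
    if not hits:
        return ("unknown", "")
    best = min(dist for dist, _, _ in hits)
    closest = [hit for hit in hits if hit[0] == best]
    if all(species for _, _, species in closest):
        names = sorted({f"{genus} {species}" for _, genus, species in closest})
        return ("species", ";".join(names))
    # At least one of the closest hits is genus only, so report genus only.
    genera = sorted({genus for _, genus, _ in closest})
    return ("genus", ";".join(genera))


# One base pair from P. alni, but a perfect match to a genus only entry:
demoted = call_with_demotion(
    [(1, "Phytophthora", "alni"), (0, "Phytophthora", "")]
)
# Without the genus only entry the species call stands:
kept = call_with_demotion([(1, "Phytophthora", "alni")])
```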

peterjc commented 5 years ago

It depends on various settings, but yes: there is one case on the single isolate controls with the onebp classifier, where a TP (Phytophthora fallax) becomes a FN (just Phytophthora):

$ for f in assess_sample_L5-*_onebp_v0.0.15.tsv; do echo $f; xsv table $f | head -n 2; done
assess_sample_L5-2019-01-01-v0.0.15_onebp_v0.0.15.tsv
#Species                         TP  FP  FN  TN    sensitivity  specificity  precision  F1    Hamming-loss  Ad-hoc-loss
OVERALL                          64  31  5   9767  0.93         1.00         0.67       0.78  0.0036        0.36
assess_sample_L5-and-1358-PnotP-2019-01-01-v0.0.15_onebp_v0.0.15.tsv
#Species                         TP  FP  FN  TN    sensitivity  specificity  precision  F1    Hamming-loss  Ad-hoc-loss
OVERALL                          64  31  5   9767  0.93         1.00         0.67       0.78  0.0036        0.36
assess_sample_L5-and-8336-Peronosporaceae-2019-01-01-v0.0.15_onebp_v0.0.15.tsv
#Species                         TP  FP  FN  TN    sensitivity  specificity  precision  F1    Hamming-loss  Ad-hoc-loss
OVERALL                          63  31  6   9767  0.91         1.00         0.67       0.77  0.0037        0.37
...

Likewise for blast, where one TP (Phytophthora fallax) becomes a FN (just Phytophthora). However, here the Phytophthora genus only entries are also having a positive effect by reducing the FP count, but that is due to #106 (we were accepting weak matches):

$ for f in assess_sample_L5-*_blast_v0.0.15.tsv; do echo $f; xsv table $f | head -n 2; done
assess_sample_L5-2019-01-01-v0.0.15_blast_v0.0.15.tsv
#Species                         TP  FP  FN  TN    sensitivity  specificity  precision  F1    Hamming-loss  Ad-hoc-loss
OVERALL                          64  62  5   9736  0.93         0.99         0.51       0.66  0.0068        0.51
assess_sample_L5-and-1358-PnotP-2019-01-01-v0.0.15_blast_v0.0.15.tsv
#Species                         TP  FP  FN  TN    sensitivity  specificity  precision  F1    Hamming-loss  Ad-hoc-loss
OVERALL                          64  62  5   9736  0.93         0.99         0.51       0.66  0.0068        0.51
assess_sample_L5-and-8336-Peronosporaceae-2019-01-01-v0.0.15_blast_v0.0.15.tsv
#Species                         TP  FP  FN  TN    sensitivity  specificity  precision  F1    Hamming-loss  Ad-hoc-loss
OVERALL                          63  31  6   9767  0.91         1.00         0.67       0.77  0.0037        0.37
...
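For reading the OVERALL rows above, here is a small sketch which recomputes the summary columns from the four counts. The sensitivity, specificity, precision and F1 formulas are the standard ones. The Hamming-loss and Ad-hoc-loss formulas are assumptions inferred from the numbers shown here, not taken from the tool's source, and the function name `summarise` is made up.

```python
from typing import Dict


def summarise(tp: int, fp: int, fn: int, tn: int) -> Dict[str, float]:
    """Recompute the summary columns from TP, FP, FN and TN counts."""
    return {
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "precision": tp / (tp + fp),
        "F1": 2 * tp / (2 * tp + fp + fn),
        # Assumed: fraction of all species/sample calls which are wrong.
        "Hamming-loss": (fp + fn) / (tp + fp + fn + tn),
        # Assumed: as Hamming loss, but ignoring the true negatives.
        "Ad-hoc-loss": (fp + fn) / (tp + fp + fn),
    }


# onebp OVERALL row before and after adding the Peronosporaceae entries:
before = summarise(tp=64, fp=31, fn=5, tn=9767)
after = summarise(tp=63, fp=31, fn=6, tn=9767)
```

With these definitions, losing one TP to a FN moves the sensitivity from 0.93 to 0.91 while the precision is unchanged at two decimal places, matching the tables.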

peterjc commented 5 years ago

Some of the output from the new edit-graph command added in #144 makes me think we can try a fuzzier genus level classifier (e.g. up to 2bp edit distance).
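A minimal sketch of what such a classifier might look like, assuming a species call needs a DB entry within 1bp and a genus call is allowed up to 2bp away. The DB layout, the thresholds as parameters, and the names `edit_distance` and `classify_fuzzy_genus` are hypothetical; the plain Levenshtein distance here stands in for whatever distance the tool actually uses.

```python
from typing import List, Tuple


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via the standard row by row dynamic programme."""
    prev = list(range(len(b) + 1))
    for i, char_a in enumerate(a, 1):
        cur = [i]
        for j, char_b in enumerate(b, 1):
            cur.append(
                min(
                    prev[j] + 1,  # deletion
                    cur[j - 1] + 1,  # insertion
                    prev[j - 1] + (char_a != char_b),  # substitution or match
                )
            )
        prev = cur
    return prev[-1]


# Hypothetical DB layout: (sequence, genus, species), species may be "".
Entry = Tuple[str, str, str]


def classify_fuzzy_genus(
    seq: str, db: List[Entry], species_max: int = 1, genus_max: int = 2
) -> Tuple[str, str]:
    """Species call within species_max edits, else genus within genus_max."""
    dists = [(edit_distance(seq, s), genus, sp) for s, genus, sp in db]
    species = sorted(
        {f"{genus} {sp}" for d, genus, sp in dists if d <= species_max and sp}
    )
    if species:
        return ("species", ";".join(species))
    genera = sorted({genus for d, genus, _ in dists if d <= genus_max})
    if genera:
        return ("genus", ";".join(genera))
    return ("unknown", "")


toy_db = [("ACGTACGT", "Phytophthora", "alni")]  # made-up sequence
one_bp = classify_fuzzy_genus("ACGTACGA", toy_db)
two_bp = classify_fuzzy_genus("ACGTACAA", toy_db)
far = classify_fuzzy_genus("GGGGGGGG", toy_db)
```

Keeping the two thresholds as separate parameters is one way to explore how many extra genus level calls each additional base pair of tolerance buys.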

peterjc commented 3 years ago

This was done with 1s3g in v0.7.3, and 1s2g, 1s4g and 1s5g in v0.7.4 - but this is all effectively a special case, not the generalisation I was pondering with this issue.

peterjc commented 8 months ago

Cross reference #597, might want a rank-specific classifier setting?

peterjc / thapbi-pict

Two level classification: Species & Genus #105