Open peterjc opened 5 years ago
This might be needed for the existing classifiers (except `identity`) if the DB contains Phytophthora species plus almost identical entries labelled as Phytophthora (genus only).
In this case, where the fuzzy matching might have assigned a species (e.g. matches Phytophthora alni with one base pair change), this could be demoted to a genus level match (e.g. matches a Phytophthora genus only entry perfectly).
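A minimal sketch of that demotion rule, assuming each DB hit is available as a (taxon label, edit distance) pair. The `Hit` layout and the `call_with_demotion` name are hypothetical, for illustration only, and are not the tool's API. Ties between a species entry and a genus-only entry at the same distance are left unresolved here and simply reported together.

```python
from typing import List, Optional, Tuple

# Hypothetical hit layout: (taxon label from the DB, edit distance to the query)
Hit = Tuple[str, int]


def call_with_demotion(hits: List[Hit]) -> Optional[str]:
    """Return the label(s) of the closest DB hits, or None if there are none.

    A perfect match to a genus-only entry beats a one-edit match to a
    species entry, so the species call is demoted to genus level.
    """
    if not hits:
        return None
    best = min(dist for _, dist in hits)
    # Keep every label found at the best distance (ties joined with ';').
    labels = sorted({label for label, dist in hits if dist == best})
    return ";".join(labels)


hits = [("Phytophthora alni", 1), ("Phytophthora", 0)]
print(call_with_demotion(hits))
```

With only the one-edit species hit present, the same function would still return the species, so the demotion only happens when the genus-only entry is strictly closer.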
Should be able to use our single isolate control plate to gauge how often this happens.
This depends on various settings, but yes: there is one case on the single isolate controls with the `onebp` classifier, where a TP (Phytophthora fallax) becomes a FN (just Phytophthora):
$ for f in assess_sample_L5-*_onebp_v0.0.15.tsv; do echo $f; xsv table $f | head -n 2; done
assess_sample_L5-2019-01-01-v0.0.15_onebp_v0.0.15.tsv
#Species TP FP FN TN sensitivity specificity precision F1 Hamming-loss Ad-hoc-loss
OVERALL 64 31 5 9767 0.93 1.00 0.67 0.78 0.0036 0.36
assess_sample_L5-and-1358-PnotP-2019-01-01-v0.0.15_onebp_v0.0.15.tsv
#Species TP FP FN TN sensitivity specificity precision F1 Hamming-loss Ad-hoc-loss
OVERALL 64 31 5 9767 0.93 1.00 0.67 0.78 0.0036 0.36
assess_sample_L5-and-8336-Peronosporaceae-2019-01-01-v0.0.15_onebp_v0.0.15.tsv
#Species TP FP FN TN sensitivity specificity precision F1 Hamming-loss Ad-hoc-loss
OVERALL 63 31 6 9767 0.91 1.00 0.67 0.77 0.0037 0.37
...
Likewise for `blast`, where one TP (Phytophthora fallax) becomes a FN (just Phytophthora). However, here the Phytophthora genus only entries also have a positive effect by reducing the FP count, but that's due to #106 (we were accepting weak matches):
$ for f in assess_sample_L5-*_blast_v0.0.15.tsv; do echo $f; xsv table $f | head -n 2; done
assess_sample_L5-2019-01-01-v0.0.15_blast_v0.0.15.tsv
#Species TP FP FN TN sensitivity specificity precision F1 Hamming-loss Ad-hoc-loss
OVERALL 64 62 5 9736 0.93 0.99 0.51 0.66 0.0068 0.51
assess_sample_L5-and-1358-PnotP-2019-01-01-v0.0.15_blast_v0.0.15.tsv
#Species TP FP FN TN sensitivity specificity precision F1 Hamming-loss Ad-hoc-loss
OVERALL 64 62 5 9736 0.93 0.99 0.51 0.66 0.0068 0.51
assess_sample_L5-and-8336-Peronosporaceae-2019-01-01-v0.0.15_blast_v0.0.15.tsv
#Species TP FP FN TN sensitivity specificity precision F1 Hamming-loss Ad-hoc-loss
OVERALL 63 31 6 9767 0.91 1.00 0.67 0.77 0.0037 0.37
...
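For anyone reading the OVERALL rows above, here is a small sketch of how the summary columns relate to the TP/FP/FN/TN counts. It assumes the usual textbook definitions; the Hamming loss and ad-hoc loss formulas in particular are my reading of the numbers shown (they reproduce the first `onebp` row once rounded), not something confirmed against the tool's source. The `assess` function name is hypothetical.

```python
def assess(tp: int, fp: int, fn: int, tn: int) -> dict:
    """Summary metrics from a confusion matrix (no zero-division guards)."""
    total = tp + fp + fn + tn
    return {
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "precision": tp / (tp + fp),
        "F1": 2 * tp / (2 * tp + fp + fn),
        # Assumed: fraction of all calls which are wrong.
        "Hamming-loss": (fp + fn) / total,
        # Assumed: as Hamming loss, but ignoring the true negatives.
        "Ad-hoc-loss": (fp + fn) / (tp + fp + fn),
    }


# Counts from the first onebp OVERALL row above.
for name, value in assess(tp=64, fp=31, fn=5, tn=9767).items():
    print(f"{name}\t{value:.4f}")
```

Because the TN count dominates, specificity and Hamming loss barely move between runs; the TP to FN change is easier to see in the sensitivity and ad-hoc loss columns.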
Some of the output from the new `edit-graph` command added in #144 makes me think we can try a more fuzzy genus level classifier (e.g. up to 2bp edit distance).
This was done with `1s3g` in v0.7.3, and `1s2g`, `1s4g` and `1s5g` in v0.7.4 - but this is all effectively a special case, not the generalisation I was pondering with this issue.
Cross reference #597, might want a rank-specific classifier setting?
The optimal methods for classification at species or genus level are likely different. Intuitively we might want to allow one or two edits between a DB entry and a sample when calling species, but a far larger edit distance when calling at genus level.
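As a rough illustration of rank-specific thresholds, the sketch below uses a plain Levenshtein distance and a hypothetical `MAX_EDITS` table. The threshold values (1 for species, 3 for genus), the `db` layout (sequence mapped to a "Genus species" or "Genus" label) and the function names are all assumptions for the example. It only looks at the single closest DB entry and ignores ties.

```python
from typing import Dict, Optional


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via the standard row-by-row dynamic programme."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (ca != cb)))  # substitution
        prev = curr
    return prev[-1]


# Illustrative thresholds only: strict for species, looser for genus.
MAX_EDITS = {"species": 1, "genus": 3}


def classify(query: str, db: Dict[str, str]) -> Optional[str]:
    """Call species if very close, genus if moderately close, else None."""
    if not db:
        return None
    dist, label = min((edit_distance(query, seq), label)
                      for seq, label in db.items())
    if dist <= MAX_EDITS["species"]:
        return label
    if dist <= MAX_EDITS["genus"]:
        return label.split()[0]  # keep only the genus word
    return None


db = {"ACGTACGTAC": "Phytophthora fallax"}
print(classify("ACGTACGTAA", db))  # one edit away
print(classify("ACGTACGGGG", db))  # three edits away
```

A real implementation would not compare against every DB entry like this, but the two-threshold decision is the part relevant here.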
Issue #101 / #102 looks at adding sequences to the database where the exact species is untrusted, but the genus is fine. In particular, the motivation here is to include some sister genus level entries in the DB. Here we expect the existing classifier methods to label more of the previously unknown sequences as genus level (and occasionally downgrade a species specific prediction to genus level only).
It is my expectation that the strict classifiers (`identity` and `onebp`) will only classify a minority of the previously unknown sequences (this depends heavily on the DB coverage vs our environment sample diversity), while the fuzzy classifiers (`blast`, `swarmid`, `swarm`) will make many more genus level calls.
It would therefore seem practical to enhance the tool to support a two level classification, one set of methods aimed at species level, and another set aimed at genus level. These could be run in series (e.g. if no species is called, try for a genus), or perhaps in parallel (effectively as implemented now but with a results synthesis step).
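To make the series vs parallel distinction concrete, here is a hedged sketch where each method is just a callable taking a sequence and returning a label or `None`. The `Classifier` type, both function names and the synthesis rule (keep a species call only if it agrees with the genus call) are assumptions for illustration, not a proposal for the actual interface.

```python
from typing import Callable, Iterable, Optional

# Hypothetical interface: a method maps a sequence to a label, or None.
Classifier = Callable[[str], Optional[str]]


def first_call(seq: str, methods: Iterable[Classifier]) -> Optional[str]:
    """Return the first non-None call from the given methods."""
    for method in methods:
        result = method(seq)
        if result is not None:
            return result
    return None


def classify_in_series(seq, species_methods, genus_methods):
    """Only try the genus level methods if no species was called."""
    species = first_call(seq, species_methods)
    if species is not None:
        return species
    return first_call(seq, genus_methods)


def classify_in_parallel(seq, species_methods, genus_methods):
    """Run both levels, then synthesise (assumed rule: genus wins conflicts)."""
    species = first_call(seq, species_methods)
    genus = first_call(seq, genus_methods)
    if species is not None and (genus is None or species.split()[0] == genus):
        return species
    return genus


# Toy stand-ins for real classifiers.
strict = lambda seq: "Phytophthora fallax" if seq == "ACGT" else None
fuzzy = lambda seq: "Phytophthora" if seq.startswith("AC") else None

print(classify_in_series("ACGT", [strict], [fuzzy]))
print(classify_in_series("ACGG", [strict], [fuzzy]))
```

The series form never runs the genus methods when a species is found, so it cannot demote a species call; the parallel form can, at the cost of needing an explicit rule for disagreements.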