Closed wbazant closed 2 years ago
There needs to be a rule about match length.
Have a look at a quasi-randomly selected SRR5405746 from NICU NEC, compared to SRR5405716 which I selected because it has see lots of independently reported Purpureocillium lilacinum:
# random file from NICUNEC
grep -v @ /project/eupathdblab/workflows/MicrobiomeDB/5/data/otuDADA2_NICUNEC_RSRC/Eukdetect_NICUNEC/work/9c/af3f78d2c115c7113190db74b222bd/alignmentsSingle.sam | cut -f 6 | sort | uniq -c | sort -nr | head
882 20M
327 21M
98 22M
36 23M
7 24M
3 29M
3 25M
1 5M1I25M
1 4M3I23M
1 4M1I36M
# file from NICUNEC where lots of P. lilacinum was reported
[wbazant@node155 NICUNEC]$ grep -v @ /project/eupathdblab/workflows/MicrobiomeDB/5/data/otuDADA2_NICUNEC_RSRC/Eukdetect_NICUNEC/work/cf/d6105c5a2cb3924b39fcafc00e4837/alignmentsSingle.sam | cut -f 6 | sort | uniq -c | sort -nr | head
1930 150M
137 20M
64 21M
24 22M
22 125M21I4M
12 44M6D106M
11 23M
10 4M12I134M
9 84M6I60M
9 7M1D143M
Just to demonstrate, add "grep lilacinum" and look at that part of the signal in the alignment:
[wbazant@node155 NICUNEC]$ grep -v @ /project/eupathdblab/workflows/MicrobiomeDB/5/data/otuDADA2_NICUNEC_RSRC/Eukdetect_NICUNEC/work/9c/af3f78d2c115c7113190db74b222bd/alignmentsSingle.sam | grep lilacinum | cut -f 6 | sort | uniq -c | sort -nr | head
2 20M
1 23M
[wbazant@node155 NICUNEC]$ grep -v @ /project/eupathdblab/workflows/MicrobiomeDB/5/data/otuDADA2_NICUNEC_RSRC/Eukdetect_NICUNEC/work/9c/af3f78d2c115c7113190db74b222bd/alignmentsSingle.sam | grep lilacinum | cut -f 6 | sort | uniq -c | sort -nr | head
[wbazant@node155 NICUNEC]$ grep -v @ /project/eupathdblab/workflows/MicrobiomeDB/5/data/otuDADA2_NICUNEC_RSRC/Eukdetect_NICUNEC/work/cf/d6105c5a2cb3924b39fcafc00e4837/alignmentsSingle.sam | grep lilacinum | cut -f 6 | sort | uniq -c | sort -nr | head
1805 150M
22 125M21I4M
10 4M12I134M
9 84M6I60M
9 7M1D143M
8 137M9I4M
7 133M13I4M
6 139M7I4M
5 5M5I140M
5 4M4I142M
"20M" is twenty matched bases, and I guess it's not long enough.
I don't think that filtering on quality scores helps - the "20M" matches score all over:
grep -v @ /project/eupathdblab/workflows/MicrobiomeDB/5/data/otuDADA2_NICUNEC_RSRC/Eukdetect_NICUNEC/work/9c/af3f78d2c115c7113190db74b222bd/alignmentsSingle.sam | cut -f 5,6 | grep 20M | sort | uniq -c
80 0 20M
65 1 20M
14 2 20M
220 23 20M
191 24 20M
6 31 20M
3 32 20M
5 34 20M
248 40 20M
10 42 20M
24 6 20M
16 7 20M
"four reads and two markers" is the EukDetect rule for including a taxon.
I've found that the "four reads" part drops signal and doesn't help with noise, the rule effectively becomes "four markers or more" because usually a marker only gets one read.
I think two markers is a good cut-off, although sometimes it passes noise through.
In cases where there are bad taxa with matches on two markers or more, increasing the cut-off doesn't help. I had a go at modelling the matches to markers under the assumption that they happen randomly and picking a cut-off at a level where we no longer expect matches, mostly to discover that the noise pattern is more complicated than that. In datasets with a lot of noise like NICU NEC, even requiring twenty markers per taxon is not giving good enough results.
I think bad matches to markers are caused by bacteria in the sample aligning to bad marker sequences. A bit of a pattern emerges at a dataset level when the matches are not spread evenly over markers, for example, Stygamoeba regulata in DIABIMMUNE_WGS has matches over four markers, with multiplicity of 1, 1, 16, 47 - and the other 112 markers in the reference are not matched at all.
I've used a dataset level cutoff - keep a taxon if it's present in three or more samples - possibly quite successful!