Update filtering rules - Githubissues

"four reads and two markers" is the EukDetect rule for including a taxon.

I've found that the "four reads" part drops signal and doesn't help with noise, the rule effectively becomes "four markers or more" because usually a marker only gets one read.

I think two markers is a good cut-off, although sometimes it passes noise through.

In cases where there are bad taxa with matches on two markers or more, increasing the cut-off doesn't help. I had a go at modelling the matches to markers under the assumption that they happen randomly and picking a cut-off at a level where we no longer expect matches, mostly to discover that the noise pattern is more complicated than that. In datasets with a lot of noise like NICU NEC, even requiring twenty markers per taxon is not giving good enough results.

I think bad matches to markers are caused by bacteria in the sample aligning to bad marker sequences. A bit of a pattern emerges at a dataset level when the matches are not spread evenly over markers, for example, Stygamoeba regulata in DIABIMMUNE_WGS has matches over four markers, with multiplicity of 1, 1, 16, 47 - and the other 112 markers in the reference are not matched at all.

I've used a dataset level cutoff - keep a taxon if it's present in three or more samples - possibly quite successful!

There needs to be a rule about match length.

Have a look at a quasi-randomly selected SRR5405746 from NICU NEC, compared to SRR5405716 which I selected because it has see lots of independently reported Purpureocillium lilacinum:

# random file from NICUNEC
grep -v @ /project/eupathdblab/workflows/MicrobiomeDB/5/data/otuDADA2_NICUNEC_RSRC/Eukdetect_NICUNEC/work/9c/af3f78d2c115c7113190db74b222bd/alignmentsSingle.sam | cut -f 6 | sort | uniq -c  | sort -nr | head 
    882 20M
    327 21M
     98 22M
     36 23M
      7 24M
      3 29M
      3 25M
      1 5M1I25M
      1 4M3I23M
      1 4M1I36M

# file from NICUNEC where lots of P. lilacinum was reported
[wbazant@node155 NICUNEC]$ grep -v @ /project/eupathdblab/workflows/MicrobiomeDB/5/data/otuDADA2_NICUNEC_RSRC/Eukdetect_NICUNEC/work/cf/d6105c5a2cb3924b39fcafc00e4837/alignmentsSingle.sam  | cut -f 6 | sort | uniq -c  | sort -nr | head
   1930 150M
    137 20M
     64 21M
     24 22M
     22 125M21I4M
     12 44M6D106M
     11 23M
     10 4M12I134M
      9 84M6I60M
      9 7M1D143M

Just to demonstrate, add "grep lilacinum" and look at that part of the signal in the alignment:

[wbazant@node155 NICUNEC]$ grep -v @ /project/eupathdblab/workflows/MicrobiomeDB/5/data/otuDADA2_NICUNEC_RSRC/Eukdetect_NICUNEC/work/9c/af3f78d2c115c7113190db74b222bd/alignmentsSingle.sam | grep lilacinum | cut -f 6 | sort | uniq -c  | sort -nr | head 
      2 20M
      1 23M
[wbazant@node155 NICUNEC]$ grep -v @ /project/eupathdblab/workflows/MicrobiomeDB/5/data/otuDADA2_NICUNEC_RSRC/Eukdetect_NICUNEC/work/9c/af3f78d2c115c7113190db74b222bd/alignmentsSingle.sam | grep lilacinum | cut -f 6 | sort | uniq -c  | sort -nr | head   
[wbazant@node155 NICUNEC]$ grep -v @ /project/eupathdblab/workflows/MicrobiomeDB/5/data/otuDADA2_NICUNEC_RSRC/Eukdetect_NICUNEC/work/cf/d6105c5a2cb3924b39fcafc00e4837/alignmentsSingle.sam | grep lilacinum | cut -f 6 | sort | uniq -c  | sort -nr | head 
   1805 150M
     22 125M21I4M
     10 4M12I134M
      9 84M6I60M
      9 7M1D143M
      8 137M9I4M
      7 133M13I4M
      6 139M7I4M
      5 5M5I140M
      5 4M4I142M

"20M" is twenty matched bases, and I guess it's not long enough.

I don't think that filtering on quality scores helps - the "20M" matches score all over:

 grep -v @ /project/eupathdblab/workflows/MicrobiomeDB/5/data/otuDADA2_NICUNEC_RSRC/Eukdetect_NICUNEC/work/9c/af3f78d2c115c7113190db74b222bd/alignmentsSingle.sam  | cut -f 5,6 | grep 20M | sort | uniq -c 
     80 0       20M
     65 1       20M
     14 2       20M
    220 23      20M
    191 24      20M
      6 31      20M
      3 32      20M
      5 34      20M
    248 40      20M
     10 42      20M
     24 6       20M
     16 7       20M

wbazant / CORRAL

Update filtering rules #1