Open eboyden opened 3 years ago
Thanks for the question. This is an interesting topic that requires two separate answers, for the two modes of NGmerge:
Raising the fraction of mismatches allowed (-p) causes increased false positives -- that is, placing reads in an incorrect overlapping alignment. This occurs, for example, with reads derived from genomes with numerous pseudo-repetitive regions. In such cases, longer overlaps should not necessarily be more tolerant of errors, and what you suggest would worsen the situation.
-c <file>. As stated in the description of -c <file>:
If the sequences that appear in the 'Adapter' columns are not consistent, they may be false positives, and one should consider decreasing -p or increasing -e.
In any case, thanks for the response and the software. I understand that implementing feature requests is time-consuming and not always a high priority - just letting you know there's interest if you (or anyone) were inclined.
Hi, I'm a big fan of this software but was wondering if it might make sense to provide the option to threshold based on a false positive rate instead of an error rate (similar to what SeqPurge does using the binomial distribution calculation), since longer overlaps should be more tolerant of higher error rates. We've found that we obtain the best performance when piping multiple instances of NGmerge to roughly simulate this effect; e.g. to simulate a 1E-6 FP threshold, we allow 8% errors for overlaps of 10-14 bp, 17% errors for overlaps of 15-19 bp, and 23% errors for overlaps of 20+ bp. But obviously this is still overly stringent for longer overlaps, not to mention time-consuming.
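For anyone wanting to see where such per-length cutoffs come from, here is a minimal sketch of the binomial calculation. It is not an NGmerge option and not necessarily SeqPurge's exact implementation. It assumes each overlap position matches by chance with probability 0.25 (uniform, independent bases), and the function names `chance_overlap_prob` and `max_mismatches` are made up for illustration.

```python
from math import comb


def chance_overlap_prob(overlap_len, mismatches, p_match=0.25):
    """Probability that a random overlap of `overlap_len` bases shows
    at most `mismatches` mismatches (binomial lower tail).

    p_match = 0.25 is an assumption: uniform, independent bases.
    """
    p_mis = 1.0 - p_match
    return sum(
        comb(overlap_len, k) * p_mis**k * p_match ** (overlap_len - k)
        for k in range(mismatches + 1)
    )


def max_mismatches(overlap_len, fp_threshold=1e-6, p_match=0.25):
    """Largest mismatch count whose chance probability stays below
    `fp_threshold`; returns -1 if even a perfect match is too likely."""
    allowed = -1
    for k in range(overlap_len + 1):
        # The tail probability grows with k, so stop at the first failure.
        if chance_overlap_prob(overlap_len, k, p_match) < fp_threshold:
            allowed = k
        else:
            break
    return allowed


# Print the mismatch budget for a few overlap lengths.
for n in (10, 15, 20, 50, 100):
    k = max_mismatches(n)
    print(n, k, f"{k / n:.0%}" if k >= 0 else "reject")
```

Under this model the permitted mismatch fraction keeps rising with overlap length, which is the effect the fixed per-bin percentages above only approximate. The sketch ignores that several candidate overlap offsets are tested per read pair, and it does nothing about repeat-derived false positives, where matches are not independent of one another.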