feature request: use false positive rate instead of error rate?

jsh58 / NGmerge

Merging paired-end reads and removing adapters

MIT License


feature request: use false positive rate instead of error rate? #17

Open eboyden opened 3 years ago

eboyden commented 3 years ago

Hi, I'm a big fan of this software, but I was wondering if it might make sense to provide the option to threshold on a false positive rate instead of an error rate (similar to what SeqPurge does using a binomial distribution calculation), since longer overlaps should be more tolerant of higher error rates. We've found that we obtain the best performance when piping multiple instances of NGmerge to roughly simulate this effect; e.g., to simulate a 1E-6 FP threshold, we allow 8% errors for overlaps of 10-14 bp, 17% errors for overlaps of 15-19 bp, and 23% errors for overlaps of 20+ bp. But this is still overly stringent for longer overlaps, not to mention time-consuming.
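To make the idea concrete, here is a rough sketch of the kind of binomial calculation meant above. It is not NGmerge or SeqPurge code. It assumes that each aligned position of two unrelated sequences matches by chance with probability 0.25 (uniform, independent bases), and it ignores the fact that many candidate overlap offsets are tested per read pair. The function names `chance_overlap_prob` and `max_mismatches` are made up for illustration.

```python
from math import comb


def chance_overlap_prob(n, k, p_match=0.25):
    """Probability that two unrelated sequences, aligned over n bases,
    show at most k mismatches purely by chance.

    Assumption: each position matches independently with probability
    p_match, so the mismatch count is Binomial(n, 1 - p_match).
    """
    q = 1.0 - p_match  # per-position mismatch probability
    return sum(comb(n, i) * q**i * p_match ** (n - i) for i in range(k + 1))


def max_mismatches(n, fp_threshold=1e-6, p_match=0.25):
    """Largest mismatch count k whose chance probability stays at or
    below fp_threshold for an overlap of n bases.

    Returns -1 if even a perfect match of length n is too likely by chance.
    """
    best = -1
    for k in range(n + 1):
        # The cumulative probability only grows with k, so stop at the
        # first k that exceeds the threshold.
        if chance_overlap_prob(n, k, p_match) <= fp_threshold:
            best = k
        else:
            break
    return best


if __name__ == "__main__":
    for n in (10, 15, 20, 50, 100):
        k = max_mismatches(n)
        print(f"overlap {n:3d} bp: up to {k} mismatches ({k / n:.0%})")
```

Under these assumptions the sketch allows 0, 2, and 4 mismatches at overlaps of 10, 15, and 20 bp for a 1E-6 threshold, and the tolerated fraction keeps rising for longer overlaps, which a fixed error rate cannot capture.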

jsh58 commented 3 years ago

Thanks for the question. This is an interesting topic that requires two separate answers, for the two modes of NGmerge:

In stitch mode, I have found that relaxing the allowed errors (increasing -p) causes increased false positives -- that is, placing reads in an incorrect overlapping alignment. This occurs, for example, with reads derived from genomes with numerous pseudo-repetitive regions. In such cases, longer overlaps should not necessarily be more tolerant of errors, and what you suggest would worsen the situation.
In adapter-removal mode, there is an additional check that can be made: the putative adapter sequences can be examined via the -c <file> option. As stated in the description of -c <file>:

If the sequences that appear in the 'Adapter' columns are not consistent, they may be false positives, and one should consider decreasing -p or increasing -e.

eboyden commented 3 years ago

To your first point, the risk really depends on the dataset and how one is using NGmerge. For example, not only do we use it to trim dovetails of otherwise good read pairs, we also sometimes use it in stitch mode with impossibly high -m but low -e to stitch and remove dovetailed reads, allowing only unstitched (undovetailed) reads to pass forward. In this case, we're willing to tolerate a slightly higher FP stitching rate if it means cleaner data. But being able to tune the FP rate directly (with an error rate that automatically adjusts as a function of overlap length) would be preferable to only being able to tune the error rate and minimum overlap.
To your second point, this only works when the "adapters" are consistent, e.g. sequencing adapters for a shotgun library. For some types of amplicon sequencing, when the 5' primer sequences have already been removed from the reads, the 3' dovetails will be the reverse complements of those primer sequences, and therefore they will be inconsistent by design.

In any case, thanks for the response and the software. I understand that implementing feature requests is time-consuming and not always a high priority; I'm just letting you know there's interest if you (or anyone) were inclined.