rrwick / Porechop

adapter trimmer for Oxford Nanopore reads
GNU General Public License v3.0

definition of "end" vs "middle" adapters #31

Closed jvolkening closed 6 years ago

jvolkening commented 6 years ago

Hi Ryan,

I'm processing a MinION dataset produced from an SQK-LSK108 kit and attempting to use Porechop to demux and trim adapters. It seems to be doing a reasonable job of identifying the adapters/barcodes based on a visual inspection of verbose output (nice highlighting), although I think I will need to tweak adapters.py slightly to match the adapter sequences listed in the albacore source. However, ~85% of reads are being discarded "based on middle adapters". These are predominantly "short" reads as far as nanopore reads go because they come from an RT-PCR amplicon, with a median length of ~800 bp.

The verbose output shows that for most reads, adapters are identified a short distance from, but not exactly at, the ends. However, these seem to be considered middle adapters. Is there a threshold defined somewhere that controls what is considered an "end" vs "middle" adapter? If so, is this threshold/distance absolute or relative to read length?

jvolkening commented 6 years ago

User error... I misinterpreted the use of min_trim_size, even though the docs are pretty clear.

rrwick commented 6 years ago

Even though you closed this issue, I think there is an underlying issue I probably need to address.

Porechop does not have a threshold for what counts as 'far enough in the middle'. Rather, 'middle' is simply everything that's left over after the ends are trimmed off. This can be an issue because I've seen cases where a read has multiple adapters on one end, maybe because the same adapter got ligated on twice.

So if a read has two copies of the same barcode on the end, the following may happen:
1) the first adapter is trimmed off as a normal end adapter
2) the second adapter is treated as a middle adapter and the read is discarded
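
The scenario above can be reproduced with a toy model. This is only a sketch and not Porechop's code: it uses exact string matching in place of alignment, and `ADAPTER`, `END_SIZE`, `trim_start_once` and `classify_single_pass` are hypothetical names and values made up for illustration.

```python
ADAPTER = "ACGTACGT"  # hypothetical adapter, not a real ONT sequence
END_SIZE = 12         # hypothetical size of the window searched at the read start


def trim_start_once(read, adapter=ADAPTER, end_size=END_SIZE):
    """Remove everything up to and including the first adapter hit in the start window."""
    pos = read[:end_size].find(adapter)
    if pos == -1:
        return read  # no adapter found near the start, nothing to trim
    return read[pos + len(adapter):]


def classify_single_pass(read, adapter=ADAPTER):
    """Trim the start once, then treat any adapter still present as a 'middle' adapter."""
    trimmed = trim_start_once(read, adapter)
    if adapter in trimmed:
        return "discarded", trimmed
    return "kept", trimmed


insert = "TTTTGGGGCCCCAAAATTTTGGGG"        # stand-in for the amplicon sequence
single = ADAPTER + insert                  # one adapter copy at the start
double = ADAPTER + ADAPTER + insert        # the same adapter ligated on twice

print(classify_single_pass(single)[0])
print(classify_single_pass(double)[0])
```

In this toy model the read with one adapter copy is kept, while the read with two copies is discarded even though both copies sit right at the start.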

Perhaps to address this, Porechop should do multiple passes of end-adapter trimming until no more sequence is removed. That would avoid this problem but would be a bit slower. I'll leave this issue closed and make some notes for myself to rethink this one later. Thanks!
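
A minimal sketch of what such repeated end trimming might look like, assuming exact string matching stands in for adapter alignment. This is not how Porechop is implemented; the function names, the `max_passes` cap and the example sequences are all hypothetical.

```python
def trim_start_once(read, adapter, end_size):
    """Remove everything up to and including the first adapter hit in the start window."""
    pos = read[:end_size].find(adapter)
    return read if pos == -1 else read[pos + len(adapter):]


def trim_start_until_stable(read, adapter, end_size, max_passes=10):
    """Repeat start trimming until a pass removes nothing (or max_passes is reached).

    Returns the trimmed read and the number of passes that removed sequence.
    """
    passes = 0
    while passes < max_passes:
        trimmed = trim_start_once(read, adapter, end_size)
        if trimmed == read:
            break  # nothing removed on this pass, so the end is clean
        read = trimmed
        passes += 1
    return read, passes


adapter = "ACGTACGT"                      # hypothetical adapter sequence
insert = "TTTTGGGGCCCCAAAATTTTGGGG"        # stand-in for the amplicon sequence
double = adapter + adapter + insert        # adapter ligated on twice

trimmed, passes = trim_start_until_stable(double, adapter, end_size=12)
print(trimmed == insert, passes)
```

The extra cost is one additional search per read in the common case (the pass that finds nothing), plus one more for each stacked adapter. The middle-adapter check would then run only on what remains after the loop.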