N-prefixes / wildcards before an adapter sequence leads to shortening *all* reads at the 3'-end ?

plijnzaad commented 4 years ago

I have been playing around with N{number} prefixes before an adapter in order to get rid of the adapter and, say, (up to) 14 nucleotides before it. However, it turns out that e.g. using -a 'N{6}GATCGTCGGACTGTAGAACTCTGAAC' (and also the equivalent NNNNNNGATCGTCGGACTGTAGAACTCTGAAC) shortens every read, even those that do not contain any adapter whatsoever, by the length of the N-prefix?! I.e. the sequence ^ATATGCGC$ gets shortened to ^ATAT$ using adapter 'NNNNTGCA', more or less as if the adapter is 'slid backwards' over the read until it mismatches, and then the cut is made. The incantation (atropos version 2.0.0a5.post20200601, python 3.6.1) I used was

atropos -a strange=NNNNTGCA \
       --progress none \
       --report-file short-atroposreport.yaml \
       --single-input shortcase.sam \
       --output-format sam \
       --log-level INFO \
       > shortcase-trimmed.sam 2> shortcase.log

The input and result files are attached (all renamed to *txt because GitHub won't allow me otherwise).

Is this the way it is supposed to work or am I doing something wrong? I think this used not to be the behaviour.

(Incidentally, it would be really useful if the N{k} syntax would be Perl-regexp-like, so that you can supply a range of lengths for the wild-card region.)

jdidion commented 4 years ago

Thanks for this reproducible report. That is not the intended behavior.

jdidion commented 4 years ago

Ah, I see the issue - you are using -a, which is for matching adapters at the end of the read. Try using -g instead.

There is also the concept of non-internal adapters (https://cutadapt.readthedocs.io/en/stable/guide.html#non-internal-5-and-3-adapters). I will open an issue to port this behavior over from Cutadapt.
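For anyone reading along, here is a toy model of why an N-prefixed 3' adapter could clip reads that contain no adapter at all. It is only a sketch under stated assumptions, not the actual atropos alignment code: it assumes that N matches any base, that no mismatches are allowed, and that a 3' adapter may match partially when it runs off the end of the read. The names `matches` and `trim_3prime` are made up for this illustration.

```python
def matches(adapter_part: str, read_part: str) -> bool:
    """Compare two equal-length strings, treating 'N' in the adapter as a wildcard."""
    return all(a == "N" or a == r for a, r in zip(adapter_part, read_part))


def trim_3prime(read: str, adapter: str, min_overlap: int = 1) -> str:
    """Toy 3' trimming: cut at the leftmost position where the adapter matches.

    The adapter may hang over the end of the read, in which case only its
    prefix has to match (a partial occurrence). This is a simplification
    (assumption): real trimmers use error-tolerant alignment and scoring.
    """
    for start in range(len(read)):
        overlap = min(len(adapter), len(read) - start)
        if overlap < min_overlap:
            break
        if matches(adapter[:overlap], read[start:start + overlap]):
            return read[:start]
    return read


# A read without any adapter still loses 4 bases: the NNNN prefix alone
# counts as a partial adapter occurrence at the very end of the read.
print(trim_3prime("ATATGCGC", "NNNNTGCA"))

# Without the N prefix the same read is left untouched.
print(trim_3prime("ATATGCGC", "TGCA"))
```

In this model the wildcard prefix always finds a partial match against the last bases of the read, which reproduces the ^ATATGCGC$ to ^ATAT$ example from the report. Anchoring the adapter so that it may not match as a bare partial prefix (the non-internal adapter idea linked above) would be one way to avoid that.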

jdidion commented 4 years ago

@plijnzaad please reopen if this doesn't solve your issue.

jdidion / atropos

N-prefixes / wildcards before an adapter sequence leads to shortening all reads at the 3'-end ? #110
