marcelm / cutadapt

Cutadapt removes adapter sequences from sequencing reads
https://cutadapt.readthedocs.io
MIT License

5' adapters that are incomplete at the tail #550

Open lokapal opened 3 years ago

lokapal commented 3 years ago

Hello, Marcel!

Is there any way to remove 5' adapters that are incomplete at the tail rather than at the head? E.g. my adapter is MYVERYLONGADAPTER and I have a lot of reads like

read1 MYVERYLONGAmysequence1
read2 MYVERYLOmysequence2

The adapters are REALLY long, so in any case the part that is left is longer than 30 bp and quite unique. I don't know exactly how much of the adapter is left in any given read. Surely I can supply cutadapt with a list of 40 or so adapters covering every length between the minimal and the full one, but how will cutadapt treat it? I mean something like:

-g MYVERYLONGADAPTER \
-g MYVERYLONGADAPTE \
-g MYVERYLONGADAPT \
-g MYVERYLONGADAP \
-g MYVERYLONGADA \
-g MYVERYLONGAD \
-g MYVERYLONGA \
-g MYVERYLONG \
-g MYVERYLON \
-g MYVERYLO

Should I put them from the longer to the shorter?

Thanks in advance!

marcelm commented 3 years ago

Interesting, can you elaborate a little bit on why you get this type of data? I just wonder whether it would be worth adding support for this to Cutadapt (no promises, though).

In any case, you’ll currently have to provide all possible adapter prefixes manually, similar to what you did above. However, you should follow these recommendations, that is,

If you cannot follow the above, trimming will be quite slow (but it’ll still work). Also, you can put the sequences in a FASTA file for convenience.

Should I put them from the longer to the shorter?

It should not matter in this case in which order you provide the sequences.
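The prefix list above does not have to be typed by hand. The following is a minimal sketch, not part of Cutadapt, that generates every prefix between a chosen minimum length and the full adapter and formats them as FASTA. The function names, the record names, and the `anchored` switch are assumptions made for illustration; whether anchoring (a leading `^`) is appropriate depends on your data.

```python
def adapter_prefixes(adapter: str, min_length: int) -> list[str]:
    """Return all prefixes of `adapter`, from full length down to `min_length`."""
    if not 1 <= min_length <= len(adapter):
        raise ValueError("min_length must be between 1 and len(adapter)")
    return [adapter[:n] for n in range(len(adapter), min_length - 1, -1)]


def prefixes_to_fasta(
    adapter: str,
    min_length: int,
    name: str = "adapter",   # hypothetical record name prefix
    anchored: bool = False,  # assumption: prepend '^' to anchor at the 5' end
) -> str:
    """Format the prefixes as FASTA text, one record per prefix."""
    lines = []
    for prefix in adapter_prefixes(adapter, min_length):
        lines.append(f">{name}_{len(prefix)}")
        lines.append(("^" if anchored else "") + prefix)
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Placeholder adapter from this thread; substitute the real sequence.
    print(prefixes_to_fasta("MYVERYLONGADAPTER", min_length=8), end="")
```

Redirecting the output to a file such as prefixes.fasta (a name chosen here) lets you pass all prefixes at once with `-g file:prefixes.fasta` instead of repeating `-g` on the command line.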

lokapal commented 3 years ago

As a matter of fact, my current reads have THREE adapters: one is complete; the second is broken (sometimes at the head with the end present, sometimes at the tail with the head present); the third is the Illumina/PE adapter, usually (but not always!) at the 3' end. These are sequenced 4C libraries (from a different experiment, but the technology is basically the same): https://www.sciencedirect.com/science/article/pii/S1046202318304742. There are two adapters, A1 and A2: A1D means A1 direct, A1RC means A1 reverse complement, A2D means A2 direct, and A2RC means A2 reverse complement.

Example reads (marked up) are in the attached file reads.fa.gz.

y9c commented 3 years ago

Hi @marcelm, I have a similar problem, but a more common one.

Given a read with a 5' adapter, e.g. ALONGADAPTORsequence, if the read is low quality at the end or has a polyG tail, cutadapt will trim it to ALONGADAPTORseq (1st case) or ALONGADAP (2nd case). The -g argument then removes the adapter in the 1st case but not in the 2nd, which causes adapter contamination in the filtered reads.
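To make the two cases concrete, here is a toy model in plain Python. It is not Cutadapt's matching algorithm (no error tolerance, no partial matches); both function names are hypothetical. It only illustrates why a read truncated inside the adapter slips through an exact 5' adapter check, and how such a read could be recognized.

```python
def trim_5prime_exact(read: str, adapter: str) -> str:
    """Toy model: remove the adapter only if it is fully present at the read start."""
    return read[len(adapter):] if read.startswith(adapter) else read


def is_truncated_adapter(read: str, adapter: str) -> bool:
    """True if the whole read is a proper prefix of the adapter (no insert left)."""
    return len(read) < len(adapter) and adapter.startswith(read)


adapter = "ALONGADAPTOR"
case1 = "ALONGADAPTORseq"  # truncated after the adapter: adapter is removed
case2 = "ALONGADAP"        # truncated inside the adapter: read passes unchanged

trimmed1 = trim_5prime_exact(case1, adapter)
trimmed2 = trim_5prime_exact(case2, adapter)
flagged2 = is_truncated_adapter(case2, adapter)
```

In this model the second read consists only of adapter sequence, so a check like `is_truncated_adapter` would mark it for discarding rather than trimming.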

marcelm commented 3 years ago

@yech1990 Thanks for reporting! I have opened a separate issue (#565) as this needs to be fixed in a different way than the problem that @lokapal has.