marcelm / cutadapt

Cutadapt removes adapter sequences from sequencing reads
https://cutadapt.readthedocs.io
MIT License
523 stars, 130 forks

[Question] linked adapters from Fasta file and discarding untrimmed pairs #806

Open marchoeppner opened 2 months ago

marchoeppner commented 2 months ago

Hi,

sorry, this is more of a question (or a feature request, we'll see).

We are processing amplicon data, from which we want to trim PCR primers. The amplicon is shorter than the individual paired-end reads, which creates some challenges: R1 and R2 both contain the fwd AND the rev primer sequence, which requires a specific trimming approach.

The primers may or may not be degenerate, so we generate a disambiguated FASTA file prior to trimming, as well as a reverse-complemented one (the --revcomp function didn't seem to do what we expected it to do in our case).

Some dummy code of what this looks like at the moment:

options_5p = "-g file:${primers} -G file:${primers}"
options_3p = "-a file\$:${primers_rc} -A file\$:${primers_rc}"

 cutadapt --cores $task.cpus \\
            --discard-untrimmed \\
            --revcomp \\
            $args \\
            $reads \\
            $trimmed \\
            $options_5p \\
            $options_3p \\
            --times=2 \\
            -Z \\
            --json=$report \\
            > ${prefix}.cutadapt.log

This currently has an issue: it does not guarantee that reads which are not trimmed on both ends in both R1 and R2 are discarded. The alternative would be to pipe the whole process, first in the forward direction and then in reverse, each step discarding any untrimmed reads. But if we do that, we do not get the nice JSON report.

I am guessing what we would want is something that behaves like a linked adapter. But that does not seem to apply in our case since we have any number of disambiguated primer sequences. Am I missing anything here?

marcelm commented 2 months ago

Can you clarify what you mean by "disambiguated FASTA file"? In which way are the primers degenerate? Note that adapter/primer sequences can contain IUPAC wildcards, maybe that helps?
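
For reference, one possible reading of "disambiguated" is that each degenerate primer is expanded into every concrete sequence it encodes, and that a reverse-complemented copy is then derived from those. The following is only a minimal Python sketch of that idea, assuming the standard IUPAC nucleotide codes. The function names `expand_primer` and `reverse_complement` are hypothetical and are not part of cutadapt, and the primer in the usage lines is made up.

```python
from itertools import product

# Standard IUPAC nucleotide codes mapped to the bases they stand for.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT",
    "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG",
    "N": "ACGT",
}

# Complement of every IUPAC code (for example R = A/G pairs with Y = C/T).
COMPLEMENT = str.maketrans("ACGTRYSWKMBDHVN", "TGCAYRSWMKVHDBN")


def expand_primer(primer):
    """Return all concrete sequences encoded by a degenerate primer."""
    choices = [IUPAC[base] for base in primer.upper()]
    return ["".join(combo) for combo in product(*choices)]


def reverse_complement(seq):
    """Reverse-complement a sequence, keeping IUPAC codes valid."""
    return seq.upper().translate(COMPLEMENT)[::-1]


# Hypothetical degenerate primer, used purely for illustration.
primer = "ACRTN"
variants = expand_primer(primer)
variants_rc = [reverse_complement(v) for v in variants]
```

Since the number of variants grows multiplicatively with each degenerate position, passing the primer with its IUPAC wildcards intact, as suggested above, may make this expansion step unnecessary.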