marcelm / cutadapt

Cutadapt removes adapter sequences from sequencing reads
https://cutadapt.readthedocs.io
MIT License
523 stars 130 forks source link

trimmed first read ends up in --untrimmed_output #590

Closed FabianRoger closed 2 years ago

FabianRoger commented 2 years ago
This is cutadapt 3.3 with Python 3.8.2

cutadapt -a 'GGWACWGGWTGAACWGTWTAYCCYCC;required'...TGRTTYTTYGGNCAYCCNGARGTNTA -A 'TANACYTCNGGRTGNCCRAARAAYCA;required'...GGRGGRTAWACWGTTCAWCCWGTWCC 
--untrimmed-output /untrimmed/file_F_.fastq.gz 
--untrimmed-paired-output /untrimmed_COI/file_R_.fastq.gz 
--minimum-length=100 
--cores=8 
-o trimmed/file_F_.fastq.gz 
-p trimmed/file_R_.fastq.gz 
raw/file_F_.fastq.gz 
raw/file_R_.fastq.gz 

I use the command above to filter and trim paired reads that have been amplified with a specific primer pair and have been pooled with reads amplified with a different primer pair during sequencing. The aim is thus to demultiplex based on primer and trim the primers in the process.

However I need to keep the untrimmed reads which will be processed further, too.

I use linked adapters withe PRIMER....REVERSECOMPLEMENT because the amplified fragment can be shorter than 300 bp, so the reverse complement of the primers in forward direction can also be present on the reads.

The code works as expected (it filters out the right reads) but for some reason that I do not understand - and only for the forward read - it seems to trim the adaptor from the reads that gets redirected to the untrimmed output and not trim the adaptor from the read that is kept.

If it isn't a know problem or obvious where I went wrong, I can provide a reproducible example.

Thanks!

marcelm commented 2 years ago

I want to first make sure that you are using the right adapter type. You specify the 5' part of the R1 adapter as GGWACWGGWTGAACWGTWTAYCCYCC;required, that is, it is not anchored. This means that 1) bases can precede the adapter/primer occurrence and 2) the occurrence can be partial. Partial occurrences can be as short as YCC since the minimum overlap length is 3 by default. Is that you what you intended? Usually, the primers occur right at the beginning of a read (and the occurrences are in full), which is why the docs sugest to prefix the 5' adapter by ^ (and then the adapter automatically becomes required).

it seems to trim the adaptor from the reads that gets redirected to the untrimmed output and not trim the adaptor from the read that is kept.

Did you inspect both reads when you looked at the output files? Perhaps this has to do with how filtering of paired-end reads works. By default, a read pair is counted as "untrimmed" if R1 or R2 or both are untrimmed. In particular, if R1 is trimmed, but R2 is not, the pair is considered to be untrimmed and the read pair ends up in the --untrimmed-output files.

A reproducible example would be good if this doesn’t explain the problem.

FabianRoger commented 2 years ago

Thanks for the quick answer!

I am aware that I can anchor the adapter but in my experience, after de-multiplexing, there are often trailing bases before the forward primer (i.e. 3 bp, then the forward primer). This is also the case here (I am unsure why - these are not my files & I haven't done the demultiplexing). Therefore anchoring doesn't work as long as it is not possible to specify a region in which it should occur (i.e. starting within the first n bases).

The minimum overlap however is a good point, I probably should specify a minimum overlap of the length of the primer for the first of the linked adapters.

In particular, if R1 is trimmed, but R2 is not, the pair is considered to be untrimmed and the read pair ends up in the --untrimmed-output files.

yes, this is the behavior I want. However, I would expect that if the conditions for trimming are not met, neither of the pairs would be trimmed? I.e. I would not expect the adapter be trimmed form R1 and, if not found in R2, to be send to untrimmed without the adapter?

I will try to make a reproducible example.

marcelm commented 2 years ago

Therefore anchoring doesn't work as long as it is not possible to specify a region in which it should occur (i.e. starting within the first n bases).

Setting an appropriate minimum overlap is probably good enough, but you could achieve this by prefixing the adapter sequence with a number of N bases: -a N{5}GGWACWG (etc.). This isn’t anchoring, though, so you’d still need the ;required.

I would expect that if the conditions for trimming are not met, neither of the pairs would be trimmed?

You mean "neither of the reads"? No, the trimming happens independently.

This is a current limitation in Cutadapt, somewhat related to #332: Cutadapt at the moment does not really make use of the fact that R1 and R2 are different ends of the same molecule.

You can add --pair-adapters, though, to get closer to the behavior you describe: Then either none or both reads will be trimmed. (The documentation should maybe make this use case clearer.) What can still happen is that R1 and R2 are trimmed inconsistently, for example, that the trimmed R1 has a different length from R2, but I’d argue that this would be rare (and you can alleviate it a bit by increasing the maximum allowed error rate).

FabianRoger commented 2 years ago

Adding the --pair-adapters flag indeed solves the trimming of the redirected reads.

For the not trimming of the "trimmed reads" - the behavior that really puzzled me - it was an error on my side. The forward reads are (for some reason) 326 bp long, so the length of 300 bp after trimming was trimmed. Sorry for that.

Thanks for your help!