marcelm / cutadapt

Cutadapt removes adapter sequences from sequencing reads
https://cutadapt.readthedocs.io
MIT License
514 stars 129 forks source link

How to specify read 2 adapters if 5' and 3' adapters are different. #711

Closed FischyM closed 1 year ago

FischyM commented 1 year ago

My use case is that I have transposon insertions as paired-end reads that I'd like trimmed from 5' and 3' ends for both read 1 and read 2. For this setup, R1 has the transposon on 5' and an adapter on 3' end. I also want these sequences trimmed from read 2, but I am unsure exactly how I should give the sequences for options -G and -A. The examples said which reads go where for ADAPTER_FWD and ADAPTER_REV, but didn't give DNA representation for read 2 (I think?). Below are the sequences I want to trim from reads 1 and 2 and how I am currently running cutadapt (appropriate options for input and output are used but removed here for clarity).

Read 1:  5’ - GGATTAAATGTCAGGAATTGTGAAAA   [GENOMIC]   TACCCATACGACGTCCCAGA - 3’
Read 2:  3’ - TTTTCACAATTCCTGACATTTAATCC   [CIMONEG]   TCTGGGACGTCGTATGGGTA - 5’
tpn=GGATTAAATGTCAGGAATTGTGAAAA
adaptor=TACCCATACGACGTCCCAGA
tpn_rc=TTTTCACAATTCCTGACATTTAATCC
adaptor_rc=TCTGGGACGTCGTATGGGTA
cutadapt --discard-untrimmed -g ${tpn} -a ${adaptor} -A ${tpn_rc} -G ${adaptor_rc}

So should -A be given to cutadapt as -A TCTGGGACGTCGTATGGGTA or should it read from the 5' directions as -A ATGGGTATGCTGCAGGGTCT ? The same question applies to -G.

One other clarifying question. Will cutadapt only keep reads where both adapters at 5' and 3' ends were found, or do I need to run cutadapt in separate steps to remove 5' adapters first, then remove 3' adapters, or be using --pair-adapters? The documentation on Paired adapters was a little confusing (also seems like a typo and should be -A TTTT)

cutadapt --pair-adapters -a AAAAA -a GGGG -A CCCCC -a TTTT -o out.1.fastq -p out.2.fastq in.1.fastq in.2.fastq

vs.

cutadapt --discard-untrimmed -g ${tpn} -G ${adaptor_rc}
cutadapt --discard-untrimmed -a ${adaptor} -A ${tpn_rc}

or

cutadapt --discard-untrimmed --pair-adapters -g ${tpn} -G ${adaptor_rc} -a ${adaptor} -A ${tpn_rc}

Thanks for your time.

marcelm commented 1 year ago

Hi, the basic rule is that Cutadapt just searches for the given adapter unchanged in the read. It doesn’t do any reverse complementing so if you want to search for a reverse-complemented sequence in R2, you need to provide the reverse-complemented sequence as the adapter to search for.

I’m confused by the second line in your illustration:

Read 1:  5’ - GGATTAAATGTCAGGAATTGTGAAAA   [GENOMIC]   TACCCATACGACGTCCCAGA - 3’
Read 2:  3’ - TTTTCACAATTCCTGACATTTAATCC   [CIMONEG]   TCTGGGACGTCGTATGGGTA - 5’

If I interpret correctly, then I think you should either write it like this:

Read 1:  5’ - GGATTAAATGTCAGGAATTGTGAAAA   [GENOMIC]   TACCCATACGACGTCCCAGA - 3’
Read 2:  3’ - CCTAATTTACAGTCCTTAACACTTTT   [CIMONEG]   ATGGGTATGCTGCAGGGTCT - 5’

Then you see the basepairing (the second line is complemented, but not reversed).

Or you write it like this:

Read 1:  5’ - GGATTAAATGTCAGGAATTGTGAAAA   [GENOMIC]   TACCCATACGACGTCCCAGA - 3’

Read 2:  5’ - ATGGGTATGCTGCAGGGTCT [CIMONEG] CCTAATTTACAGTCCTTAACACTTTT - 3’

This would describe what comes off the sequencer (that is, what is in the FASTQ files).

Is the "adapter" actually a primer? If so, it appears to me the molecule on the sequencer looks like this:

5'sequencing-adapter - transposon - genomic-sequence - primer - 3'sequencing-adapter

R1 is sequenced in the forward direction starting after the 5' sequencing adapter, so you get:

R1: transposon - genomic-sequence - primer - 3'sequencing-adapter

R2 is sequenced in the reverse direction starting from just before the 3' adapter, so you get:

R2: 3'primer-revcomp - genomic-sequence-revcomp - transposon-revcomp - 5'sequencing-adapter-revcomp

It appears you want to trim everything except the genomic-sequence(-revcomp). To remove both a 5' and a 3' adapter at the same time, you can use linked adapters. This would look something like this:

cutadapt -g ${transposon}...${primer} -G ${transposon_revcomp}...${primer_revcomp}

I’m not sure whether my understanding of your molecule layout is correct, but maybe you can adjust it for your case with the explanations I gave.

Variations: If the transposon is always in full at the beginning of the read, preceed it with ^. So use -g ^${transposon}...${primer etc. If the primer may not always occur in full, use -a instead of -g.

Will cutadapt only keep reads where both adapters at 5' and 3' ends were found,

In short: Yes. In detail, if you use --discard-untrimmed, then the read pair will be discarded if no adapter was found on R1 or on R2. Then the question is what "no adapter found" means for linked adapters because they have two parts and it can happen that one part is found but not the other. See the documentation for a description (keyword: "required" and "optional").

or do I need to run cutadapt in separate steps to remove 5' adapters first, then remove 3' adapters, or be using --pair-adapters? The documentation on Paired adapters was a little confusing (also seems like a typo and should be -A TTTT)

Paired adapters are mainly for demultiplexing Illumina dual indices. They are not needed in case you just want to remove adapters from paired-end reads (which is why the section is maybe confusing, sorry about that).

also seems like a typo and should be -A TTTT

Thanks, fixed!

FischyM commented 1 year ago

Thanks for your detailed response, it was very helpful.

Thanks again for taking the time to help, it really cleared up cutadapt as well as my own understanding of paired-end sequencing.

marcelm commented 1 year ago

Thanks for the clarification and you’re welcome!

  • I don't believe a linked adapter would work in my use case, as the genomic regions can potentially be up to 800 bp long, which is longer than an Illumina read.

Linked adapters are intended for this use case: If you specify your linked adapter with -a ^${tpn}...${primer}, then the 3' adapter (${primer}) is optional. If it is not found, it will just not be trimmed. The read will only be considered "untrimmed" if the 5' adapter is missing (so --discard-untrimmed works as intended).

cutadapt --discard-untrimmed -g ^${tpn} -G ${primer_c} -o ${trim1_f} -p ${trim1_r} ${read_f} ${read_r}
cutadapt -a ${primer} -A ${tpn_c} -o ${trim2_f} -p ${trim2_r} ${trim1_f} ${trim1_r}

With linked adapters, you can combine this into one command:

cutadapt --discard-untrimmed -a ^${tpn}...${primer} -A ^${primer_c}...${tpn_c} -o ${trim1_f} -p ${trim1_r} ${read_f} ${read_r}
FischyM commented 1 year ago

Ok, I think I see how this is supposed to work. I originally didn't think that read 2 would also be anchored and the way to specify the linked adapters is cleared up for me. I'll close this issue after I rerun my workflow and verify it's all working. Thanks again!

FischyM commented 1 year ago

Closing this issue, but specifying both linked adapters as anchors did not work for anyone coming across this. I don't have access to the complete protocol used for the data I'm working on, but using the same command without anchors returned the most reads for me. The linked adapter was definitively the right way to go as I'm pretty confident with the output generated.