Closed FischyM closed 1 year ago
Hi, the basic rule is that Cutadapt just searches for the given adapter unchanged in the read. It doesn’t do any reverse complementing so if you want to search for a reverse-complemented sequence in R2, you need to provide the reverse-complemented sequence as the adapter to search for.
I’m confused by the second line in your illustration:
Read 1: 5’ - GGATTAAATGTCAGGAATTGTGAAAA [GENOMIC] TACCCATACGACGTCCCAGA - 3’
Read 2: 3’ - TTTTCACAATTCCTGACATTTAATCC [CIMONEG] TCTGGGACGTCGTATGGGTA - 5’
If I interpret correctly, then I think you should either write it like this:
Read 1: 5’ - GGATTAAATGTCAGGAATTGTGAAAA [GENOMIC] TACCCATACGACGTCCCAGA - 3’
Read 2: 3’ - CCTAATTTACAGTCCTTAACACTTTT [CIMONEG] ATGGGTATGCTGCAGGGTCT - 5’
Then you see the basepairing (the second line is complemented, but not reversed).
Or you write it like this:
Read 1: 5’ - GGATTAAATGTCAGGAATTGTGAAAA [GENOMIC] TACCCATACGACGTCCCAGA - 3’
Read 2: 5’ - ATGGGTATGCTGCAGGGTCT [CIMONEG] CCTAATTTACAGTCCTTAACACTTTT - 3’
This would describe what comes off the sequencer (that is, what is in the FASTQ files).
Is the "adapter" actually a primer? If so, it appears to me the molecule on the sequencer looks like this:
5'sequencing-adapter - transposon - genomic-sequence - primer - 3'sequencing-adapter
R1 is sequenced in the forward direction starting after the 5' sequencing adapter, so you get:
R1: transposon - genomic-sequence - primer - 3'sequencing-adapter
R2 is sequenced in the reverse direction starting from just before the 3' adapter, so you get:
R2: 3'primer-revcomp - genomic-sequence-revcomp - transposon-revcomp - 5'sequencing-adapter-revcomp
It appears you want to trim everything except the genomic-sequence(-revcomp). To remove both a 5' and a 3' adapter at the same time, you can use linked adapters. This would look something like this:
cutadapt -g ${transposon}...${primer} -G ${transposon_revcomp}...${primer_revcomp}
I’m not sure whether my understanding of your molecule layout is correct, but maybe you can adjust it for your case with the explanations I gave.
Variations: If the transposon is always in full at the beginning of the read, preceed it with ^
. So use -g ^${transposon}...${primer
etc. If the primer may not always occur in full, use -a
instead of -g
.
Will cutadapt only keep reads where both adapters at 5' and 3' ends were found,
In short: Yes. In detail, if you use --discard-untrimmed
, then the read pair will be discarded if no adapter was found on R1 or on R2. Then the question is what "no adapter found" means for linked adapters because they have two parts and it can happen that one part is found but not the other. See the documentation for a description (keyword: "required" and "optional").
or do I need to run cutadapt in separate steps to remove 5' adapters first, then remove 3' adapters, or be using --pair-adapters? The documentation on Paired adapters was a little confusing (also seems like a typo and should be -A TTTT)
Paired adapters are mainly for demultiplexing Illumina dual indices. They are not needed in case you just want to remove adapters from paired-end reads (which is why the section is maybe confusing, sorry about that).
also seems like a typo and should be -A TTTT
Thanks, fixed!
Thanks for your detailed response, it was very helpful.
You're right, the correct way to write the example paired-end transposon would be as follows, thanks for the help!
Read 1: 5’ - GGATTAAATGTCAGGAATTGTGAAAA [GENOMIC] TACCCATACGACGTCCCAGA - 3’
Read 2: 3’ - CCTAATTTACAGTCCTTAACACTTTT [CIMONEG] ATGGGTATGCTGCAGGGTCT - 5’
To clarify, my original use of the term "adapter" was not intended to refer to the Illumina adapters, but to a custom adapter sequence that you relabeled "primer". Sticking with your nomenclature, the molecule would look like this:
5'sequencing-adapter - transposon - genomic-sequence - primer/custom-adapter - 3'sequencing-adapter
and the fasta files reads should then look like this
R1: transposon - genomic-sequence - primer - 3'sequencing-adapter
R2: primer-revcomp - genomic-sequence-revcomp - transposon-revcomp - 5'sequencing-adapter-revcomp
I don't believe a linked adapter would work in my use case, as the genomic regions can potentially be up to 800 bp long, which is longer than an Illumina read.
Therefore, I believe the arguments I should pass to cutadapt are as follows
tpn=GGATTAAATGTCAGGAATTGTGAAAA
primer=TACCCATACGACGTCCCAGA
tpn_c=CCTAATTTACAGTCCTTAACACTTTT
primer_c=ATGGGTATGCTGCAGGGTCT
cutadapt --discard-untrimmed -g ^${tpn} -G ${primer_c} -o ${trim1_f} -p ${trim1_r} ${read_f} ${read_r}
cutadapt -a ${primer} -A ${tpn_c} -o ${trim2_f} -p ${trim2_r} ${trim1_f} ${trim1_r}
Since the genomic region can be larger than the typical Illumina 150 bp reads, it is possible that the 3' sequencing adapter would not be present in read 1 or read 2, however there should always be a 5' sequencing adapter.
Thanks again for taking the time to help, it really cleared up cutadapt as well as my own understanding of paired-end sequencing.
Thanks for the clarification and you’re welcome!
- I don't believe a linked adapter would work in my use case, as the genomic regions can potentially be up to 800 bp long, which is longer than an Illumina read.
Linked adapters are intended for this use case: If you specify your linked adapter with -a ^${tpn}...${primer}
, then the 3' adapter (${primer}
) is optional. If it is not found, it will just not be trimmed. The read will only be considered "untrimmed" if the 5' adapter is missing (so --discard-untrimmed
works as intended).
cutadapt --discard-untrimmed -g ^${tpn} -G ${primer_c} -o ${trim1_f} -p ${trim1_r} ${read_f} ${read_r} cutadapt -a ${primer} -A ${tpn_c} -o ${trim2_f} -p ${trim2_r} ${trim1_f} ${trim1_r}
With linked adapters, you can combine this into one command:
cutadapt --discard-untrimmed -a ^${tpn}...${primer} -A ^${primer_c}...${tpn_c} -o ${trim1_f} -p ${trim1_r} ${read_f} ${read_r}
Ok, I think I see how this is supposed to work. I originally didn't think that read 2 would also be anchored and the way to specify the linked adapters is cleared up for me. I'll close this issue after I rerun my workflow and verify it's all working. Thanks again!
Closing this issue, but specifying both linked adapters as anchors did not work for anyone coming across this. I don't have access to the complete protocol used for the data I'm working on, but using the same command without anchors returned the most reads for me. The linked adapter was definitively the right way to go as I'm pretty confident with the output generated.
My use case is that I have transposon insertions as paired-end reads that I'd like trimmed from 5' and 3' ends for both read 1 and read 2. For this setup, R1 has the transposon on 5' and an adapter on 3' end. I also want these sequences trimmed from read 2, but I am unsure exactly how I should give the sequences for options -G and -A. The examples said which reads go where for ADAPTER_FWD and ADAPTER_REV, but didn't give DNA representation for read 2 (I think?). Below are the sequences I want to trim from reads 1 and 2 and how I am currently running cutadapt (appropriate options for input and output are used but removed here for clarity).
So should -A be given to cutadapt as
-A TCTGGGACGTCGTATGGGTA
or should it read from the 5' directions as-A ATGGGTATGCTGCAGGGTCT
? The same question applies to -G.One other clarifying question. Will cutadapt only keep reads where both adapters at 5' and 3' ends were found, or do I need to run cutadapt in separate steps to remove 5' adapters first, then remove 3' adapters, or be using --pair-adapters? The documentation on Paired adapters was a little confusing (also seems like a typo and should be -A TTTT)
vs.
or
Thanks for your time.