ErminZ closed this issue 4 months ago
Hi @ErminZ - this is expected for dorado duplex, since we don't perform any explicit adapter/primer trimming in that pipeline right now. However, because of how duplex works, the duplex reads themselves will have the adapters and primers trimmed off. Their simplex parents will still contain them, which is why you see such a high percentage.
You can run the output reads through dorado trim to remove the adapters/primers from the simplex parents.
P.S. - Currently duplex isn't designed to work well with amplicon data. However, since your amplicons are pretty long, it might not be an issue.
Thank you for your reply @tijyojwad! It helps to have clarity on the output reads. I have a follow-up question: does the default 'dorado duplex' run trim primers from simplex reads that are not related to a duplex read? Our samples usually have a duplex rate of about 30%, and around 10% of the reads are duplex reads, as shown below.
[2024-02-06 19:53:48.027] [info] > Simplex reads basecalled: 3902930
[2024-02-06 19:53:48.029] [info] > Simplex reads filtered: 395275
[2024-02-06 19:53:48.029] [info] > Duplex reads basecalled: 487828
[2024-02-06 19:53:48.029] [info] > Duplex reads filtered: 2645
[2024-02-06 19:53:48.029] [info] > Duplex rate: 26.714087%
[2024-02-06 19:53:48.030] [info] > Basecalled @ Bases/s: 1.296412e+06
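As a rough check of the "around 10%" figure, the read counts in this summary can be turned into a share of duplex reads by count. This is only a sketch: the parsing helper and variable names are illustrative, the share is defined here as duplex reads over all basecalled reads (simplex plus duplex), and the logged "Duplex rate" is a separate figure reported by dorado that is not recomputed here.

```python
# Summary lines copied from the log above.
LOG = """\
[2024-02-06 19:53:48.027] [info] > Simplex reads basecalled: 3902930
[2024-02-06 19:53:48.029] [info] > Simplex reads filtered: 395275
[2024-02-06 19:53:48.029] [info] > Duplex reads basecalled: 487828
[2024-02-06 19:53:48.029] [info] > Duplex reads filtered: 2645
[2024-02-06 19:53:48.029] [info] > Duplex rate: 26.714087%
[2024-02-06 19:53:48.030] [info] > Basecalled @ Bases/s: 1.296412e+06
"""


def parse_summary(log_text: str) -> dict:
    """Map each 'name: value' pair after the '> ' marker to a string value."""
    stats = {}
    for line in log_text.splitlines():
        if "> " not in line:
            continue
        key, value = line.split("> ", 1)[1].rsplit(": ", 1)
        stats[key] = value
    return stats


stats = parse_summary(LOG)
simplex = int(stats["Simplex reads basecalled"])
duplex = int(stats["Duplex reads basecalled"])

# Share of duplex reads among all basecalled reads, by read count (assumed definition).
duplex_read_share = duplex / (simplex + duplex)
print(f"Duplex reads by count: {duplex_read_share:.1%}")
```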
I have been focusing on using reads with the correct PCR primer configuration to improve the precision of structural variant detection. However, I am still working out how to interpret the data below. If dorado duplex does NOT trim primers from simplex reads that are not related to any duplex read, can I infer that the 'single_primer1', 'single_primer2', and 'primer1_primer2' reads together (25% of total simplex reads) are of good quality? Are the 'no_primers' reads also good?
Primer configuration of simplex reads after removing the 486k duplex reads:
other (contains >2 primers) 1499081
double_primer1 1170891
single_primer1 729083
single_primer2 198589
no_primers 192723 (5% of total simplex reads)
double_primer2 143854
primer1_primer2 89884
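The percentages quoted with this table can be reproduced from the counts. The sketch below uses the numbers exactly as posted; the dictionary keys and variable names are just labels for illustration.

```python
# Read counts per primer configuration, copied from the table above.
counts = {
    "other": 1499081,            # reads containing more than 2 primers
    "double_primer1": 1170891,
    "single_primer1": 729083,
    "single_primer2": 198589,
    "no_primers": 192723,
    "double_primer2": 143854,
    "primer1_primer2": 89884,
}

total = sum(counts.values())
shares = {name: n / total for name, n in counts.items()}

# The three configurations asked about in the question.
candidates = ("single_primer1", "single_primer2", "primer1_primer2")
candidate_share = sum(shares[name] for name in candidates)

print(f"total simplex reads in table: {total}")
print(f"single_primer1 + single_primer2 + primer1_primer2: {candidate_share:.1%}")
print(f"no_primers: {shares['no_primers']:.1%}")
```

Run on these counts, the three configurations together come to about 25% and 'no_primers' to about 5%, matching the figures in the post.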
Hi @ErminZ - dorado duplex will not trim anything from the simplex reads (either primers or barcodes). So all the simplex reads should be of "good" quality (I would probably also filter on the mean q score tag, qs:i, just to be sure). Whether or not they are usable for your pipeline depends on what characteristics of the reads you're looking for, but in general I wouldn't say there's anything wrong with them.
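A filter on the mean q score could look roughly like the sketch below. It assumes the reads are available as SAM-format text lines that carry the qs tag mentioned above; the code matches on the tag name only, since the type letter may differ between versions. The function names, the toy records, and the threshold of 10 are illustrative, not part of dorado.

```python
from typing import Iterable, Iterator, Optional


def mean_qscore(sam_line: str) -> Optional[float]:
    """Return the value of the qs tag of one SAM record, or None if it is absent."""
    fields = sam_line.rstrip("\n").split("\t")
    # Optional TAG:TYPE:VALUE fields start at column 12 of a SAM record.
    for field in fields[11:]:
        parts = field.split(":", 2)
        if len(parts) == 3 and parts[0] == "qs":
            return float(parts[2])
    return None


def filter_by_qscore(sam_lines: Iterable[str], min_qscore: float = 10.0) -> Iterator[str]:
    """Yield header lines unchanged and only records with qs >= min_qscore."""
    for line in sam_lines:
        if line.startswith("@"):
            yield line
            continue
        q = mean_qscore(line)
        if q is not None and q >= min_qscore:
            yield line


# Toy records for illustration only (not real dorado output).
records = [
    "@HD\tVN:1.6\tSO:unknown",
    "read_a\t4\t*\t0\t0\t*\t*\t0\t0\tACGT\t!!!!\tqs:i:14",
    "read_b\t4\t*\t0\t0\t*\t*\t0\t0\tACGT\t!!!!\tqs:i:7",
]
kept = list(filter_by_qscore(records, min_qscore=10))
print(len(kept))
```

Records without a qs tag are dropped in this sketch; whether that is the right choice depends on how the file was produced.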
Reads still contain PCR primers after the default base-calling trimming
Hello, thank you for developing the new basecaller and improving read quality. I have a question about the primer trimming step, which covers detecting and removing PCR primers. Could you help me with the following questions?
PCR primers in most reads in amplicon-based DNA sequencing
I have an amplicon-based DNA dataset, basecalled with Dorado duplex to generate a FASTQ file with the default primer trimming. Even so, many reads still contain primer sequences, as counted with grep:
zcat input.fastq.gz | grep forward_primer_sequences | grep reverse_primer_sequences | wc -l
Also, about 15% of reads contain weird PCR primer combinations, such as a read containing multiple copies of a, or a read containing both a and d. In theory, a DNA molecule looks like the following:
5'-a----b->3'
3'<-c--d-5'
a. Forward primer; b. Reverse primer; c. Reverse complement of the forward primer; d. Reverse complement of the reverse primer.
47% of the reads contain forward primers (a or c); 19% of reads contain reverse primers (b or d); 3% of reads contain both in the correct configuration.
Could you explain why so many reads still contain PCR primers after Dorado primer trimming?
Is it a good idea to filter out reads containing weird primer combinations (possible sequencing artifacts and concatemers?) to improve the accuracy of downstream mutation (structural variation) detection?
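To make the filtering idea concrete, here is a minimal sketch of how reads could be sorted by primer configuration before deciding which ones to keep. It counts exact matches of each primer in both orientations (a and c for the forward primer, b and d for the reverse primer). The category labels, the toy primer sequences, and the function names are hypothetical. Exact matching, like grep, misses primers that contain basecalling errors, so a real pipeline would likely need approximate matching.

```python
from collections import Counter

_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def primer_configuration(read: str, fwd: str, rev: str) -> str:
    """Label a read by how many copies of each primer it contains (either strand)."""
    n_fwd = read.count(fwd) + read.count(reverse_complement(fwd))  # a or c
    n_rev = read.count(rev) + read.count(reverse_complement(rev))  # b or d
    labels = {
        (0, 0): "no_primers",
        (1, 0): "single_primer1",
        (0, 1): "single_primer2",
        (2, 0): "double_primer1",
        (0, 2): "double_primer2",
        (1, 1): "primer1_primer2",
    }
    # Anything with more than two primer hits in total falls through to "other".
    return labels.get((n_fwd, n_rev), "other")


# Toy primers and reads for illustration only.
FWD = "AAACCC"
REV = "AGAGTC"
reads = [
    "AAACCCTTTTTTTTGACTCT",        # forward primer + reverse complement of reverse primer
    "TTTTTTTT",                    # no primer at all
    "AAACCCTTTTAAACCCTTTTGGGTTT",  # three copies of the forward primer (either strand)
    "AAACCCTTTT",                  # forward primer only
]
tally = Counter(primer_configuration(r, FWD, REV) for r in reads)
print(tally)
```

Reads labelled "other" or "double_*" here are the candidates for the artifact/concatemer filter asked about above; whether dropping them helps structural variant calling would need to be checked on real data.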
Run environment:
dorado duplex sup ${pod5_directory} > ${runid}_R1_001.fastq --min-qscore 10 --emit-fastq