ErminZ commented 5 months ago

Reads still contain PCR primers after the default base-calling trimming

Hello, thank you for developing the new basecaller and improving read quality. I have some questions about the primer trimming step, specifically how PCR primers are detected and removed. Could you help me understand the following?

Within how many bases of each read end does Dorado look for primers?
How does Dorado detect the PCR primer sequences?
Why does a significant proportion of reads (> 70%) still contain PCR primers, as found with grep, after default basecalling?

PCR primers in most reads in amplicon-based DNA sequencing

I have an amplicon-based DNA dataset, basecalled with Dorado duplex to generate a FASTQ file with the default primer trimming. Still, a lot of reads contain primer sequences, as counted with: zcat input.fastq.gz | grep forward_primer_sequences | grep reverse_primer_sequences | wc -l
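For reference, the grep count above can be approximated per read rather than per line. The following is only a minimal sketch under stated assumptions: 4-line FASTQ records, exact (error-free) primer matches, and hypothetical helper names (`reverse_complement`, `count_reads_with_both`) that are not part of Dorado. Unlike a plain grep, it looks only at the sequence line of each record and accepts a primer in either orientation.

```python
import gzip

# Translation table for complementing DNA bases (N stays N).
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def read_sequences(lines):
    """Yield the sequence line (2nd of every 4) from 4-line FASTQ records."""
    for i, line in enumerate(lines):
        if i % 4 == 1:
            yield line.strip().upper()


def contains_primer(seq, primer):
    """True if the primer or its reverse complement occurs exactly in seq."""
    primer = primer.upper()
    return primer in seq or reverse_complement(primer) in seq


def count_reads_with_both(lines, forward_primer, reverse_primer):
    """Count reads whose sequence contains both primers, in any orientation."""
    return sum(
        1
        for seq in read_sequences(lines)
        if contains_primer(seq, forward_primer)
        and contains_primer(seq, reverse_primer)
    )


# Example usage (file name and primer sequences are placeholders):
# with gzip.open("input.fastq.gz", "rt") as handle:
#     print(count_reads_with_both(handle, "FORWARD_SEQ", "REVERSE_SEQ"))
```

Exact substring matching will miss primers that carry basecalling errors, so counts from this sketch are a lower bound on primer-containing reads.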

Also, about 15% of reads contain weird PCR primer combinations, such as a read containing multiple copies of a, or a read containing both a and d. In theory, a DNA molecule looks like the following:

5'-a----b->3'
3'<-c--d-5'

a. Forward primer; b. Reverse primer; c. Reverse complement of the forward primer; d. Reverse complement of the reverse primer.

47% of the reads contain forward primers (a or c); 19% of reads contain reverse primers (b or d); 3% of reads contain both in the correct configuration.

Could you explain why a lot of reads still contain PCR primers after Dorado primer trimming?

Is it a good idea to filter out reads containing weird primer combinations (possible sequencing artifacts and concatemers?) to improve the accuracy of downstream mutation (structural variation) detection?

Run environment:

Dorado version: v0.5.2
Dorado command: dorado duplex sup ${pod5_directory} > ${runid}_R1_001.fastq --min-qscore 10 --emit-fastq
Hardware (CPUs, Memory, GPUs): aws batch gpu
Source data type: pod5
Source data location: aws s3
Details about data: flow cell R10.4.1; kit SQK-LSK114; read lengths: 10 kb amplicon, but N50 is shorter than 10 kb; number of reads: 5 million; total dataset size: 10 GB.

tijyojwad commented 5 months ago

Hi @ErminZ - this is expected for dorado duplex since we don't perform any explicit adapter/primer trimming in that pipeline right now. However, due to how duplex works, the duplex reads themselves will have the adapters and primers trimmed off. But their simplex parents will still contain them, which is why you see such a high percentage.

You can run the output reads through dorado trim to remove the adapters and primers from the simplex parents.
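To see which output reads are duplex reads and which are their simplex parents, the records can be grouped by the dx:i tag. This is a rough sketch, assuming the FASTQ headers carry SAM-style tags and that dx:i is 1 for duplex reads, -1 for simplex reads with duplex offspring, and 0 for simplex reads without; the function names are illustrative and not part of Dorado.

```python
def parse_tags(header):
    """Parse SAM-style TAG:TYPE:VALUE fields that follow the read ID."""
    tags = {}
    for field in header.rstrip("\n").split()[1:]:
        parts = field.split(":", 2)
        if len(parts) == 3:
            tags[parts[0]] = parts[2]
    return tags


# Assumed meaning of the dx:i values (see the lead-in above).
_DX_LABELS = {"1": "duplex", "-1": "simplex_parent", "0": "simplex_only"}


def split_by_dx(lines):
    """Group read IDs from 4-line FASTQ records by their dx:i tag."""
    groups = {"duplex": [], "simplex_parent": [], "simplex_only": [], "unknown": []}
    for i, line in enumerate(lines):
        if i % 4 == 0:  # header line of each record
            read_id = line[1:].split()[0]
            dx = parse_tags(line).get("dx")
            groups[_DX_LABELS.get(dx, "unknown")].append(read_id)
    return groups


# Example usage (file name is a placeholder):
# with open("reads.fastq") as handle:
#     groups = split_by_dx(handle)
#     print({name: len(ids) for name, ids in groups.items()})
```

Reads that land in "unknown" have no dx tag in the header, which would suggest the tag assumption does not hold for that file.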

P.S. - Currently duplex isn't designed to work well with amplicon data. However, since your amplicons are pretty long, it might not be an issue.

ErminZ commented 5 months ago

Thank you for your reply @tijyojwad! It's helpful to get clarity on the output reads. I have a follow-up question: does the default 'dorado duplex' run trim primers from simplex reads that are not parents of a duplex read? Our samples usually have a duplex rate of about 30%, and around 10% of the reads are duplex reads, as shown below.

[2024-02-06 19:53:48.027] [info] > Simplex reads basecalled: 3902930
[2024-02-06 19:53:48.029] [info] > Simplex reads filtered: 395275
[2024-02-06 19:53:48.029] [info] > Duplex reads basecalled: 487828
[2024-02-06 19:53:48.029] [info] > Duplex reads filtered: 2645
[2024-02-06 19:53:48.029] [info] > Duplex rate: 26.714087%
[2024-02-06 19:53:48.030] [info] > Basecalled @ Bases/s: 1.296412e+06

I have been focusing on reads with the correct PCR primer configuration to improve the precision of structural variant detection. However, I am still working out how to interpret the data below. If dorado duplex does NOT trim primers from simplex reads that are unrelated to any duplex read, can I infer that the 'single_primer1', 'single_primer2', and 'primer1_primer2' reads together (25% of total simplex reads) are of good quality? Are the 'no_primers' reads also good?

Primer configuration of simplex reads after removing 486k duplex reads:

Configuration                  Reads
other (contains >2 primers)    1499081
double_primer1                 1170891
single_primer1                  729083
single_primer2                  198589
no_primers                      192723  (5% of total simplex reads)
double_primer2                  143854
primer1_primer2                  89884
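A tally like the one above could be produced along the following lines. This is a sketch only: the category rules are an assumed reading of the labels (hits are counted for each primer in either orientation, and any read with more than two hits in total is "other"), matching is exact, and the function names are hypothetical.

```python
from collections import Counter

_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def count_hits(seq, primer):
    """Count non-overlapping exact hits of a primer or its reverse complement."""
    primer = primer.upper()
    rc = reverse_complement(primer)
    hits = seq.count(primer)
    if rc != primer:  # avoid double counting palindromic primers
        hits += seq.count(rc)
    return hits


# Assumed mapping from (primer1 hits, primer2 hits) to the labels in the table.
_LABELS = {
    (0, 0): "no_primers",
    (1, 0): "single_primer1",
    (0, 1): "single_primer2",
    (1, 1): "primer1_primer2",
    (2, 0): "double_primer1",
    (0, 2): "double_primer2",
}


def classify(seq, primer1, primer2):
    """Assign one read sequence to a primer configuration category."""
    seq = seq.upper()
    key = (count_hits(seq, primer1), count_hits(seq, primer2))
    return _LABELS.get(key, "other")


def tally(sequences, primer1, primer2):
    """Count reads per primer configuration category."""
    return Counter(classify(seq, primer1, primer2) for seq in sequences)
```

Because matching is exact, a primer copy with a basecalling error is not counted, which can shift reads toward the "no_primers" and "single" categories.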

tijyojwad commented 5 months ago

Hi @ErminZ - dorado duplex will not trim anything from the simplex reads (either primers or barcodes). So all the simplex reads should be "good" quality (I would probably also filter on the mean Q-score tag qs:i, just to be sure). Whether or not they are usable for your pipeline depends on which characteristics of the reads you're looking for, but in general I wouldn't say there's anything wrong with them.
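A filter on the qs:i tag could look roughly like this. It is a minimal sketch, assuming 4-line FASTQ records whose header lines carry the mean Q-score as an integer qs:i tag; the function names and the threshold in the usage note are illustrative, not Dorado defaults.

```python
import re

# Matches an integer qs tag that starts the header or follows whitespace.
_QS_TAG = re.compile(r"(?:^|\s)qs:i:(-?\d+)")


def mean_qscore(header):
    """Return the integer qs:i value from a FASTQ header, or None if absent."""
    match = _QS_TAG.search(header)
    return int(match.group(1)) if match else None


def filter_by_qscore(lines, min_qscore):
    """Yield 4-line FASTQ records whose qs:i tag is at least min_qscore."""
    record = []
    for line in lines:
        record.append(line.rstrip("\n"))
        if len(record) == 4:
            qscore = mean_qscore(record[0])
            # Records without a qs tag are dropped rather than guessed at.
            if qscore is not None and qscore >= min_qscore:
                yield tuple(record)
            record = []


# Example usage (file names and threshold are placeholders):
# with open("reads.fastq") as src, open("filtered.fastq", "w") as dst:
#     for record in filter_by_qscore(src, 15):
#         dst.write("\n".join(record) + "\n")
```

Since the run in this thread already used --min-qscore 10, a filter like this only changes the output when a stricter threshold is chosen.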

nanoporetech / dorado

PCR primers trimming mechanism in Dorado duplex calling #683
