hasindu2008 / f5c

Ultra-fast methylation calling and event alignment tool for nanopore sequencing data (supports CUDA acceleration)
https://hasindu2008.github.io/f5c/docs/overview
MIT License
143 stars 27 forks source link

When doing f5c call-methylation, the quality of dorado basecaller's reads was significantly lower than that of guppy basecaller's reads #158

Closed happier21 closed 4 months ago

happier21 commented 7 months ago

This is the full log of the f5c call-methylation: 1712653157837

This is dorado basecaller's order: dorado basecaller /share/home/yzwl_hanxs/app/dorado-0.5.3-linux-x64/model/dna_r10.4.1_e8.2_400bps_sup@v4.1.0 ./pod5/ | amtools view -bhS -@ 10 > test.bam Convert bam to fastq: samtools fastq -0 test.fastq test.bam Use minimap2 to align: minimap2 -a -x map-ont /share/home/yzwl_hanxs/refdata-gex-GRCh38-2020-A/fasta/genome.fa test.fastq | samtools sort -o test.sorted.bam -T test.tmp This is the log for minimap2: 1712653176862 Why does this happen Thank you, ShengquanWang

hasindu2008 commented 7 months ago

Could you see the mapq distribution of the results from minimap2 for guppy and dorado results separately?

I am trying to see if the mapqs are really low in the minimap2 result for dorado (likely) or whether f5c is not counting the mapq right.

An example bash one liner Samtools view file.bam | cut -f5 | sort | uniq -c | sort -nr -k1,1

happier21 commented 7 months ago

samtools view dorado.data.sorted.bam |cut -f 5|sort |uniq -c |sort -nr -k1,1

1712713376161

samtools view guppy.data.sorted.bam |cut -f 5|sort |uniq -c |sort -nr -k1,1

1712713501857

It seems that something went wrong when running dorado basecaller that caused mappqs to be too low

hasindu2008 commented 7 months ago

Yeh, it seems that f5c correctly reports that there are low mapq reads. If you used the same minimap2 command to map the reads, it is the basecaller that could be causing this.

Which model did you use for Guppy?

Is this 4KHz data or 5KHz data? Is the sequencing kit lsk114?

happier21 commented 7 months ago

I know the problem. My own data is r10, but the sample data I downloaded when exploring the process is r9. However, I have been using r10 model to basecalling, and I have changed to r9 model to run. It turns out to be right. Thank you very much for your help; Thank you very much for your prompt response to my questions each time