nanoporetech / dorado

Oxford Nanopore's Basecaller
https://nanoporetech.com/
Other
487 stars 59 forks source link

Missing data chrM Dorado #581

Closed Macdot3 closed 7 months ago

Macdot3 commented 8 months ago

Hi everyone! I'm using Dorado for the first time to identify the presence of methylated bases on mitochondrial DNA. After having installed and correctly used the modkit tool to create the bedMethyl file, we realized that the data referring to chrM was missing from the file. This aspect was also confirmed by the alignment, using the aligner function of dorado no reads map to the chrM, the fastq files of the same sample however give a coverage on the chrM of around 80-90x on average when normally aligned with minimap2 or ngmlr. From the original pod5 it seems that many aligned reads are lost and the chrM is essentially not covered by any sequence. Below, I'm writing the commands I've used.

dorado basecaller dna_r10.4.1_e8.2_400bps_hac@v4.1.0/ /home.../POD_5/ --modified-bases 5mCG_5hmCG --device cpu > /calls_U.bam samtools fastq -T'*' /calls_U.bam > /calls_Ufast.fastq /minimap2 -ax map-ont -y -t 10 /datasets/genomes/hg38/hg38.fa /calls_Ufast.fastq > /calls_U_alignedfast.sam samtools view -bS /calls_U_alignedfast.sam > /calls_U_alignedfast_1.bam samtools sort /calls_U_alignedfast_1.bam -o /calls_U_alignedfast_1_sorted.bam samtools index /calls_U_alignedfast_1_sorted.bam modkit pileup /calls_U_alignedfest_1_sorted.bam /calls_U_alignedfest_1_sorted.bam_pileup.bed --log-filepath pileup.log

Is anyone else working on the same project and can help me? Thank you very much.

vellamike commented 8 months ago

Would it be possible to share a POD5 so that we could reproduce this issue internally?

Macdot3 commented 8 months ago

Can I send you a personal email with the link to the files?

vellamike commented 8 months ago

Hi, yes, our Customer Services will reach out to you with my email address.

Macdot3 commented 8 months ago

Thank you so much, Mike. Currently, our research institute has been in contact with the Nanopore customer service. We have forwarded the folder containing the files to them, and it will be directed to you. I am eagerly awaiting your response. Best regards! :)

tijyojwad commented 8 months ago

Hi @Macdot3 - can you share which version of dorado you're running?

Macdot3 commented 8 months ago

v 0.4.1

tijyojwad commented 8 months ago

thank you! I tried with v0.5.2 and I was able to get the same results as minimap2. I think we may have fixed some bugs along the way. Would you be able to try the latest build from https://github.com/nanoporetech/dorado#installation ?

Macdot3 commented 7 months ago

Hi Joyjit, I just downloaded and installed version 0.5.2. Despite this, during the workflow I obtained the same result as the version I used before. The loss of aligned reads remains...

tijyojwad commented 7 months ago

Hi @Macdot3 thanks for testing that out. Here are the steps I ran on the dataset you shared with us -

$ /dorado-0.5.2/bin/dorado basecaller models/dna_r10.4.1_e8.2_400bps_hac\@v4.1.0 ~/POD5/ --modified-bases 5mCG_5hmCG -x "cuda:0"  > chrM.bam
[2024-01-23 14:16:26.433] [info] > Creating basecall pipeline
[2024-01-23 14:16:39.056] [info]  - set batch size for cuda:0 to 4608
[2024-01-23 14:17:36.590] [info] > Simplex reads basecalled: 53755
[2024-01-23 14:17:36.590] [info] > Basecalled @ Samples/s: 2.772567e+07
[2024-01-23 14:17:36.682] [info] > Finished
$ /dorado/dorado-0.5.2/bin/dorado aligner /references/human_GRCh38/grch38.fna.gz chrM.bam > chrM.aligned.bam
[2024-01-23 14:32:55.254] [info] > loading index /references/human_GRCh38/grch38.fna.gz
[2024-01-23 14:34:01.462] [info] > starting alignment
[2024-01-23 14:34:04.170] [info] > Simplex reads basecalled: 53755
[2024-01-23 14:34:04.170] [info] > finished alignment
[2024-01-23 14:34:04.170] [info] > total/primary/unmapped 54364/232/53523
$ samtools flagstats chrM.aligned.bam 
54364 + 0 in total (QC-passed reads + QC-failed reads)
53755 + 0 primary
488 + 0 secondary
121 + 0 supplementary
0 + 0 duplicates
0 + 0 primary duplicates
841 + 0 mapped (1.55% : N/A)
232 + 0 primary mapped (0.43% : N/A)
0 + 0 paired in sequencing
0 + 0 read1
0 + 0 read2
0 + 0 properly paired (N/A : N/A)
0 + 0 with itself and mate mapped
0 + 0 singletons (N/A : N/A)
0 + 0 with mate mapped to a different chr
0 + 0 with mate mapped to a different chr (mapQ>=5)

can you confirm that the issue you were seeing was no/very few primary alignments?

Macdot3 commented 7 months ago

Yes exactly, I got: 918 + 0 mapped (1.69% : N/A) with dorado aligner option and 835 + 0 mapped (1.54% : N/A) with minimap2 in both of dorado's release i used. When I run samtools idxstats I observe that for chrM I get 0 aligned reads and very few other reads on the rest of the genome.

tijyojwad commented 7 months ago

Thanks for confirming! We'll take a closer look. Since both minimap2 and dorado aligner are giving low mapped alignments, and I'm getting a similar result as you with GPU basecalling, it may have something to do with the raw data itself.

tijyojwad commented 7 months ago

This aspect was also confirmed by the alignment, using the aligner function of dorado no reads map to the chrM, the fastq files of the same sample however give a coverage on the chrM of around 80-90x on average when normally aligned with minimap2 or ngmlr.

Can you help me understand this statement? Where did the fastq files of the same sample come from? Were they also basecalled with dorado?

Macdot3 commented 7 months ago

Regarding the statement you highlighted we started from fastq files generated by MinKnow software (as the fast5) and we aligned those fatsq files with the following minimap2 command: minimap2 -a -x splice /genomes/hg38.fa sample.fastq.gz > sample_hg38.sam. (we shared the fastq files with the italian Nanoporetech team) If you need other info we are happy to share them.

tijyojwad commented 7 months ago

Thanks! I'll follow up with the team internally. I want to compare the original fastq file you got from MinKnow vs the reads being basecalled from the pod5 you shared. Sounds like the issue lies there (and not in alignment necessarily)

tijyojwad commented 7 months ago

Hi @Macdot3 - I received the original fastq and fast5 files as well. A few observations -

  1. I compared the read ids between the fast5 and the original set of pod5s you had shared. None of the read ids match. I thought they were supposed to be the same dataset?

  2. I basecalled the fast5 with dorado v0.5.2 with the latest hac@v4.3.0 model and 5mCG_5hmCG mods and hg38 reference

    $ dorado basecaller models/dna_r10.4.1_e8.2_400bps_hac\@v4.3.0 /fast5/ --modified-bases 5mCG_5hmCG -x "cuda:0" references/gca_000001405.15_grch38_no_alt_analysis_set/GCA_000001405.15_GRCh38_no_alt_analysis_set.fasta.gz > calls.bam
    [2024-01-26 17:39:20.213] [info] > Creating basecall pipeline
    [2024-01-26 17:39:32.156] [info]  - set batch size for cuda:0 to 5760
    [2024-01-26 17:41:13.365] [info] > Simplex reads basecalled: 69809
    [2024-01-26 17:41:13.365] [info] > Simplex reads filtered: 3
    [2024-01-26 17:41:13.365] [info] > Basecalled @ Samples/s: 5.724425e+07
    [2024-01-26 17:41:14.590] [info] > Finished

    and flagstats reports this

    $ samtools flagstats calls.bam
    90688 + 0 in total (QC-passed reads + QC-failed reads)
    70928 + 0 primary
    10411 + 0 secondary
    9349 + 0 supplementary
    0 + 0 duplicates
    0 + 0 primary duplicates
    90550 + 0 mapped (99.85% : N/A)
    70790 + 0 primary mapped (99.81% : N/A)
    0 + 0 paired in sequencing
    0 + 0 read1
    0 + 0 read2
    0 + 0 properly paired (N/A : N/A)
    0 + 0 with itself and mate mapped
    0 + 0 singletons (N/A : N/A)
    0 + 0 with mate mapped to a different chr
    0 + 0 with mate mapped to a different chr (mapQ>=5)
  3. I merged the original fastqs from you and aligned against the same reference (using dorado aligner) and found

    $ samtools flagstats orig-fastq-aligned.bam 
    97056 + 0 in total (QC-passed reads + QC-failed reads)
    76008 + 0 primary
    11209 + 0 secondary
    9839 + 0 supplementary
    0 + 0 duplicates
    0 + 0 primary duplicates
    96794 + 0 mapped (99.73% : N/A)
    75746 + 0 primary mapped (99.66% : N/A)
    0 + 0 paired in sequencing
    0 + 0 read1
    0 + 0 read2
    0 + 0 properly paired (N/A : N/A)
    0 + 0 with itself and mate mapped
    0 + 0 singletons (N/A : N/A)
    0 + 0 with mate mapped to a different chr
    0 + 0 with mate mapped to a different chr (mapQ>=5)

So since

  1. the original fast5s basecall properly and match the mapping results you see with the fastq (99.85% mapped vs 99.73%)
  2. the original fast5 and the POD5 dataset you shared don't have any overlapping read ids, so they come from different sources.

I don't think the basecaller or aligner software is the issue. Your samples seem quite different - could it be a sample prep issue/dataset mixup or something else upstream of basecalling?

Macdot3 commented 7 months ago

Hi Joyjit, thank you so much for your feedback. By checking upstream during the sequencing process, I was provided with data that did not correspond to the desired fast5 and this created problems during subsequent conversions to pod5 files. In fact I tested the pipeline with another independent set of data and everything works with the new version. I also got an alignment percentage of 99.89%. I will continue to use this one. I also take this opportunity to ask for advice on which tool in your opinion is more complete for analyzing the bed file that I get from modkit. Which one between Loreme, pycoMeth, NanoMethViz? Thank you very much for your support and I will not hesitate to contact you again if there are any problems.

tijyojwad commented 7 months ago

Hi @Macdot3 glad to hear it got resolved!

which tool in your opinion is more complete for analyzing the bed file that I get from modkit. Which one between Loreme, pycoMeth, NanoMethViz?

Unfortunately I don't have any suggestions on that front since I don't work actively with bed files. I checked with the mod kit repo authors and they don't have any specific recommendations either

ArtRand commented 7 months ago

Hello @Macdot3,

I'm afraid I don't have a recommendation for one visualization package over another (I personally like the plots from methylartist). I do have a few bits of advice however.

  1. For visualizing the reads on IGV, you can use modkit call-mods, this command will filter out the low-confidence modification calls and change the passing ones into binary "calls" which can make them easier to see.

  2. You're probably aware of this already but, you can convert the bedMethyl into a bedGraph which is a little more natural to look at on IGV.

  3. For any of these tools, I recommend pre-processing the data with (1) first and/or making sure the reads are base-called with --modified-bases-threshold 0.0. Not all programs will correctly handle "implicit canonical bases" (sam spec, see section 1.7) and the results may be misleading. A "hack" is to run modkit adjust-mods --edge-filter 0..

I'm actively looking for better ways to help visualization/immediate secondary analysis - so feel free to open an issue on the modkit repo if you run into any problems. Good luck!