Using Dorado for mycobacterial Data.

AzlanNI commented 6 months ago

Hello everyone!

We were using Tombo for our Methylation Analysis of bacterial DNA data. Since Tombo is deprecated we changed to dorado. I now analyzed some samples and got some interesting results. The C Positions were detected quite good but the methylation was not as good i would say as in Tombo.

So i got 202.219 C Positions (forward and reverse) all of these Positions were found by dorado and have a methylation value. Out of these 202.219 Positions 85.324 have a Meth Score of 0. So 0 Reads were detected with a methylated Base at these positions. This made me suspicious since it is almost 50% of my Sites. So i looked it up in old Tombo Results and the .wig files. In the Tombo Results i had only 900 Sites with a methylation Score of 0. which was like 0.46% of all Sites detected. After that i tried to understand why i am seeing this i looked at Sites where the Mod_Pos was 0 and the filtered were != 0 and this was about 42.830 Sites. So is it possible that the Sites just had a an ambiguous methylation state with dorado and Tombo called the modification better ? I will also print the commands i am using maybe Tombo had a different threshold and dorado is stricter ? If yes could u tell me what the threshold should be since i am running everything on default but don't want 50% of my Sites to have a 0 methylation cause of thresholds. Also i am using modbam2bed to generate a .bed file from the .bam file maybe it is the threshold in there ? I also checked the .wig Results from Tombo if i look at Sites where the number of methylated Reads is < 0.1 % i get the same Result as with Dorado as in 50% of my Sites have a 0% methylation. I hope someone can help me out! Commands: Tombo 1.5: tombo detect_modifications alternative_model \ --fast5-basedirs /gpfs/project/dilthey/projects/HematoDETECT/MethylationCfDNA/MycoData/Nanoporedata/2539/ \ --statistics-file-basename alternative_testing_2539 \ --alternate-bases dam dcm 5mC 6mA \ --processes 8

Dorado 0.5.1: PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:25 dorado basecaller --modified-bases-models /software/dorado/0.5.1/models/dna_r10.4.1_e8.2_400bps_sup@v4.2.0_5mC@v2,/software/dorado/0.5.1/models/dna_r10.4.1_e8.2_400bps_sup@v4.2.0_6mA@v3 /software/dorado/0.5.1/models/dna_r10.4.1_e8.2_400bps_sup@v4.2.0 /gpfs/project/azlan/Myco_R10/barcode04_Pods --reference /gpfs/project/azlan/Myco_R10/ref/SP2565_R10IIPC_n.fasta> SP2565_Aligned.bam

Modbam2bed: for i in 5mC 6mA do ./modbam2bed -e -m $i -t 5 \ SP2565_R10IIPC_n.fasta \ /gpfs/project/azlan/Myco_R10/SP2565_pod5/Output_pods/SP2565_sorted_aligned.bam \

/gpfs/project/projects/DiltheyLab/MethylationCfDNA/MycoR10/MycoR10/SP2565${i}.bed done

kind regards, Azlan

vellamike commented 6 months ago

Hi @AzlanNI,

We are discussing this internally and will reply to you in detail soon.

Mike

AzlanNI commented 6 months ago

@vellamike Thank u very much!

marcus1487 commented 6 months ago

There are a number of potential different issues here. I would start out by stating that Tombo and Dorado work in very different ways, so drastically different results would not be unexpected. Tombo is a more research based approach here using per-base metrics derived from the signal. Because of this, the full information from the signal is lost at the time of modeling. This was one of the main reasons to move to neural network based modified base calling using Remora models within the Dorado implementation. For this reason I would expect that the Dorado based modified base calls would be of much higher accuracy overall.

This being said there are a number of issues that may be contributing to the observations here. In the Tombo command you show that you are using the dam and dcm model. This will only make calls at the dam (GATC) and dcm (CCWGG) motifs. So you may be observing the simple difference between making modified base calls at specific sequence motifs within Tombo versus all-contexts in Dorado. If you constrict your calls to the specified motifs (using our modkit tool) you may find that the results are quite similar. I would highly advise that you upgrade from modbam2bed to modkit.

As a related issue, using 0% methylation as the metric for the "number of modified sites" is not advised. We strive to make the modified base models as accurate as possible, but the models do have an error rate. Thus given sufficient depth I would expect that every position would show some level of called methylation. We have discussed the modified base model error rates in several previous modified base presentations over the past several years. I would advise reviewing these for our observed error rates on ground truth strands. Setting the level of methylation relevant becomes a question about your larger biological question as well as the observed error rate.

I hope this gives you some threads to pull on to diagnose your issues, but please post if you have any further questions.

AzlanNI commented 6 months ago

@marcus1487 first of all thanks for the reply! Yeah i use dcm and dam calling but also 5mC and 6mA over the whole reference genome. For Sites where Tombo had a methylation value of >0 dorado shows me a 0 value. But maybe it is connected to modbam2bed i will try out Modkit. Another question would be if merging bam files results in an issue ? Since i am merging the .bam files into one merged file with SamTools 1.11 have u heared of the issue that SamTools versions <1.14 cant merge the MM and ML tag properly ?

AzlanNI commented 6 months ago

but my commands look correct right ? cause i cant really understand why there is such a difference and so many sites without any methylation its like 40% higher then in Tombo

ArtRand commented 5 months ago

Hello @AzlanNI,

As @marcus1487 mentioned, the two modeling approaches can lead to different results. Could you tell me how you're tabulating the number of positions "with a methylation value"? Is it counting the number of rows in the bedMethyl that have a percent modification >0? You may candider modifying your dorado command like so:

PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:25 dorado basecaller \
  --modified-bases-models /software/dorado/0.5.1/models/dna_r10.4.1_e8.2_400bps_sup@v4.2.0_5mC@v2,/software/dorado/0.5.1/models/dna_r10.4.1_e8.2_400bps_sup@v4.2.0_6mA@v3 \
  /software/dorado/0.5.1/models/dna_r10.4.1_e8.2_400bps_sup@v4.2.0 \
  /gpfs/project/azlan/Myco_R10/barcode04_Pods \
  --reference /gpfs/project/azlan/Myco_R10/ref/SP2565_R10IIPC_n.fasta\
  --modified-bases-threshold 0.0 \  # <-- Add this 
  > SP2565_Aligned.bam

the --modified-bases-threshold setting to 0.0 will make dorado output a probability for each position (by default it will zero-out all modification probabilities <0.05). In general the method for counting the number of modified positions with modkit is like so:

$ modkit pileup ${modbam} ${bedmethyl_pileup} \
  --motif GATC 1 \
  --motif CCWGG 1 \
  --log-filepath modkit_pileup.log

then count the number of rows with a percent_modified greater than a threshold value

min_cov=5
modified_threshold=50

# calculate the number of positions with 6mA-modification
$ awk -v min_cov=$min_cov -v mod_thresh=$modified_threshold \
  '{if (($4=="a") && ($10>=min_cov) && ($11>mod_thresh)) {i+=1}} END{print i}' \
  ${bedmethyl_pileup}

# calculate the number of positions with 5mC-modification
$ awk -v min_cov=$min_cov -v mod_thresh=$modified_threshold \
  '{if (($4=="m") && ($10>=min_cov) && ($11>mod_thresh)) {i+=1}} END{print i}' \
  ${bedmethyl_pileup}

What I'm doing here is producing a bedMethyl pileup where the pass-threshold will be determined on the fly, and specifying that the table should only look at two motifs. Then I'm count the number of positions with modification percent greater than 50% and at least 5 reads of pass coverage.

To get the total number of sites for each motif, by the way, you can use modkit as well:

$ modkit motif-bed ${ref} GATC 1 | wc -l 
$ modkit motif-bed ${ref} CCWGG 1 | wc -l

Lastly, I have not heard about Samtools being unable to merge BAM files with MM/ML tags (assuming you mean samtools merge).

Hope this helps.

AzlanNI commented 5 months ago

Hey @ArtRand! First of all Thanks for ur detailed response! Yes so basically i generate a Bam file with dorado and then turn it into a Bed file with Modkit. After that i have some Motif driven analysis for my Data. Before i do the Motif driven analysis i look into the Bed file and the genome to check how many Cs or As i have in my genome and then i check for how many of them i have a Me value. What puzzled me was the difference between dorado and Tombo. That this many Sites had a value of 0. I am currently analyzing a new set of Data and will do the same analysis for it. And give feedback after that i could also try the --modified-bases-threshold 0.0. Since it could be that the Positions just have a really low Me percentage which are being zero-out. I also saw some Major differences between specific Motifs where i had for the same Data a Me percentage of 30% with Tombo and one of 7% with dorado. But i understood that dorado is more up2date and more accurate. With Modkit i have also Threshold which are given out in command prompt which are >0.9 for 5mC and 6mA. Would u suggest to change there the Thresholds as well ?

Thanks again! Kind regards, Azlan

ArtRand commented 5 months ago

@AzlanNI

Would u suggest to change there the Thresholds as well ?

We recommend letting modkit estimate the pass threshold automatically. If you are still confused by the results make sure to report the estimated threshold from the log file (with --log-filepath ${log_file}) produced by modkit.

You may be interested in the motif detection algorithm in modkit as well.

vellamike commented 5 months ago

Closing this issue as the main question has been answered.

nanoporetech / dorado

Using Dorado for mycobacterial Data. #762