Closed BioRB closed 1 month ago
Hello @BioRB ,
Which methylation model are you using (e.g. 5mCG, 5hmCG, 6mA)?
dear developers, First I used Dorado + modbam2bed and I got those not realistic results. Now I tried with Dorado + modkit and I have better results. The problem is that my results using Dorado + modbam2bad overlap with the results obtained using nanopolish or mcaller (used for different methylations). Thus now I don't know what I have to trust, even if dorado+modkit produces more "realistic " results. I would like to believe that the latest tools (dorado+modkit) are better and give more reliable results .... but in general, we are a bit skeptical because of the huge differences in output using all these different tools. I used an old chemistry flow cell (9.4). I would like your point on it. thanks for your kind help! best rb
PS I used different models for dorado without any improvements
I have an update. I did a comparison between dorado + modkit and nanopolish for 5mC and I got comparable results. the biggest difference I think is in the threshold to consider to filter the most relevant data. From the really first output from the 2 pipes we have different numbers of sites identified even if the sites overlap (more or less). Thus I can conclude that these new tools ar ok. The only thing I would suggest is to avoid modbam2bed and going directly with modkit. Thanks for the support. best RB
After a new attempt, I can confirm that Dorado ( like the tools ) produces methylation signals in PCR-amplified samples. The amount of methylation detected is not compatible with background noise that can be taken into account usually. I want your comment on it. Moreover, could you please provide me a dataset of NOT methylated samples to be tested, to reproduce this bias?
Hello @BioRB.
Did you use the --modified-base option of Dorado? Or just a simplex basecalling ?
I also have the same problem with dorado+modkit. I get modification calles in PCR products like in PCR-free samples. I used the 10.4.1 chemistry and basecalled the dat in simplex mode without modified-base option.
Hello @BioRB and @melachl,
Sorry for being slow to circle back. Could you tell me (again) which model(s) you're using and on what samples? We've tested the models on PCR samples internally and find the false positive rate to be quite low, but not zero. Happy to help.
A
Thank you for your reply! I used dorado basecaller hac,6mA pod5s/ > calls.bam (so this model was used: dna_r10.4.1_e8.2_400bps_hac@v4.3.0). I looked at mitochondrial DNA from isolated mitochondria of patient skin fibroblasts. I compared sequencing data of sequenced mitochondrial DNA (without PCR amplification) linearized by an Cas9 approach with amplified mitochondrial DNA by long range PCR. The called modifications were nearly identical in both samples.
@melachl,
If you run modkit summary $modbam --no-sampling --region chrM
on your modBAM, what level of 6mA is predicted in the two samples? My guess is the level of 6mA is within the error of the model. Also the dna_r10.4.1_e8.2_400bps_sup@v4.3.0_6mA@v2
6mA model (coupled with the SUP basecaller) has the lowest false positive rate. Maybe take the reads aligned to the mitochondrial genome and re-basecall them with this model?
Sorry for the delay. Thank you for your suggestions. I ran modkit summary $modbam --no-sampling --region chrM and as you can see below (2 different samples for native mtDNA and PCR amplified DNA) in the case of methylation on C their is a nice difference between PCR product and native mtDNA but in the case of methylation on A their is no difference. Based on that I do not think that it is within the error of the model.
Native mitochondrial DNA:
modkit summary sorted_mod_SZG122_MitoIso.bam --no-sampling –region chrM not subsampling, using all reads calculating threshold at 10% percentile calculated thresholds: C: 0.50390625 A: 0.6503906 bases C,A total_reads_used 91657 count_reads_C 88984 count_reads_A 91586 pass_threshold_C 0.50390625 pass_threshold_A 0.6503906 base code pass_count pass_frac all_count all_frac C h 682549 0.08602001 953151 0.10815052 C - 5750072 0.7246678 6081571 0.6900534 C m 1502149 0.18931223 1778467 0.20179608 A a 6083958 0.08939469 9350814 0.123836435 A - 61973309 0.9106053 66158576 0.87616354
Native mitochondrial DNA:
modkit summary sorted_SZG146_Barcode17.bam --no-sampling –region chrM not subsampling, using all reads calculating threshold at 10% percentile calculated thresholds: A: 0.6503906 C: 0.7832031 bases A,C total_reads_used 65644 count_reads_A 65644 count_reads_C 65341 pass_threshold_C 0.7832031 pass_threshold_A 0.6503906 base code pass_count pass_frac all_count all_frac C m 1679601 0.39676115 1913738 0.40697646 C h 57661 0.01362088 174421 0.037092455 C - 2496018 0.58961797 2614172 0.5559311 A - 49391795 0.90978444 52765065 0.8764003 A a 4897766 0.090215616 7441514 0.12359968
PCR amplified mitochondrial DNA:
modkit summary sorted_FAX76760_NB17_Blut_SZG5091.bam --no-sampling --region chrM not subsampling, using all reads calculating threshold at 10% percentile calculated thresholds: A: 0.6738281 C: 0.7265625 bases A,C total_reads_used 21985 count_reads_C 21980 count_reads_A 21985 pass_threshold_A 0.6738281 pass_threshold_C 0.7265625 region chrM:0-16569 base code pass_count pass_frac all_count all_frac C m 13152 0.005394336 35567 0.0131578315 C - 2402879 0.9855487 2595382 0.9601484 C h 22082 0.009057004 72156 0.026693746 A a 1953936 0.073953964 3172285 0.10814216 A - 24467040 0.926046 26162112 0.89185786
PCR amplified mitochondrial DNA:
modkit summary sorted_FAX76760_NB19_SZG5231.bam --no-sampling --region chrM not subsampling, using all reads calculating threshold at 10% percentile calculated thresholds: C: 0.7265625 A: 0.6699219 bases C,A total_reads_used 8512 count_reads_A 8512 count_reads_C 8508 pass_threshold_C 0.7265625 pass_threshold_A 0.6699219 region chrM:0-16569 base code pass_count pass_frac all_count all_frac C - 1188639 0.98487115 1284217 0.95880663 C h 11384 0.009432446 36383 0.027163837 C m 6875 0.005696422 18791 0.01402951 A a 1016427 0.0768435 1629063 0.110955946 A - 12210807 0.9231565 13053007 0.88904405
Hello @melachl,
Thanks for these numbers. I'm going to run some tests internally and compare. Just to confirm, you're using the SUP basecaller model and dna_r10.4.1_e8.2_400bps_sup@v4.3.0_6mA@v2
6mA model, correct?
No I used hac (dna_r10.4.1_e8.2_400bps_hac@v4.3.0) because for sup I only got a few reads per barcode and I don't know why. But in the case for methylation at C there is a clear difference between PCR amplicons and native DNA when using hac so I thought the accuracy should also be enough for 6mA.
Hello @melachl,
Sorry for the delay. I ran a test on some PCR and native DNA. I used summary
to get an idea of the changes in 6mA in PCR versus native reads:
The command template is:
$ modkit summary ${merged_sorted_bam} --no-sampling --only-mapped
PCR dna_r10.4.1_e8.2_400bps_sup@v4.3.0 + dna_r10.4.1_e8.2_400bps_sup@v4.3.0_6mA@v2
# bases C,A
# total_reads_used 793
# count_reads_A 793
# count_reads_C 792
# pass_threshold_A 0.7871094
# pass_threshold_C 0.76171875
base code pass_count pass_frac all_count all_frac
A - 494323 0.9833575 534270 0.95895624
A a 8366 0.016642496 22867 0.041043766
C - 45847 0.99598104 50235 0.9852511
C h 101 0.002194126 500 0.009806422
C m 84 0.0018248175 252 0.0049424362
NATIVE dna_r10.4.1_e8.2_400bps_sup@v4.3.0 + dna_r10.4.1_e8.2_400bps_sup@v4.3.0_6mA@v2
# bases C,A
# total_reads_used 362
# count_reads_A 362
# count_reads_C 361
# pass_threshold_C 0.6386719
# pass_threshold_A 0.6347656
base code pass_count pass_frac all_count all_frac
C h 773 0.026936613 1538 0.04827067
C - 25443 0.88660836 27093 0.85032326
C m 2481 0.08645503 3231 0.10140606
A a 41499 0.08555031 63401 0.117845945
A - 443584 0.9144497 474598 0.88215405
Using the HAC basecalls and the HAC v2 6mA model:
PCR dna_r10.4.1_e8.2_400bps_hac@v4.3.0 + dna_r10.4.1_e8.2_400bps_hac@v4.3.0_6mA@v2
# bases C,A
# total_reads_used 788
# count_reads_A 788
# count_reads_C 785
# pass_threshold_A 0.7285156
# pass_threshold_C 0.921875
base code pass_count pass_frac all_count all_frac
C - 45531 0.9977211 49702 0.98083794
C h 68 0.0014900843 594 0.011722219
C m 36 0.0007888682 377 0.0074398597
A a 17793 0.03574478 36288 0.065662764
A - 479986 0.9642552 516354 0.93433726
NATIVE dna_r10.4.1_e8.2_400bps_hac@v4.3.0 + dna_r10.4.1_e8.2_400bps_hac@v4.3.0_6mA@v2
# bases A,C
# total_reads_used 349
# count_reads_C 346
# count_reads_A 349
# pass_threshold_C 0.6779299
# pass_threshold_A 0.6191406
base code pass_count pass_frac all_count all_frac
A a 51301 0.11001338 72607 0.14067034
A - 415015 0.88998663 443543 0.85932964
C m 2414 0.071928725 3387 0.090826206
C h 1071 0.03191204 2110 0.05658202
C - 30076 0.89615923 31794 0.85259175
tl;dr
6mA@v2
6mA model.Okey! Thank you very much!
I will try that out!
It is a little bit strange for the basecalling with sup because it is multiplexed that with a few houndred reads per barcode but after basecalling with sup there are only 7 reads per barcode left....
Hi @melachl can you share details of how you demultiplexed the data?
I did it using this command: $ dorado demux --kit-name
thanks!
can you try to run the basecaller
command with --no-trim
and then demux?
alternatively, you can pass the --kit-name
to the dorado basecaller
cmd to classify in line with basecalling and then use dorado demux --no-classify --ouptut-dir
to split into subdirectories?
Hello, we are running dorado on 2 samples. One native DNA vs the PCR amplified one that "in theory" should not hold the methylation. Why do we get a methylation signal in the PCR sample? The methylation signal is even higher in the PCR compared to the native DNA sample. We have got the same result also with other tools thus my suspect is that there is an issue with the wet part of the experiment. Then I would like to hear your thoughts before drawing my conclusions. Thanks. best, RB