methylation in PCR samples (not methylated)

BioRB commented 1 year ago

Hello, we are running dorado on 2 samples. One native DNA vs the PCR amplified one that "in theory" should not hold the methylation. Why do we get a methylation signal in the PCR sample? The methylation signal is even higher in the PCR compared to the native DNA sample. We have got the same result also with other tools thus my suspect is that there is an issue with the wet part of the experiment. Then I would like to hear your thoughts before drawing my conclusions. Thanks. best, RB

ArtRand commented 1 year ago

Hello @BioRB ,

Which methylation model are you using (e.g. 5mCG, 5hmCG, 6mA)?

BioRB commented 1 year ago

dear developers, First I used Dorado + modbam2bed and I got those not realistic results. Now I tried with Dorado + modkit and I have better results. The problem is that my results using Dorado + modbam2bad overlap with the results obtained using nanopolish or mcaller (used for different methylations). Thus now I don't know what I have to trust, even if dorado+modkit produces more "realistic " results. I would like to believe that the latest tools (dorado+modkit) are better and give more reliable results .... but in general, we are a bit skeptical because of the huge differences in output using all these different tools. I used an old chemistry flow cell (9.4). I would like your point on it. thanks for your kind help! best rb

PS I used different models for dorado without any improvements

BioRB commented 1 year ago

I have an update. I did a comparison between dorado + modkit and nanopolish for 5mC and I got comparable results. the biggest difference I think is in the threshold to consider to filter the most relevant data. From the really first output from the 2 pipes we have different numbers of sites identified even if the sites overlap (more or less). Thus I can conclude that these new tools ar ok. The only thing I would suggest is to avoid modbam2bed and going directly with modkit. Thanks for the support. best RB

BioRB commented 1 year ago

After a new attempt, I can confirm that Dorado ( like the tools ) produces methylation signals in PCR-amplified samples. The amount of methylation detected is not compatible with background noise that can be taken into account usually. I want your comment on it. Moreover, could you please provide me a dataset of NOT methylated samples to be tested, to reproduce this bias?

Toinax commented 1 year ago

Hello @BioRB.

Did you use the --modified-base option of Dorado? Or just a simplex basecalling ?

melachl commented 9 months ago

I also have the same problem with dorado+modkit. I get modification calles in PCR products like in PCR-free samples. I used the 10.4.1 chemistry and basecalled the dat in simplex mode without modified-base option.

ArtRand commented 9 months ago

Hello @BioRB and @melachl,

Sorry for being slow to circle back. Could you tell me (again) which model(s) you're using and on what samples? We've tested the models on PCR samples internally and find the false positive rate to be quite low, but not zero. Happy to help.

A

melachl commented 9 months ago

Thank you for your reply! I used dorado basecaller hac,6mA pod5s/ > calls.bam (so this model was used: dna_r10.4.1_e8.2_400bps_hac@v4.3.0). I looked at mitochondrial DNA from isolated mitochondria of patient skin fibroblasts. I compared sequencing data of sequenced mitochondrial DNA (without PCR amplification) linearized by an Cas9 approach with amplified mitochondrial DNA by long range PCR. The called modifications were nearly identical in both samples.

ArtRand commented 9 months ago

@melachl,

If you run modkit summary $modbam --no-sampling --region chrM on your modBAM, what level of 6mA is predicted in the two samples? My guess is the level of 6mA is within the error of the model. Also the dna_r10.4.1_e8.2_400bps_sup@v4.3.0_6mA@v2 6mA model (coupled with the SUP basecaller) has the lowest false positive rate. Maybe take the reads aligned to the mitochondrial genome and re-basecall them with this model?

melachl commented 8 months ago

Sorry for the delay. Thank you for your suggestions. I ran modkit summary $modbam --no-sampling --region chrM and as you can see below (2 different samples for native mtDNA and PCR amplified DNA) in the case of methylation on C their is a nice difference between PCR product and native mtDNA but in the case of methylation on A their is no difference. Based on that I do not think that it is within the error of the model.

Native mitochondrial DNA:

modkit summary sorted_mod_SZG122_MitoIso.bam --no-sampling –region chrM not subsampling, using all reads calculating threshold at 10% percentile calculated thresholds: C: 0.50390625 A: 0.6503906 bases C,A total_reads_used 91657 count_reads_C 88984 count_reads_A 91586 pass_threshold_C 0.50390625 pass_threshold_A 0.6503906 base code pass_count pass_frac all_count all_frac C h 682549 0.08602001 953151 0.10815052 C - 5750072 0.7246678 6081571 0.6900534 C m 1502149 0.18931223 1778467 0.20179608 A a 6083958 0.08939469 9350814 0.123836435 A - 61973309 0.9106053 66158576 0.87616354

Native mitochondrial DNA:

modkit summary sorted_SZG146_Barcode17.bam --no-sampling –region chrM not subsampling, using all reads calculating threshold at 10% percentile calculated thresholds: A: 0.6503906 C: 0.7832031 bases A,C total_reads_used 65644 count_reads_A 65644 count_reads_C 65341 pass_threshold_C 0.7832031 pass_threshold_A 0.6503906 base code pass_count pass_frac all_count all_frac C m 1679601 0.39676115 1913738 0.40697646 C h 57661 0.01362088 174421 0.037092455 C - 2496018 0.58961797 2614172 0.5559311 A - 49391795 0.90978444 52765065 0.8764003 A a 4897766 0.090215616 7441514 0.12359968

PCR amplified mitochondrial DNA:

modkit summary sorted_FAX76760_NB17_Blut_SZG5091.bam --no-sampling --region chrM not subsampling, using all reads calculating threshold at 10% percentile calculated thresholds: A: 0.6738281 C: 0.7265625 bases A,C total_reads_used 21985 count_reads_C 21980 count_reads_A 21985 pass_threshold_A 0.6738281 pass_threshold_C 0.7265625 region chrM:0-16569 base code pass_count pass_frac all_count all_frac C m 13152 0.005394336 35567 0.0131578315 C - 2402879 0.9855487 2595382 0.9601484 C h 22082 0.009057004 72156 0.026693746 A a 1953936 0.073953964 3172285 0.10814216 A - 24467040 0.926046 26162112 0.89185786

PCR amplified mitochondrial DNA:

modkit summary sorted_FAX76760_NB19_SZG5231.bam --no-sampling --region chrM not subsampling, using all reads calculating threshold at 10% percentile calculated thresholds: C: 0.7265625 A: 0.6699219 bases C,A total_reads_used 8512 count_reads_A 8512 count_reads_C 8508 pass_threshold_C 0.7265625 pass_threshold_A 0.6699219 region chrM:0-16569 base code pass_count pass_frac all_count all_frac C - 1188639 0.98487115 1284217 0.95880663 C h 11384 0.009432446 36383 0.027163837 C m 6875 0.005696422 18791 0.01402951 A a 1016427 0.0768435 1629063 0.110955946 A - 12210807 0.9231565 13053007 0.88904405

ArtRand commented 8 months ago

Hello @melachl,

Thanks for these numbers. I'm going to run some tests internally and compare. Just to confirm, you're using the SUP basecaller model and dna_r10.4.1_e8.2_400bps_sup@v4.3.0_6mA@v2 6mA model, correct?

melachl commented 8 months ago

No I used hac (dna_r10.4.1_e8.2_400bps_hac@v4.3.0) because for sup I only got a few reads per barcode and I don't know why. But in the case for methylation at C there is a clear difference between PCR amplicons and native DNA when using hac so I thought the accuracy should also be enough for 6mA.

ArtRand commented 8 months ago

Hello @melachl,

Sorry for the delay. I ran a test on some PCR and native DNA. I used summary to get an idea of the changes in 6mA in PCR versus native reads:

The command template is:

$ modkit summary ${merged_sorted_bam} --no-sampling --only-mapped

PCR dna_r10.4.1_e8.2_400bps_sup@v4.3.0 + dna_r10.4.1_e8.2_400bps_sup@v4.3.0_6mA@v2

# bases             C,A 
# total_reads_used  793 
# count_reads_A     793 
# count_reads_C     792 
# pass_threshold_A  0.7871094 
# pass_threshold_C  0.76171875 
base  code  pass_count  pass_frac     all_count  all_frac 
A     -     494323      0.9833575     534270     0.95895624 
A     a     8366        0.016642496   22867      0.041043766 
C     -     45847       0.99598104    50235      0.9852511 
C     h     101         0.002194126   500        0.009806422 
C     m     84          0.0018248175  252        0.0049424362

NATIVE dna_r10.4.1_e8.2_400bps_sup@v4.3.0 + dna_r10.4.1_e8.2_400bps_sup@v4.3.0_6mA@v2

# bases             C,A 
# total_reads_used  362 
# count_reads_A     362 
# count_reads_C     361 
# pass_threshold_C  0.6386719 
# pass_threshold_A  0.6347656 
base  code  pass_count  pass_frac    all_count  all_frac 
C     h     773         0.026936613  1538       0.04827067 
C     -     25443       0.88660836   27093      0.85032326 
C     m     2481        0.08645503   3231       0.10140606 
A     a     41499       0.08555031   63401      0.117845945 
A     -     443584      0.9144497    474598     0.88215405

Using the HAC basecalls and the HAC v2 6mA model:

PCR dna_r10.4.1_e8.2_400bps_hac@v4.3.0 + dna_r10.4.1_e8.2_400bps_hac@v4.3.0_6mA@v2

# bases             C,A 
# total_reads_used  788 
# count_reads_A     788 
# count_reads_C     785 
# pass_threshold_A  0.7285156 
# pass_threshold_C  0.921875 
base  code  pass_count  pass_frac     all_count  all_frac 
C     -     45531       0.9977211     49702      0.98083794 
C     h     68          0.0014900843  594        0.011722219 
C     m     36          0.0007888682  377        0.0074398597 
A     a     17793       0.03574478    36288      0.065662764 
A     -     479986      0.9642552     516354     0.93433726

NATIVE dna_r10.4.1_e8.2_400bps_hac@v4.3.0 + dna_r10.4.1_e8.2_400bps_hac@v4.3.0_6mA@v2

# bases             A,C 
# total_reads_used  349 
# count_reads_C     346 
# count_reads_A     349 
# pass_threshold_C  0.6779299 
# pass_threshold_A  0.6191406 
base  code  pass_count  pass_frac    all_count  all_frac 
A     a     51301       0.11001338   72607      0.14067034 
A     -     415015      0.88998663   443543     0.85932964 
C     m     2414        0.071928725  3387       0.090826206 
C     h     1071        0.03191204   2110       0.05658202 
C     -     30076       0.89615923   31794      0.85259175

tl;dr

Use the 6mA@v2 6mA model.
I'm seeing about an 8% change in PCR vs native samples in the highest performance (SUP) combination.
There is a ~8% FPR with SUP basecalls and 11% FPR with HAC, we expect about 4% FPR with SUP so I'm investigating this further.
You should see a larger difference between PCR and native mtDNA than you're seeing. I'm also concerned that you're not able to use the SUP basecaller model, so you might want to dig into what is going on there.

melachl commented 8 months ago

Okey! Thank you very much!

I will try that out!

It is a little bit strange for the basecalling with sup because it is multiplexed that with a few houndred reads per barcode but after basecalling with sup there are only 7 reads per barcode left....

tijyojwad commented 8 months ago

Hi @melachl can you share details of how you demultiplexed the data?

melachl commented 7 months ago

I did it using this command: $ dorado demux --kit-name --output-dir

tijyojwad commented 7 months ago

thanks!

can you try to run the basecaller command with --no-trim and then demux?

alternatively, you can pass the --kit-name to the dorado basecaller cmd to classify in line with basecalling and then use dorado demux --no-classify --ouptut-dir to split into subdirectories?

nanoporetech / dorado

methylation in PCR samples (not methylated) #391