nanoporetech / modkit

A bioinformatics tool for working with modified bases
https://nanoporetech.com/
Other
121 stars 7 forks source link

Much lower pass rate of 5hmC calls than 5mC calls #222

Open hassebossenbroek opened 3 weeks ago

hassebossenbroek commented 3 weeks ago

Hi,

I've run modkit summary on some modified bam files (5mC and 5hmC calling). Base calling was done using Dorado 0.7.0 using model dna_r10.4.1_e8.2_400bps_sup@v4.3.0_5mC_5hmC@v1. I've used modkit summary with default settings, but I've noticed that far fewer of the 5hmC calls pass the threshold compared to the 5mC calls. I have four samples with very similar values (modkit summary output from a representative one below). On average, 82% of 5mC calls pass, whereas only 36% of 5hmC calls pass. That discrepancy seems very large to me, but this is my first methylation dataset, so I'm not sure what to expect and I'm finding it hard to find any information about this. Could anyone comment on what could be causing this/whether it is expected/whether I should change any of my settings to prevent this from happening?

Many thanks!

# bases             C
# total_reads_used  10042
# count_reads_C     10042
# pass_threshold_C  0.64453125
 base  code  pass_count  pass_frac    all_count  all_frac
 C     m     1198287     0.056338448  1635354    0.0692229
 C     -     19929750    0.93701357   21571243   0.91308916
 C     h     141399      0.00664799   417869     0.017687976
ArtRand commented 3 weeks ago

Hello @hassebossenbroek,

Would you expect there to be any 5hmC in your sample? It is possible that if you don't have any of a specific base modification (5hmC) that the only calls for that modification will be very low confidence, ambiguous positions.

hassebossenbroek commented 3 weeks ago

Hi Arthur,

Thank you for your reply!

Yes, I would expect there to be 5hmC - this is a human cancer cell line. Weirdly, one of my other samples has a mutation in TET2 and would therefore be expected to have lower 5hmC and higher 5mC, but the amounts of 5mC and 5hmC seem to be very comparable in all samples.

I know that two of my samples have lower than expected coverage due to technical issues, but again, the amounts of 5hmC/5mC look similar across all samples, so I didn't think this would be the problem unless all samples were massively undersequenced.

Do you have a reference anywhere for the amount of 5mC/5hmC that would be expected for a human genome?

Thank you!

Manoswini-02 commented 3 weeks ago

Hi @hassebossenbroek

On average, 82% of 5mC calls pass, whereas only 36% of 5hmC calls pass.

Is this data from all samples? The statistics provided for a representative sample indicate that most bases are unmodified (93%), with '-' representing unmodified cytosines. Is this interpretation correct?

I've observed a consistent pattern across all my samples where 5hmC levels are significantly lower compared to 5mC, with the order being 5mC > unmodified > 5hmC. One of my representative modkit summaries (applied to all reads without sub-sampling) has shown the following

> calculating threshold at 10% percentile
> calculated thresholds: C: 0.6875
 bases             C
 total_reads_used  6773225
 count_reads_C     6773225
 pass_threshold_C  0.6875
 base  code  pass_count  pass_frac    all_count  all_frac
 C     h     5617682     0.015102832  12738124   0.030843219
 C     m     284749162   0.76553255   300833476  0.7284175
 C     -     81595336    0.2193646    99424325   0.24073924

I also have a question regarding the low coverage for 5hmC calls; is it because of lower depth? Is this coverage sufficient to reliably identify Differentially hydroxyMethylated Regions (DhMRs) based on these values? Would love to hear back from you and others.

Thanks!!

hassebossenbroek commented 3 weeks ago

Hi @Manoswini-02

Is this data from all samples? The statistics provided for a representative sample indicate that most bases are unmodified (93%), with '-' representing unmodified cytosines. Is this interpretation correct?

That interpretation is correct (which surprised me - but I'm having real trouble finding reference values for genome-wide 5mC coverage, so I don't know if it is as expected. All papers I've read so far only discuss 5mC levels in CpG contexts, which I assume are quite a lot higher than 5mC levels in any genomic context). Did you use a base caller that identifies 5hmCG and 5mCG (CpG dinucleotide contexts only) or 5hmC and 5mC (all genomic contexts)? I do believe that is likely to make a difference, but I'm not sure what levels of 5hmC are expected in either case.

When I said

On average, 82% of 5mC calls pass, whereas only 36% of 5hmC calls pass.

I was comparing the pass_count to all_count. For 5hmC in this particular sample, only 141399 5hmC calls passed the threshold, whilst 417869 were called - a pass rate of 33.8%.

ArtRand commented 3 weeks ago

Hello @hassebossenbroek,

From the reports I've seen, the levels of 5hmC in human samples is somewhere in the range of 1% or less (e.g. Foox et al.). If you're using the "all-context" model on human data, you'll get a modification call at every cytosine, most of which won't be in a CpG dinucleotide (my estimate on chr20 is 5% will be CpGs) and will (most likely) be canonical. Calling canonical cytosine as 5hmC is the most common error mode (one error mode has to be the most common), so most of the 5hmC calls are probably canonical cytosines being erroneously called as 5hmC at low confidence and filtered out. You may try using modkit pileup --motif CG 0 and modkit pileup --motif CH 0 and then calculate the fraction of 5hmC at CpG motifs and non-CpG motifs (the table schema is in the documentation). You could also look at the output probabilities with modkit sample-probs --hist and use a 5hmC specific threshold if you're concerned that the threshold is too high (although 0.644 is really not very high).