eesiribloom opened 2 months ago
Hello @eesiribloom,
Just a sanity check: are these plots generated from similar samples (and similar alignments)? What are you looking for in terms of a mitigation strategy?
The R10 models have a heavier tail as a result of more diverse examples used in training. In general, these models will have higher performance in more contexts. You should still be able to use the (default) estimated threshold calculated by modkit. If you want to use modification-specific thresholds, you should use a different value per mod, per chemistry. We generally don't recommend trying to use the same thresholds across model versions or chemistries.
Do you get similar counts when you run `modkit summary`?
Apologies for the duplication as I wasn't sure where the best place to get an answer was.
The plots are generated from two samples of the same cancer type and cohort, aligned to GRCh38 with the same version and parameters of minimap2. The only difference - not to minimise or underestimate this - is the pore chemistries and subsequent basecalling models.
I suppose if I knew the large peak was some sort of artefact - it does seem strange to have this huge C:m count when the R10.1 samples don't - I might find a way to filter it out.
Obviously I am willing to accept some differences in output based on the different basecalling models and pore chemistries, but I'd be concerned about taking this big peak forward if it might affect downstream analyses, especially as it sits (perhaps oddly) at the high end of the probability scale and so can't easily be removed by thresholding.
Even just understanding the potential reason for the difference would be super helpful.
I've tried running `modkit summary` on both samples.
For the R9.4.1 sample:
```
> sampling 10042 reads from BAM
> calculating threshold at 10(th) percentile
> calculated thresholds: C: 0.640625
# bases C
# total_reads_used 10042
# count_reads_C 10042
# pass_threshold_C 0.640625
 base  code  pass_count  pass_frac    all_count  all_frac
 C     h     17673       0.017328935  31400      0.027721277
 C     -     265376      0.26020953   307517     0.2714893
 C     m     736806      0.7224615    793787     0.70078945
```
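To double-check the arithmetic, `pass_frac` here is just `pass_count` divided by the total number of passing calls; a quick sketch with the R9.4.1 numbers copied from the table above:

```python
# pass_count values copied from the R9.4.1 `modkit summary` table above.
pass_counts = {"h": 17673, "-": 265376, "m": 736806}
total_passing = sum(pass_counts.values())
pass_frac = {code: n / total_passing for code, n in pass_counts.items()}
print(total_passing)             # → 1019855 total passing calls
print(round(pass_frac["m"], 4))  # → 0.7225, matching the 0.7224615 in the table
```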
For the R10.1 sample:
```
> sampling 10042 reads from BAM
> calculating threshold at 10(th) percentile
> calculated thresholds: C: 0.7519531
# bases C
# total_reads_used 10042
# count_reads_C 10042
# pass_threshold_C 0.7519531
 base  code  pass_count  pass_frac    all_count  all_frac
 C     h     30896       0.04399692   54607      0.07007546
 C     m     392828      0.55939996   434672     0.557801
 C     -     278507      0.3966031    289981     0.37212354
```
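The same arithmetic makes the gap between the chemistries concrete; a small sketch comparing the passing 5mC fraction across the two tables (counts copied from above):

```python
# Passing call counts copied from the two `modkit summary` tables above.
r9_pass  = {"h": 17673, "-": 265376, "m": 736806}
r10_pass = {"h": 30896, "-": 278507, "m": 392828}

def frac(counts, code):
    """Fraction of passing calls assigned to the given mod code."""
    return counts[code] / sum(counts.values())

print(f"R9.4.1 5mC pass_frac: {frac(r9_pass, 'm'):.3f}")   # → 0.722
print(f"R10.1  5mC pass_frac: {frac(r10_pass, 'm'):.3f}")  # → 0.559
```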
Curious to know what you might recommend in this situation - maybe just sticking with the default is best? But am I right in thinking the default estimated threshold is not specific to different modifications?
One concern is that `--filter-percentile 0.1` seems quite permissive (e.g. the threshold of 0.64 above isn't very high). Would you recommend upping this, perhaps?
That being said, what happens at sites which are hemimethylated, e.g. only on one allele? I notice a distinct rise in the count of 5hmCG at ~0.5 probability and I wonder if this is allele-specific methylation at play. It seems a shame to filter out that data too...
Just to follow up. I have 13 samples: 7 sequenced with R9.4.1 and 6 sequenced with R10.1.
Using `modkit summary` and looking at the pass_frac for code `m`:
- R9.4.1 samples range from 0.63-0.73 (mean 0.67)
- R10.1 samples range from 0.52-0.69 (mean 0.60)
Hello @eesiribloom,
Sorry about the delay in responding, I've been OOO.
> Curious to know what you might recommend in this situation - maybe just sticking with default is best? But am I right in thinking the default estimated threshold is not specific to different modifications?
I generally recommend using the default (estimated) threshold value. If you want to use a per-modification threshold, use `modkit sample-probs` as you've done and pick a value from the output table. Instructions for how to use the `--filter-threshold` and `--mod-threshold` arguments are here.
> One concern is that a --filter-percentile 0.1 seems quite permissive (e.g. threshold of 0.64 above isn't very high). Would you recommend upping this perhaps?
As far as `--filter-percentile 0.1` resolving to a low pass threshold, the most recent 5hmCG_5mCG model (`dna_r10.4.1_e8.2_400bps_sup@v5.0.0_5mCG_5hmCG@v2.0.1`) should have a reduced bias in the canonical probabilities. If you don't want to re-basecall the data, I would use an explicit threshold value for each of canonical, 5mC, and 5hmC as described above.
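For intuition on what `--filter-percentile 0.1` is doing: the estimated pass threshold is, roughly, the 10th percentile of per-call confidence (the highest probability among canonical/5mC/5hmC for each call). A toy sketch with synthetic confidences, not modkit's actual sampling code:

```python
import random

random.seed(0)
# Fake per-call confidences: most calls are confident, a tail is ambiguous.
confidences = [random.uniform(0.90, 1.00) for _ in range(9000)]
confidences += [random.uniform(0.34, 0.90) for _ in range(1000)]

def percentile(values, q):
    """Nearest-rank percentile, enough for this sketch."""
    s = sorted(values)
    return s[int(q * (len(s) - 1))]

# A 10th-percentile filter keeps the top ~90% most-confident calls.
threshold = percentile(confidences, 0.10)
kept = sum(c >= threshold for c in confidences) / len(confidences)
print(f"threshold={threshold:.3f}, kept={kept:.1%} of calls")
```

If the model has a biased (low-confidence) canonical distribution, that 10th percentile lands lower, which is why an explicit per-mod threshold can be preferable.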
> That being said, what happens at sites which are hemimethylated e.g. only on one allele? I notice a distinct rise in the count of 5hmCG at ~0.5 probability and I wonder if this is allele-specific methylation at play. It seems a shame to filter out that data too...
Hemi-methylation is only really detectable with duplex reads (and `modkit pileup-hemi`, docs). What you're observing (just a guess) is the background false-positive probability distribution of 5hmC. If you post a chart of the distribution, maybe I can comment on that.
Getting to the root of your question, however, I imagine the difference in 5mC calls in the R9 vs R10 data is mostly due to the combined accuracy of the basecaller and the base modification caller. My theory is that the R9 model is more likely to call 5mC near base calling errors, of which there are more in R9. The R10 models (especially the 5hmC model) have higher accuracy, I would feel confident carrying those results forward.
Hi @ArtRand, thanks very much for the detailed reply. Apologies, I meant to say allele-specific methylation, not hemimethylation; I was getting my wires crossed. I was actually wondering: if a site has allele-specific methylation, would the probability come out as ~0.5? If so, these sites would be filtered out automatically.
Hello @eesiribloom,
> I was actually wondering: if a site has allele-specific methylation, would the probability come out as ~0.5? If so, these sites would be filtered out automatically.
If you have allele-specific methylation, the percent modified in the bedMethyl file will be 50%, not the individual read probabilities. Each read reports on a single molecule, which in our hypothetical scenario would be either modified or canonical.
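To make that concrete, here's a toy sketch (synthetic per-read probabilities, not real modkit output): each read is confidently modified or confidently canonical, none sit near 0.5, and the ~50% only appears in the aggregated pileup fraction:

```python
import random

random.seed(1)
# Synthetic P(5mC) per read at one CpG with allele-specific methylation:
# reads from allele A are confidently modified, allele B confidently canonical.
allele_a = [random.uniform(0.95, 1.00) for _ in range(50)]
allele_b = [random.uniform(0.00, 0.05) for _ in range(50)]
reads = allele_a + allele_b

# The pass filter acts on call confidence (the winning class probability),
# so confident canonical calls pass just like confident 5mC calls.
pass_threshold = 0.75
passing = [p for p in reads if max(p, 1 - p) >= pass_threshold]

percent_modified = sum(p > 0.5 for p in passing) / len(passing)
print(len(passing))      # → 100 (no read is filtered out)
print(percent_modified)  # → 0.5 (the pileup shows ~50% modified)
```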
I have samples basecalled with dorado (v0.4.1) with detection of 5mCG_5hmCG modifications enabled, using either:
- `dna_r10.4.1_e8.2_400bps_hac@v4.2.0`
- `dna_r9.4.1_e8_hac@v3.3`
I've noticed the two "batches" have distinct distributions of base-modification probabilities. In particular, the R9.4.1 samples have a massive peak of C:m at the far right of the histogram, which looks like some sort of artefact (second plot).
I'm wondering what the explanation for this would be and what is the best way to mitigate this issue?
command: