eesiribloom opened 2 months ago
Hello @eesiribloom,
Just a sanity check: are these plots generated from similar samples (and similar alignments)? What are you looking for in terms of a mitigation strategy?
The R10 models have a heavier tail as a result of more diverse examples used in training. In general, these models will have higher performance in more contexts. You should still be able to use the (default) estimated threshold calculated by modkit. If you want to use modification-specific thresholds, you should use a different value per mod, per chemistry. We generally don't recommend trying to use the same thresholds across model versions or chemistries.
Do you get similar counts when you run `modkit summary`?
Apologies for the duplication as I wasn't sure where the best place to get an answer was.
The plots are generated from two samples of the same cancer type and cohort, aligned to GRCh38 with the same version and parameters of minimap2. The only difference - not to minimise or underestimate this - is the pore chemistries and subsequent basecalling models.
I suppose if I knew the large peak was some sort of artefact - it does seem strange to have this huge C:m count when the R10.1 samples don't - I might find a way to filter it out.
Obviously I am willing to accept some differences in output based on the different basecalling models and pore chemistries, but I'd be concerned about taking this big peak forward if it might affect downstream analyses, especially as it sits (perhaps oddly) at the high end of the probability scale and so can't easily be removed by thresholding.
Even just understanding the potential reason for the difference would be super helpful.
I've tried running `modkit summary` on both samples.
For the R9.4.1 sample:
```
> sampling 10042 reads from BAM
> calculating threshold at 10(th) percentile
> calculated thresholds: C: 0.640625
# bases C
# total_reads_used 10042
# count_reads_C 10042
# pass_threshold_C 0.640625
 base  code  pass_count  pass_frac    all_count  all_frac
 C     h     17673       0.017328935  31400      0.027721277
 C     -     265376      0.26020953   307517     0.2714893
 C     m     736806      0.7224615    793787     0.70078945
```
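To double-check the arithmetic, `pass_frac` here is just `pass_count` divided by the total number of passing calls; a quick sketch with the R9.4.1 numbers copied from the table above:

```python
# pass_count values copied from the R9.4.1 `modkit summary` table above.
pass_counts = {"h": 17673, "-": 265376, "m": 736806}
total_passing = sum(pass_counts.values())
pass_frac = {code: n / total_passing for code, n in pass_counts.items()}
print(total_passing)             # → 1019855 total passing calls
print(round(pass_frac["m"], 4))  # → 0.7225, matching the 0.7224615 in the table
```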
For the R10.1 sample:
```
> sampling 10042 reads from BAM
> calculating threshold at 10(th) percentile
> calculated thresholds: C: 0.7519531
# bases C
# total_reads_used 10042
# count_reads_C 10042
# pass_threshold_C 0.7519531
 base  code  pass_count  pass_frac    all_count  all_frac
 C     h     30896       0.04399692   54607      0.07007546
 C     m     392828      0.55939996   434672     0.557801
 C     -     278507      0.3966031    289981     0.37212354
```
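The same arithmetic makes the gap between the chemistries concrete; a small sketch comparing the passing 5mC fraction across the two tables (counts copied from above):

```python
# Passing call counts copied from the two `modkit summary` tables above.
r9_pass  = {"h": 17673, "-": 265376, "m": 736806}
r10_pass = {"h": 30896, "-": 278507, "m": 392828}

def frac(counts, code):
    """Fraction of passing calls assigned to the given mod code."""
    return counts[code] / sum(counts.values())

print(f"R9.4.1 5mC pass_frac: {frac(r9_pass, 'm'):.3f}")   # → 0.722
print(f"R10.1  5mC pass_frac: {frac(r10_pass, 'm'):.3f}")  # → 0.559
```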
Curious to know what you might recommend in this situation - maybe just sticking with the default is best? But am I right in thinking the default estimated threshold is not specific to different modifications?
One concern is that `--filter-percentile 0.1` seems quite permissive (e.g. the threshold of 0.64 above isn't very high). Would you recommend upping this, perhaps?
That being said, what happens at sites which are hemimethylated, e.g. only on one allele? I notice a distinct rise in the count of 5hmCG at ~0.5 probability and I wonder if this is allele-specific methylation at play. It seems a shame to filter out that data too...
Just to follow up. I have 13 samples: 7 sequenced with R9.4.1 and 6 sequenced with R10.1.
Using `modkit summary` and looking at the pass_frac for code `m`:
- R9.4.1 samples range from 0.63-0.73 (mean 0.67)
- R10.1 samples range from 0.52-0.69 (mean 0.60)
Hello @eesiribloom,
Sorry about the delay in responding, I've been OOO.
> Curious to know what you might recommend in this situation - maybe just sticking with default is best? But am I right in thinking the default estimated threshold is not specific to different modifications?
I generally recommend using the default (estimated) threshold value. If you want to use a per-modification threshold, use `modkit sample-probs` as you've done and pick a value from the output table. Instructions for how to use the `--filter-threshold` and `--mod-threshold` arguments are here.
> One concern is that a --filter-percentile 0.1 seems quite permissive (e.g. threshold of 0.64 above isn't very high). Would you recommend upping this perhaps?
As far as `--filter-percentile 0.1` resolving to a low pass threshold, the most recent 5hmCG_5mCG model (`dna_r10.4.1_e8.2_400bps_sup@v5.0.0_5mCG_5hmCG@v2.0.1`) should have a reduced bias in the canonical probabilities. If you don't want to re-basecall the data, I would use an explicit threshold value for each of canonical, 5mC, and 5hmC as described above.
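For intuition on what `--filter-percentile 0.1` is doing: the estimated pass threshold is, roughly, the 10th percentile of per-call confidence (the highest probability among canonical/5mC/5hmC for each call). A toy sketch with synthetic confidences, not modkit's actual sampling code:

```python
import random

random.seed(0)
# Fake per-call confidences: most calls are confident, a tail is ambiguous.
confidences = [random.uniform(0.90, 1.00) for _ in range(9000)]
confidences += [random.uniform(0.34, 0.90) for _ in range(1000)]

def percentile(values, q):
    """Nearest-rank percentile, enough for this sketch."""
    s = sorted(values)
    return s[int(q * (len(s) - 1))]

# A 10th-percentile filter keeps the top ~90% most-confident calls.
threshold = percentile(confidences, 0.10)
kept = sum(c >= threshold for c in confidences) / len(confidences)
print(f"threshold={threshold:.3f}, kept={kept:.1%} of calls")
```

If the model has a biased (low-confidence) canonical distribution, that 10th percentile lands lower, which is why an explicit per-mod threshold can be preferable.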
> That being said, what happens at sites which are hemimethylated e.g. only on one allele? I notice a distinct rise in the count of 5hmCG at ~0.5 probability and I wonder if this is allele-specific methylation at play. It seems a shame to filter out that data too...
Hemi-methylation is only really detectable with duplex reads (and `modkit pileup-hemi`, docs). What you're observing (just a guess) is the background false-positive probability distribution of 5hmC. If you post a chart of the distribution, maybe I can comment on that.
Getting to the root of your question, however, I imagine the difference in 5mC calls in the R9 vs R10 data is mostly due to the combined accuracy of the basecaller and the base modification caller. My theory is that the R9 model is more likely to call 5mC near base calling errors, of which there are more in R9. The R10 models (especially the 5hmC model) have higher accuracy, I would feel confident carrying those results forward.
Hi @ArtRand, thanks very much for the detailed reply. Apologies, I meant to say allele-specific methylation, not hemimethylation; I was getting my wires crossed. I was actually wondering: if a site has allele-specific methylation, would the probability come out as ~0.5? If so, these sites would be filtered out automatically.
Hello @eesiribloom,
> I was actually wondering: if a site has allele-specific methylation, would the probability come out as ~0.5? If so, these sites would be filtered out automatically.
If you have allele-specific methylation, the percent modified in the bedMethyl file will be 50%, not the individual read probabilities. Each read reports on a single molecule, which in our hypothetical scenario would be either modified or canonical.
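To make that concrete, here's a toy sketch (synthetic per-read probabilities, not real modkit output): each read is confidently modified or confidently canonical, none sit near 0.5, and the ~50% only appears in the aggregated pileup fraction:

```python
import random

random.seed(1)
# Synthetic P(5mC) per read at one CpG with allele-specific methylation:
# reads from allele A are confidently modified, allele B confidently canonical.
allele_a = [random.uniform(0.95, 1.00) for _ in range(50)]
allele_b = [random.uniform(0.00, 0.05) for _ in range(50)]
reads = allele_a + allele_b

# The pass filter acts on call confidence (the winning class probability),
# so confident canonical calls pass just like confident 5mC calls.
pass_threshold = 0.75
passing = [p for p in reads if max(p, 1 - p) >= pass_threshold]

percent_modified = sum(p > 0.5 for p in passing) / len(passing)
print(len(passing))      # → 100 (no read is filtered out)
print(percent_modified)  # → 0.5 (the pileup shows ~50% modified)
```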
I have samples basecalled with dorado (v0.4.1) with detection of 5mCG_5hmCG modifications enabled, using either:
- `dna_r10.4.1_e8.2_400bps_hac@v4.2.0`
- `dna_r9.4.1_e8_hac@v3.3`
I've noticed the two "batches" have distinct distributions of base-modification probabilities. In particular, the R9.4.1 samples have a massive peak of C:m at the far right of the histogram, which looks like some sort of artefact (second plot).
I'm wondering what the explanation for this would be and what is the best way to mitigate this issue?
command: