nanoporetech / modkit

A bioinformatics tool for working with modified bases
https://nanoporetech.com/
Other
128 stars 7 forks source link

modkit results look off for CHG and CHH contexts - plant methylation #234

Open colindaven opened 1 month ago

colindaven commented 1 month ago

Hi,

I've been puzzled by the lack of calls/site for CHG and CHH methylation for a few days now. I called similar data with Megalodon and modbamtobed a year or two ago and got very different results (similar numbers of sites across all three, and very densely across the genome - with modkit most of the genome is not covered by sites so far). CHH sites are quite broadly defined, so there should be quite a few present in the genome.

Workarounds tried so far

Yet I still seem to get way too few CHG and CHH sites, especially after filtering

wc -l *.bedMethyl
  13432960 TO_mod_pos_CG.bedMethyl
   2077794 TO_mod_pos_CHG.bedMethyl
   7516912 TO_mod_pos_CHH.bedMethyl

wc -l *.bedGraph
 13243677 TO_mod_pos_CG.bedMethyl.bedGraph
    85938 TO_mod_pos_CHG.bedMethyl.bedGraph
   324101 TO_mod_pos_CHH.bedMethyl.bedGraph

Code examples (in nextflow) - just the modkit lines

process modkit_pileup_cpg {
    modkit pileup $cram ${prefix}_CG.bedMethyl -t $task.cpus --ref $params.fasta --motif CG 0 --combine-strands --ignore h
process modkit_pileup_chg {
    modkit pileup $cram ${prefix}_CHG.bedMethyl -t $task.cpus --ref $params.fasta --motif CHG 0 --ignore h
process modkit_pileup_chh {
    modkit pileup $cram ${prefix}_CHH.bedMethyl -t $task.cpus --ref $params.fasta --motif CHH 0 --ignore h

Awk filtering by aligned read coverage to get bedGraph

    awk '\$10 >= $params.minCoverageForMethylation' $bedMethyl | cut -f1-3,11 | sort -T '.' -k1,1 -k2,2n >  ${bedMethyl}.bedGraph
    awk '\$10 >= $params.minCoverageForMethylation' $bedMethyl | cut -f1-3,11 | sort -T '.' -k1,1 -k2,2n >  ${bedMethyl}.bedGraph
    awk '\$10 >= $params.minCoverageForMethylation' $bedMethyl | cut -f1-3,11 | sort -T '.' -k1,1 -k2,2n >  ${bedMethyl}.bedGraph

Thanks Colin

ArtRand commented 1 month ago

Hello @colindaven,

How are you producing the basecalls/base modification calls now? Are you using dorado/remora? Your commands look fine to me. One note is that modkit will not emit a bedmethyl record when valid coverage is zero (as documented in the troubleshooting), you may try adding the --no-filtering flag.

Q1 I am running CPG, CHG and CHH filtering separately - should I be running them together then separating the output file by context ?

You can run them together, however --combine-strands requires that the motif is reverse-complement palindromic, CHH and CHG are not. So you'll have to either drop --combine-strands or change the motifs to ones that can be combined.

Q2 - can you see any other problems or internal modkit reasons why I am getting so few bedMethyl and bedGraph results ?

As mentioned above, it's likely that the missing sites have zero valid coverage. Could you use --log-filepath and report back what the estimated thresholds are?

Q3 - does modkit have other parameters to not produce results for unmethylated Cs in CHG or CHH contexts ?

If you have a strong prior that bases are canonical, you can force the behavior that bases are considered canonical unless there is sufficient evidence that they are modified. There is more information in this thread.

colindaven commented 1 month ago

Hi @ArtRand

thanks for the quick and informative reply.

I called this data with dorado - 0.7.2 I believe - and used a pipeline withminimap -y to map the MM and ML tags from the fastq to the final cram.

Things are looking much better now due to adding the --no-filtering flag. However, the default settings - at least for this application and my testing so far - look to be quite off what I'd expect, as I mentioned above.

See the bottom tracks in red - these were taken with default settings are are very sparse. The 3 upper CG, CHG and CHH tracks are closer to what I'd expect and what I saw from Megalodon and modbamtobed. Note the extreme sparsity of the red calls (mostly nothing at all).

modkit_methyl_no_filter_vs_defaults

I can't imagine effective coverage will be zero since these are 50X datasets. Do you avoid calling in repeat regions since plants are full of repeats.

I'll keep experimenting and discuss these results with colleagues as I get more datasets coming in.

Quantitatively the numbers of CHG and CHH are looking a lot better, relative to the CPGs. I'll have to check how many CPGs there are in this genome though.


wc -l *yl.bedGraph
  13459001 TO_mod_pos_CG.bedMethyl.bedGraph
   3469108 TO_mod_pos_CHG.bedMethyl.bedGraph
  12144973 TO_mod_pos_CHH.bedMethyl.bedGraph

wc -l *yl
  13459001 TO_mod_pos_CG.bedMethyl
   3469108 TO_mod_pos_CHG.bedMethyl
  12144973 TO_mod_pos_CHH.bedMethyl

I don't think this is the threshold you mean - from the command modkit summary. I ran each command above with --log-filepath cpg_modkit.log but could not see meaningful coverage data/est thresholds (for me). Maybe that is due to the --no-filtering tag?

T053_mod_pos_summary.log:[src/commands.rs::837][2024-07-22 18:12:58][INFO] calculating threshold at 10% percentile
T053_mod_pos_summary.log:[src/thresholds.rs::94][2024-07-22 18:12:58][DEBUG] calculating per base thresholds
T053_mod_pos_summary.log:[src/thresholds.rs::97][2024-07-22 18:13:05][DEBUG] probs per base took 7s
T053_mod_pos_summary.log:[src/thresholds.rs::114][2024-07-22 18:13:06][DEBUG] filter thresholds took 0s
T053_mod_pos_summary.log:[src/thresholds.rs::85][2024-07-22 18:13:06][INFO] calculated thresholds: C: 0.7050781

BTW, the performance of the tool is really nice!

Thanks. Colin

ArtRand commented 1 month ago

Hello @colindaven,

Sorry for the delay.

Just to recap, if a genomic position has 0 valid coverage, it won't be written in the bedMethyl/bedGraph table. The --no-filtering flag essentially disables the pass/fail behavior so valid coverage should be roughly equivalent to the stranded read coverage. It seems strange to me that the modification calls in these regions would all be low confidence. The log file snippet you send has a record:

T053_mod_pos_summary.log:[src/thresholds.rs::85][2024-07-22 18:13:06][INFO] calculated thresholds: C: 0.7050781

The threshold value of 0.7 is just fine from my experience. One thing you could look at is the raw read call probabilities in these regions of sparse calls:

$ modkit extract ${bam} null --read-calls --region ${region} --filter-threshold 0.7050781

where ${region} is chr:start-stop like usual. This will produce a read calls table where the fail column will tell you if the call would have failed with the provided threshold. You can then load up this table into a data frame and look at the distribution of call_prob. I'd be curious to know if it's systematically low in these regions and if so by how much.

Glad you like the tool, the next major release should have more performance improvements as well.

ArtRand commented 2 weeks ago

@colindaven Any luck here?

colindaven commented 2 weeks ago

This one slipped through, apologies.

I'm not setting a fail column - do you mean the inferred column - and false? Or I probably need to adjust my command but don't know how quite yet.

Command - note this is quite different to yours -

I omitted null and --read-calls after they failed with this error error: a value is required for '--read-calls-path <READ_CALLS_PATH>' but none was supplied

Should by the path to the bam/cram again (or the CPG calls tsv output file?).

modkit extract --region TO_Heinz1706_hq-KG-ONTv1.chrSL4.0ch03:8400000-8500000 --filter-threshold 0.7050781 /mnt/scratch/TO_mod_pos.cram out3.tsv

_mod_strand  fw_soft_clipped_start  fw_soft_clipped_end  read_length  mod_qual     mod_code  base_qual  ref_kmer  query_kmer  canonical_base  modified_primary_base  inferred  flag
             22                     6                    248321       0.001953125  h         50         .         TTCGC       C               C                      false     16
             22                     6                    248321       0.009765625  m         50         .         TTCGC       C               C                      false     16
             22                     6                    248321       0.009765625  h         46         .         TTCGG       C               C                      false     16
             22                     6                    248321       0.017578125  m         46         .         TTCGG       C               C                      false     16
             22                     6                    248321       0.005859375  h         50         .         AACGA       C               C                      false     16
             22                     6                    248321       0.009765625  m         50         .         AACGA       C               C                      false     16
             22                     6                    248321       0.001953125  h         50         .         ATCGA       C               C                      false     16
             22                     6                    248321       0.001953125  m         50         .         ATCGA       C               C                      false     16
             22                     6                    248321       0.005859375  h         50         .         ATCGT       C               C                      false     16
             22                     6                    248321       0.009765625  m         50         .         ATCGT       C               C                      false     16
             22                     6                    248321       0.001953125  h         50         .         TCCGA       C               C                      false     16
             22                     6                    248321       0.005859375  m         50         .         TCCGA       C               C                      false     16
             22                     6                    248321       0.001953125  h         44         .         ATCGA       C               C                      false     16
             22                     6                    248321       0.009765625  m         44         .         ATCGA       C               C                      false     16
             22                     6                    248321       0.001953125  h         50         .         TGCGA       C               C                      false     16
             22                     6                    248321       0.005859375  m         50         .         TGCGA       C               C                      false     16
             22                     6                    248321       0.009765625  h         50         .         GCCGA       C               C                      false     16
             22                     6                    248321       0.013671875  m         50         .         GCCGA       C               C                      false     16
             22                     6                    248321       0.021484375  h         50         .         GGCGA       C               C                      false     16
             22                     6                    248321       0.029296875  m         50         .         GGCGA       C               C                      false     16