Open colindaven opened 1 month ago
Hello @colindaven,
How are you producing the basecalls/base modification calls now? Are you using dorado/remora? Your commands look fine to me. One note is that modkit
will not emit a bedmethyl record when valid coverage is zero (as documented in the troubleshooting), you may try adding the --no-filtering
flag.
Q1 I am running CPG, CHG and CHH filtering separately - should I be running them together then separating the output file by context ?
You can run them together, however --combine-strands
requires that the motif is reverse-complement palindromic, CHH and CHG are not. So you'll have to either drop --combine-strands
or change the motifs to ones that can be combined.
Q2 - can you see any other problems or internal modkit reasons why I am getting so few bedMethyl and bedGraph results ?
As mentioned above, it's likely that the missing sites have zero valid coverage. Could you use --log-filepath
and report back what the estimated thresholds are?
Q3 - does modkit have other parameters to not produce results for unmethylated Cs in CHG or CHH contexts ?
If you have a strong prior that bases are canonical, you can force the behavior that bases are considered canonical unless there is sufficient evidence that they are modified. There is more information in this thread.
Hi @ArtRand
thanks for the quick and informative reply.
I called this data with dorado - 0.7.2
I believe - and used a pipeline withminimap -y
to map the MM and ML tags from the fastq to the final cram.
Things are looking much better now due to adding the --no-filtering
flag. However, the default settings
- at least for this application and my testing so far - look to be quite off what I'd expect, as I mentioned above.
See the bottom tracks in red - these were taken with default settings are are very sparse. The 3 upper CG, CHG and CHH tracks are closer to what I'd expect and what I saw from Megalodon and modbamtobed. Note the extreme sparsity of the red calls (mostly nothing at all).
I can't imagine effective coverage will be zero since these are 50X datasets. Do you avoid calling in repeat regions since plants are full of repeats.
I'll keep experimenting and discuss these results with colleagues as I get more datasets coming in.
Quantitatively the numbers of CHG and CHH are looking a lot better, relative to the CPGs. I'll have to check how many CPGs there are in this genome though.
wc -l *yl.bedGraph
13459001 TO_mod_pos_CG.bedMethyl.bedGraph
3469108 TO_mod_pos_CHG.bedMethyl.bedGraph
12144973 TO_mod_pos_CHH.bedMethyl.bedGraph
wc -l *yl
13459001 TO_mod_pos_CG.bedMethyl
3469108 TO_mod_pos_CHG.bedMethyl
12144973 TO_mod_pos_CHH.bedMethyl
I don't think this is the threshold you mean - from the command modkit summary. I ran each command above with
--log-filepath cpg_modkit.log
but could not see meaningful coverage data/est thresholds (for me). Maybe that is due to the --no-filtering tag?
T053_mod_pos_summary.log:[src/commands.rs::837][2024-07-22 18:12:58][INFO] calculating threshold at 10% percentile
T053_mod_pos_summary.log:[src/thresholds.rs::94][2024-07-22 18:12:58][DEBUG] calculating per base thresholds
T053_mod_pos_summary.log:[src/thresholds.rs::97][2024-07-22 18:13:05][DEBUG] probs per base took 7s
T053_mod_pos_summary.log:[src/thresholds.rs::114][2024-07-22 18:13:06][DEBUG] filter thresholds took 0s
T053_mod_pos_summary.log:[src/thresholds.rs::85][2024-07-22 18:13:06][INFO] calculated thresholds: C: 0.7050781
BTW, the performance of the tool is really nice!
Thanks. Colin
Hello @colindaven,
Sorry for the delay.
Just to recap, if a genomic position has 0 valid coverage, it won't be written in the bedMethyl/bedGraph table. The --no-filtering
flag essentially disables the pass/fail behavior so valid coverage should be roughly equivalent to the stranded read coverage. It seems strange to me that the modification calls in these regions would all be low confidence. The log file snippet you send has a record:
T053_mod_pos_summary.log:[src/thresholds.rs::85][2024-07-22 18:13:06][INFO] calculated thresholds: C: 0.7050781
The threshold value of 0.7 is just fine from my experience. One thing you could look at is the raw read call probabilities in these regions of sparse calls:
$ modkit extract ${bam} null --read-calls --region ${region} --filter-threshold 0.7050781
where ${region}
is chr:start-stop like usual. This will produce a read calls table where the fail
column will tell you if the call would have failed with the provided threshold. You can then load up this table into a data frame and look at the distribution of call_prob
. I'd be curious to know if it's systematically low in these regions and if so by how much.
Glad you like the tool, the next major release should have more performance improvements as well.
@colindaven Any luck here?
This one slipped through, apologies.
I'm not setting a fail
column - do you mean the inferred
column - and false
? Or I probably need to adjust my command but don't know how quite yet.
Command - note this is quite different to yours -
I omitted null and --read-calls
after they failed with this error
error: a value is required for '--read-calls-path <READ_CALLS_PATH>' but none was supplied
Should
modkit extract --region TO_Heinz1706_hq-KG-ONTv1.chrSL4.0ch03:8400000-8500000 --filter-threshold 0.7050781 /mnt/scratch/TO_mod_pos.cram out3.tsv
_mod_strand fw_soft_clipped_start fw_soft_clipped_end read_length mod_qual mod_code base_qual ref_kmer query_kmer canonical_base modified_primary_base inferred flag
22 6 248321 0.001953125 h 50 . TTCGC C C false 16
22 6 248321 0.009765625 m 50 . TTCGC C C false 16
22 6 248321 0.009765625 h 46 . TTCGG C C false 16
22 6 248321 0.017578125 m 46 . TTCGG C C false 16
22 6 248321 0.005859375 h 50 . AACGA C C false 16
22 6 248321 0.009765625 m 50 . AACGA C C false 16
22 6 248321 0.001953125 h 50 . ATCGA C C false 16
22 6 248321 0.001953125 m 50 . ATCGA C C false 16
22 6 248321 0.005859375 h 50 . ATCGT C C false 16
22 6 248321 0.009765625 m 50 . ATCGT C C false 16
22 6 248321 0.001953125 h 50 . TCCGA C C false 16
22 6 248321 0.005859375 m 50 . TCCGA C C false 16
22 6 248321 0.001953125 h 44 . ATCGA C C false 16
22 6 248321 0.009765625 m 44 . ATCGA C C false 16
22 6 248321 0.001953125 h 50 . TGCGA C C false 16
22 6 248321 0.005859375 m 50 . TGCGA C C false 16
22 6 248321 0.009765625 h 50 . GCCGA C C false 16
22 6 248321 0.013671875 m 50 . GCCGA C C false 16
22 6 248321 0.021484375 h 50 . GGCGA C C false 16
22 6 248321 0.029296875 m 50 . GGCGA C C false 16
Hi,
I've been puzzled by the lack of calls/site for CHG and CHH methylation for a few days now. I called similar data with Megalodon and modbamtobed a year or two ago and got very different results (similar numbers of sites across all three, and very densely across the genome - with modkit most of the genome is not covered by sites so far). CHH sites are quite broadly defined, so there should be quite a few present in the genome.
Workarounds tried so far
Yet I still seem to get way too few CHG and CHH sites, especially after filtering
Code examples (in nextflow) - just the modkit lines
Awk filtering by aligned read coverage to get bedGraph
Thanks Colin