Closed faulk-lab closed 1 year ago
Hello @faulk-lab,
One quick point of clarification that you may already be aware of, modkit summary
doesn't compute statistics on the bedMethyl, it computes statistics on the read-level base modification calls. In general however, your intuition is correct that there is a correspondence between the two. The --cpg
flag used in modkit pileup
sub-samples the modification calls to only ones aligned to a reference CpG, whereas all of the base modification calls are used by modkit summary
to compute the statistics shown.
Did you use the default filtering settings when running modkit pileup
? From the output you posted, it looks like you used modkit summary
with the default settings that don't perform any filtering. When running modkit pileup
the model confidence threshold should be logged, you could try running modkit summary
with the --filter-threshold
option set to the same value as you used when making the pileup. It's possible that the lower confidence calls are biased towards 5mC. More than happy to continue to help troubleshoot.
Hi ArtRand,
Yes the two methods of determining "global" values are measuring somewhat different things in the bam vs. the bed, but as you say, they should be essentially near each other. Your suggestion was very helpful in two ways. 1) I re-ran the modkit pileup
without the --cpg
flag and the --ref
to get a bedmethyl of all modified C's in the bam (which should realistically still be CpG sites since the dorado basecaller's model only calls 5mC_5hmC in CpG context which is why I didn't think it mattered before). Apparently it does indeed matter. Now when I awk to summarize the .bed without that --cpg flag I get:
./dist/modkit pileup -t 32 calls.modmapped.bam calls.modmapped.modkit.bed
awk '$4=="m" {can+=$13; mod+=$12; oth+=$14; valid+=$10} END{print (can/valid) " CpG canonical\n" (mod/valid) " 5mCpG modified\n" (oth/valid) " 5hmCpG modified"}' calls.modmapped.modkit.bed
0.357482 CpG canonical
0.465608 5mCpG modified
0.17691 5hmCpG modified
Compared to the ./dist/modkit summary calls.modmapped.aln.bam -t 32 --filter-threshold 0.7128906
mod_bases C
count_reads_C 4032558
C_calls_modified_m 20592271
C_frac_modified_m 0.4818301395620696
C_filtered_modified_m 2074299
C_calls_unmodified 16296287
C_frac_unmodified 0.381310164359897
C_filtered_unmodified 618311
C_calls_modified_h 5849057
C_frac_modified_h 0.13685969607803336
C_filtered_modified_h 1521848
C_total_mod_calls 42737615
C_total_filtered_mod_calls 4214458
total_reads_used 4032558
It's pretty close. 35.7% vs 38.1% unmodified 46.5% vs. 48.1% 5mc modified 17.6% vs 13.6% 5hmc modified
Of course, the numbers I really want are the global percentages of the mapped reads. So I plan to run modkit pileup --cpg --ref
and run my awk script on the .bed output of that.
Do you agree that those global percentages would be reliable?
(PS. 2nd way you were helpful. I used the --filter-threshold 0.7128906
with modkit summary this time because I saw that's the value that modkit pileup chose for itself. I don't have an opinion on thresholds so I'll just go with what modkit chooses)
Now for a second weird validation question. Modkit vs. Modbam2bed. I ran the same bam file as above through modbam2bed
against the same reference genome and wrote an awk script to summarize each modification type.
Here's the awk script for 5mC: awk '$5>0 {can+=$12; mc+=$13; hmc+=$16} END{print 100*(mc/(can+mc+hmc))}'
I modified it to also give me canonical and 5hmC. In theory, it should give me the same percentages as the awk script from modkit pileup --cpg --ref
's bedmethyl file, just by adding up different columns, since they report the same counts. Yet here's what I see.
modkit pileup --cpg --ref 0.442508 CpG canonical 0.386512 5mCpG modified 0.17098 5hmCpG modified
modbam2bed -m 5mC -t 32 -e -p calls.modmapped --aggregate --cpg ~/Desktop/genomes/mouse/mm39.fa -f 0.7128906 calls.modmapped.bam > calls.modmapped.bed
Summarize with awk '$5>0 {can+=$12; mc+=$13; hmc+=$16} END{print 100*(can/(can+mc+hmc))}' calls.modmapped.cpg.acc.bed
30.6254 CpG canonical
54.3297 5mCpG
15.045 5hmCpG
Now those are pretty far apart. Do you have any thoughts?
One tiny point, running pileup
with --cpg --ref
doesn't aggregate across strands, do to that you need to add the --combine-strands
flag. It shouldn't matter the way you're doing the arithmetic as long as you only use the m
lines, but just something to keep in mind.
To answer your first question, I'd imagine that the calculation you're performing will give a global estimate of the levels of 5mC and 5hmC. As you can imagine, when you start looking for changes it gets a little more complicated (but that's a discussion for another time).
Regarding comparison to modbam2bed, a few things to check.
Modkit filters/removes calls that are <= threshold, whereas modbam2bed uses < threshold (keeping calls == to the threshold value). In general, I don't recommend lowering the threshold. Given that the probabilities are discretized when making the modBAM into 255 bins, you could increase the threshold value given to modbam2bed to 0.7167969
.
Run modkit
with the --log-filepath
option, and take a look at what gets logged. If your reads have the .
mode, you need to force modkit
to use those reads with the --force-allow-implicit
option. The CpG base modification models should be emitting MM tags with the ?
mode, but if you've merged in old data this could happen.
Could you print out the actual counts for a sanity check?
Thanks for the confirmation of my awk calculation of the global 5mc 5hmc values. That's an important number in some studies. I had also been including the --combine-strands
flag, I just omitted it for brevity.
I re-ran modbam2bed with 0.7167969
and it didn't make much of a difference.
30.6666 canonical
54.3143 5mC
15.0191 5hmC
I ran modkit with the --log-filepath
option and it looked fine. No .
or any old merged data. Only MM and ML flags.
Counts are interesting modbam2bed site_count valid_site_count percent 10,010,083 32,641,637 30.6666 canonical 17,729,077 32,641,637 54.3143 5mc 4,902,477 32,641,637 15.0191 5hmc Total lines = 13,570,243
modkit site_count valid_site_count percent 8,181,072 18,487,960 0.442508 canonical 7,145,819 18,487,960 0.386512 5mc 3,161,069 18,487,960 0.17098 5hmc Total lines = 7,865,946 ("m" only lines)
Mystery solved? Questions.
My commands to generate and assess the two bedmethyls were as follows:
modbam2bed -m 5mC -t 32 -e -p calls.modmapped --aggregate --cpg ~/Desktop/genomes/mouse/mm39.fa -f 0.7128906 calls.modmapped.bam > calls.modmapped.bed
awk '$5>0 {can+=$12; mc+=$13; hmc+=$16} END{print can "\t" (can+mc+hmc) "\t" 100*(can/(can+mc+hmc))}' calls.modmapped.cpg.acc.bed
/dist/modkit pileup -t 32 --cpg --ref ~/Desktop/genomes/mouse/mm39.fa --combine-strands --log-filepath logmodkit.txt calls.modmapped.bam calls.modmapped.modkit.bed
awk '$4=="m" {can+=$13; mod+=$12; oth+=$14; valid+=$10} END{print oth "\t" valid "\t" (can/valid) " CpG canonical\n" (mod/valid) " 5mCpG modified\n" (oth/valid) " 5hmCpG modified"}' calls.modmapped.modkit.bed
Modbam2bed output chr1 3050194 3050196 5mC 750 . 3050194 3050196 0,0,0 4 100.00 0 3 1 0 0 chr1 3050222 3050224 5mC 750 . 3050222 3050224 0,0,0 4 66.67 0 2 1 0 1 chr1 3050261 3050263 5mC 1000 . 3050261 3050263 0,0,0 3 33.33 2 1 0 0 0 chr1 3050330 3050332 5mC 1000 . 3050330 3050332 0,0,0 3 33.33 2 1 0 0 0 chr1 3050402 3050404 5mC 500 . 3050402 3050404 0,0,0 2 100.00 0 1 1 0 0 chr1 3050555 3050557 5mC 1000 . 3050555 3050557 0,0,0 2 100.00 0 2 0 0 0 chr1 3050665 3050667 5mC 1000 . 3050665 3050667 0,0,0 2 50.00 1 1 0 0 0 chr1 3051030 3051032 5mC 1000 . 3051030 3051032 0,0,0 1 100.00 0 1 0 0 0
Modkit output Note the skip from 3050261 directly to 3051030 chr1 3050194 3050195 m 3 . 3050194 3050195 255,0,0 3 100.00 3 0 0 0 1 0 0 chr1 3050222 3050223 m 3 . 3050222 3050223 255,0,0 3 66.67 2 0 1 0 1 0 0 chr1 3050261 3050262 m 3 . 3050261 3050262 255,0,0 3 33.33 1 2 0 0 0 0 0 chr1 3051030 3051031 m 1 . 3051030 3051031 255,0,0 1 100.00 1 0 0 0 0 0 0 chr1 3051199 3051200 m 1 . 3051199 3051200 255,0,0 1 0.00 0 1 0 0 0 0 0
Take a look at chr1 3050330 3050332. I confirmed a CpG exists at that position in mm39. I tried modbam2bed
with a very stringent threshold (0.98) and while it doesn't call mods a most positions, it still lists all the positions including 3050330. I tried calling modkit --no-filtering
, and while it warns me, it still does not call position 3050330.
What is going on? I cannot see any reason that modkit
doesn't like position chr1 3050330, or many others, while modbam2bed
is perfectly happy with them.
It's not the thresholds, there's a bug in the way that modkit is finding the CGs, it's missing the masked ones (cg
in the reference, 🤦 ). It's a trivial fix, I can give you a patch build in an hour or so. Alternatively, you could run modkit without the --cpg
flag, so it doesn't perform any filering on motifs. You'll get rows that are not aligned to CGs, however these can usually be identified because the N_diff
is high and score is low. Sorry about the hassle.
Well that makes sense. Hell, it could even be a feature if it were switchable! (--all --masked --unmasked). Results now make a lot of sense in with that info. We are seeing 54% 5mC across ALL sites from modbam2bed, and just 38% across UNmasked sites because of the modkit bug.
Methylation is generally higher in repeats which modkit is excluding, so it makes sense that modkit's total is lower. Plus the 5hmC results are totally new to science, 15% in all sites vs. 17% in unmasked regions, almost no bias in 5hmC in transposons. There you go. You accidentally discovered a neat fact :)
Looking forward to the fix. Also, modkit appears to run about 2x faster than modbam2bed. I assume because it's written in rust. Very nice.
@ArtRand I had this difference in my mental notes and forgot to point it out in review. I could have saved you both the goose chase!
modbam2bed
had that option --mask
to respect soft masking in reference files.
Ok, I've put a new release into v0.1.3. I tested using the motif-bed
subcommand that the missing sites from chr1 are recognized. I also added a --mask
flag that will reproduce the old behavior and ignore masked sites. I like your suggestion, so I may borrow it for the next release, but I saw it a little too late for this sprint effort. The speed difference you're seeing could be in part to modkit doing less work with the missing locations.. but now you'll be able to see. TBH I have not done any optimization work yet. Really excited to see results from the 5hmC model!
I've tested v0.1.3 and things are improving. With default filtering and threshold settings the total site counts are much closer now, with modkit
a still not reporting as many as modbam2bed
. The percentages are closer now too. Probably still different due to the million missing lines.
modbam2bed (defaults) 30.1295 CpG canonical 54.4692 5mCpG modified 15.4013 5hmCpG modified Total lines = 13,570,243
modkit (defaults) 0.358723 CpG canonical 0.466281 5mCpG modified 0.174996 5hmCpG modified Total lines 12,307,062
So here's the philosophical question. Do I trust modbam2bed's
calculations, assuming a static default threshold of 0.66, or should I prefer modkit's
threshold because it is dynamically produced from the actual bam file samples? I am inclined to trust modkit
more.
I'd like your opinions.
Part of the motivation for modkit was to implement filtering out the lowest confidence base modification calls based on the data in the BAM records. This is the way we perform filtering when reporting metrics. The modbam2bed default threshold is very low, and modkit will even give a warning with 0.66. In our experience the threshold is important to get the best performance. So to answer your question, I would trust modkit's calculations of the %5mC, %5hmC at the sites produced in the bedMethyl based on the default settings, dropping the 10% lowest confidence calls.
The difference in total lines is a little concerning, however. Modkit will not produce lines when there aren't any base modification counts (which could happen if all the calls were filtered or there was 0 coverage). This is a slight shift in stance from modbam2bed, trying to focus on the base modification counts and ratios. You could try running modkit with --no-filtering
or --filter-threshold 0.65
to determine if this is the case. If it's not, I need to look more closely into what's going on.
That's what I suspected. Thanks for the explanation and I believe I will use modkit going forward.
I ran with --no-filtering
and get 13,036,854 reads, just 533,389 short of the modbam2bed total. So it helps. I suspect the reason has to do with this being low-pass data, so many lines may have 0 reads. I think we are on solid ground now though.
I'll close the thread unless you want me to do any more checks and tests. Thanks a million for your help!
You bet, more than happy to help, and thanks for the suggestion about masking. Research like yours is one of the reasons I like working on this stuff!
I am comparing
modkit summary
against themodkit pileup
bed file to confirm whether the percentages match up and it appears something is off. I called a fast5 sequenced with mouse brain DNA using dorado using a 5mc_5hmc model so I expect ample quantities of both modifications. I used modkit pileup with the--cpg flag
and receive a bedmethyl with the format described in the modkit documentation.Modkit summary reveals the following: ./dist/modkit summary calls.modmapped.bam mod_bases C count_reads_C 4081393 C_calls_modified_m 22817251 C_frac_modified_m 0.4827792093865955 C_filtered_modified_m 0 C_calls_unmodified 17031731 C_frac_unmodified 0.36036618200260717 C_filtered_unmodified 0 C_calls_modified_h 7413308 C_frac_modified_h 0.1568546086107973 C_filtered_modified_h 0 C_total_mod_calls 47262290 C_total_filtered_mod_calls 0 total_reads_used 4081393
However, my when I run awk against the resulting .bed file I see the following: awk '$4=="m" {can+=$13; mod+=$12; oth+=$14; valid+=$10} END{print (can/valid) " CpG canonical\n" (mod/valid) " 5mCpG modified\n" (oth/valid) " 5hmCpG modified"}' calls.modmapped.modkit.bed 0.442508 CpG canonical 0.386512 5mCpG modified 0.17098 5hmCpG modified
I don't expect the percentages to match up perfectly because of differences in the mapping in the .bam file and in the resulting .bed file of course, but they should be close. The 5hmC percentage is probably correct, but the canonical C and modified C appear to be switched. Do I have that right? According to the Readme.md, column 13 is canonical, and column 12 is modified (in this case, column 14 "other" is 5hmC count). Is it possible that 'modkit summary' is parsing columns 12 and 13 as swapped?