nanoporetech / modkit

A bioinformatics tool for working with modified bases
https://nanoporetech.com/

modkit summary results validation. #2

Closed faulk-lab closed 1 year ago

faulk-lab commented 1 year ago

I am comparing modkit summary against the modkit pileup bed file to confirm whether the percentages match up, and it appears something is off. I basecalled a fast5 from mouse brain DNA with dorado using a 5mC_5hmC model, so I expect ample quantities of both modifications. I used modkit pileup with the --cpg flag and received a bedmethyl in the format described in the modkit documentation.

Modkit summary reveals the following:

./dist/modkit summary calls.modmapped.bam

mod_bases                    C
count_reads_C                4081393
C_calls_modified_m           22817251
C_frac_modified_m            0.4827792093865955
C_filtered_modified_m        0
C_calls_unmodified           17031731
C_frac_unmodified            0.36036618200260717
C_filtered_unmodified        0
C_calls_modified_h           7413308
C_frac_modified_h            0.1568546086107973
C_filtered_modified_h        0
C_total_mod_calls            47262290
C_total_filtered_mod_calls   0
total_reads_used             4081393

However, when I run awk against the resulting .bed file I see the following:

awk '$4=="m" {can+=$13; mod+=$12; oth+=$14; valid+=$10} END{print (can/valid) " CpG canonical\n" (mod/valid) " 5mCpG modified\n" (oth/valid) " 5hmCpG modified"}' calls.modmapped.modkit.bed

0.442508 CpG canonical
0.386512 5mCpG modified
0.17098 5hmCpG modified

I don't expect the percentages to match up perfectly because of differences in the mapping in the .bam file and in the resulting .bed file of course, but they should be close. The 5hmC percentage is probably correct, but the canonical C and modified C appear to be switched. Do I have that right? According to the Readme.md, column 13 is canonical, and column 12 is modified (in this case, column 14 "other" is 5hmC count). Is it possible that 'modkit summary' is parsing columns 12 and 13 as swapped?
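
For clarity, here is the same awk spelled out with the column meanings as I read them from the modkit documentation (equivalent to the one-liner above; $10 = N_valid_cov, $12 = N_mod, $13 = N_canonical, $14 = N_other_mod):

awk '$4 == "m" {      # only the 5mC rows; 5hmC shows up as "other" on these rows
    valid += $10      # N_valid_cov: filter-passing calls at this position
    mod   += $12      # N_mod: 5mC calls
    can   += $13      # N_canonical: unmodified C calls
    oth   += $14      # N_other_mod: 5hmC calls
} END {
    print can/valid, "CpG canonical"
    print mod/valid, "5mCpG modified"
    print oth/valid, "5hmCpG modified"
}' calls.modmapped.modkit.bed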

ArtRand commented 1 year ago

Hello @faulk-lab,

One quick point of clarification that you may already be aware of: modkit summary doesn't compute statistics on the bedMethyl, it computes statistics on the read-level base modification calls. In general, however, your intuition is correct that there is a correspondence between the two. The --cpg flag used in modkit pileup restricts the modification calls to only those aligned to a reference CpG, whereas modkit summary uses all of the base modification calls to compute the statistics shown.

Did you use the default filtering settings when running modkit pileup? From the output you posted, it looks like you used modkit summary with the default settings, which don't perform any filtering. When modkit pileup runs, the confidence threshold it estimates is logged; you could try running modkit summary with the --filter-threshold option set to the same value used when making the pileup. It's possible that the lower-confidence calls are biased towards 5mC. More than happy to continue to help troubleshoot.
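
For example (the specific threshold is whatever your pileup run logged; the value below is taken from later in this thread):

./dist/modkit summary calls.modmapped.bam --filter-threshold 0.7128906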

faulk-lab commented 1 year ago

Hi ArtRand,

Yes, the two methods of determining "global" values are measuring somewhat different things in the bam vs. the bed, but as you say, they should be close to each other. Your suggestion was very helpful in two ways. 1) I re-ran modkit pileup without the --cpg and --ref flags to get a bedmethyl of all modified C's in the bam (which should realistically still be CpG sites, since the dorado basecaller's model only calls 5mC_5hmC in CpG context, which is why I didn't think it mattered before). Apparently it does indeed matter. Now when I awk-summarize the .bed produced without the --cpg flag I get:

./dist/modkit pileup -t 32 calls.modmapped.bam calls.modmapped.modkit.bed
awk '$4=="m" {can+=$13; mod+=$12; oth+=$14; valid+=$10} END{print (can/valid) " CpG canonical\n" (mod/valid) " 5mCpG modified\n" (oth/valid) " 5hmCpG modified"}' calls.modmapped.modkit.bed

0.357482 CpG canonical
0.465608 5mCpG modified
0.17691 5hmCpG modified

Compared to:

./dist/modkit summary calls.modmapped.aln.bam -t 32 --filter-threshold 0.7128906

mod_bases                    C
count_reads_C                4032558
C_calls_modified_m           20592271
C_frac_modified_m            0.4818301395620696
C_filtered_modified_m        2074299
C_calls_unmodified           16296287
C_frac_unmodified            0.381310164359897
C_filtered_unmodified        618311
C_calls_modified_h           5849057
C_frac_modified_h            0.13685969607803336
C_filtered_modified_h        1521848
C_total_mod_calls            42737615
C_total_filtered_mod_calls   4214458
total_reads_used             4032558

It's pretty close:

35.7% vs. 38.1% unmodified
46.5% vs. 48.1% 5mC modified
17.6% vs. 13.6% 5hmC modified

Of course, the numbers I really want are the global percentages of the mapped reads. So I plan to run modkit pileup --cpg --ref and run my awk script on the .bed output of that.

Do you agree that those global percentages would be reliable?

(P.S. The second way you were helpful: I used --filter-threshold 0.7128906 with modkit summary this time because I saw that's the value modkit pileup chose for itself. I don't have an opinion on thresholds, so I'll just go with what modkit chooses.)

faulk-lab commented 1 year ago

Now for a second weird validation question: modkit vs. modbam2bed. I ran the same bam file as above through modbam2bed against the same reference genome and wrote an awk script to summarize each modification type. Here's the awk script for 5mC (I modified it to also give me canonical and 5hmC):

awk '$5>0 {can+=$12; mc+=$13; hmc+=$16} END{print 100*(mc/(can+mc+hmc))}'

In theory, it should give me the same percentages as the awk script run on the bedmethyl from modkit pileup --cpg --ref, just by adding up different columns, since they report the same counts. Yet here's what I see.

modkit pileup --cpg --ref:
0.442508 CpG canonical
0.386512 5mCpG modified
0.17098 5hmCpG modified

modbam2bed:

modbam2bed -m 5mC -t 32 -e -p calls.modmapped --aggregate --cpg ~/Desktop/genomes/mouse/mm39.fa -f 0.7128906 calls.modmapped.bam > calls.modmapped.bed

Summarized with:

awk '$5>0 {can+=$12; mc+=$13; hmc+=$16} END{print 100*(can/(can+mc+hmc))}' calls.modmapped.cpg.acc.bed

30.6254 CpG canonical
54.3297 5mCpG
15.045 5hmCpG

Now those are pretty far apart. Do you have any thoughts?

ArtRand commented 1 year ago

One tiny point: running pileup with --cpg --ref doesn't aggregate across strands; to do that you need to add the --combine-strands flag. It shouldn't matter the way you're doing the arithmetic as long as you only use the m lines, but it's something to keep in mind.
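
For reference, the strand-combining invocation would look something like this (assuming the same file and reference names used elsewhere in this thread):

./dist/modkit pileup -t 32 --cpg --ref ~/Desktop/genomes/mouse/mm39.fa --combine-strands calls.modmapped.bam calls.modmapped.modkit.bed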

To answer your first question, I'd imagine that the calculation you're performing will give a global estimate of the levels of 5mC and 5hmC. As you can imagine, when you start looking for changes it gets a little more complicated (but that's a discussion for another time).

Regarding comparison to modbam2bed, a few things to check.

  1. Modkit filters/removes calls that are <= threshold, whereas modbam2bed uses < threshold (keeping calls equal to the threshold value). In general, I don't recommend lowering the threshold. Given that the probabilities are discretized into 256 bins (ML values 0-255) when making the modBAM, you could instead increase the threshold value given to modbam2bed to 0.7167969 (see the quick bin arithmetic after this list).

  2. Run modkit with the --log-filepath option, and take a look at what gets logged. If your reads have the . mode, you need to force modkit to use those reads with the --force-allow-implicit option. The CpG base modification models should be emitting MM tags with the ? mode, but if you've merged in old data this could happen.

  3. Could you print out the actual counts for a sanity check?
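
On point 1, a quick sanity check of those two threshold values (assuming ML probabilities map to bin midpoints (N + 0.5)/256 for N in 0..255, per the SAM tag spec):

awk 'BEGIN { printf "%.7f  %.7f\n", (182 + 0.5) / 256, (183 + 0.5) / 256 }'
# prints 0.7128906  0.7167969, i.e. the suggested modbam2bed threshold is one ML bin above the pileup threshold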

faulk-lab commented 1 year ago

Thanks for the confirmation of my awk calculation of the global 5mC/5hmC values. That's an important number in some studies. I had also been including the --combine-strands flag; I just omitted it above for brevity.

  1. I re-ran modbam2bed with 0.7167969 and it didn't make much of a difference: 30.6666 canonical, 54.3143 5mC, 15.0191 5hmC.

  2. I ran modkit with the --log-filepath option and the log looked fine. No . mode or old merged data, only MM and ML tags.

  3. The counts are interesting.

modbam2bed:
site_count   valid_site_count   percent
10,010,083   32,641,637         30.6666   canonical
17,729,077   32,641,637         54.3143   5mC
4,902,477    32,641,637         15.0191   5hmC
Total lines = 13,570,243

modkit:
site_count   valid_site_count   fraction
8,181,072    18,487,960         0.442508   canonical
7,145,819    18,487,960         0.386512   5mC
3,161,069    18,487,960         0.17098    5hmC
Total lines = 7,865,946 ("m" only lines)

Mystery solved? Some questions and observations:

  1. There are nearly twice as many lines from modbam2bed as from modkit's output. There are nearly twice as many valid sites from the modbam2bed output as from modkit.
  2. For every line that they share in common, the read counts are exactly the same. That is reassuring.
  3. Output shows that modbam2bed is recording more sites than modkit. Example at bottom. This is a low-pass experiment so the coverage is expected to be low.
  4. Why is this happening? Is modkit more stringent in some parameter that I am missing?
  5. Which file to trust, modbam2bed or modkit's bedmethyl?

My commands to generate and assess the two bedmethyls were as follows:

modbam2bed -m 5mC -t 32 -e -p calls.modmapped --aggregate --cpg ~/Desktop/genomes/mouse/mm39.fa -f 0.7128906 calls.modmapped.bam > calls.modmapped.bed
awk '$5>0 {can+=$12; mc+=$13; hmc+=$16} END{print can "\t" (can+mc+hmc) "\t" 100*(can/(can+mc+hmc))}' calls.modmapped.cpg.acc.bed

./dist/modkit pileup -t 32 --cpg --ref ~/Desktop/genomes/mouse/mm39.fa --combine-strands --log-filepath logmodkit.txt calls.modmapped.bam calls.modmapped.modkit.bed
awk '$4=="m" {can+=$13; mod+=$12; oth+=$14; valid+=$10} END{print oth "\t" valid "\t" (can/valid) " CpG canonical\n" (mod/valid) " 5mCpG modified\n" (oth/valid) " 5hmCpG modified"}' calls.modmapped.modkit.bed

Modbam2bed output:
chr1 3050194 3050196 5mC 750 . 3050194 3050196 0,0,0 4 100.00 0 3 1 0 0
chr1 3050222 3050224 5mC 750 . 3050222 3050224 0,0,0 4 66.67 0 2 1 0 1
chr1 3050261 3050263 5mC 1000 . 3050261 3050263 0,0,0 3 33.33 2 1 0 0 0
chr1 3050330 3050332 5mC 1000 . 3050330 3050332 0,0,0 3 33.33 2 1 0 0 0
chr1 3050402 3050404 5mC 500 . 3050402 3050404 0,0,0 2 100.00 0 1 1 0 0
chr1 3050555 3050557 5mC 1000 . 3050555 3050557 0,0,0 2 100.00 0 2 0 0 0
chr1 3050665 3050667 5mC 1000 . 3050665 3050667 0,0,0 2 50.00 1 1 0 0 0
chr1 3051030 3051032 5mC 1000 . 3051030 3051032 0,0,0 1 100.00 0 1 0 0 0

Modkit output (note the skip from 3050261 directly to 3051030):
chr1 3050194 3050195 m 3 . 3050194 3050195 255,0,0 3 100.00 3 0 0 0 1 0 0
chr1 3050222 3050223 m 3 . 3050222 3050223 255,0,0 3 66.67 2 0 1 0 1 0 0
chr1 3050261 3050262 m 3 . 3050261 3050262 255,0,0 3 33.33 1 2 0 0 0 0 0
chr1 3051030 3051031 m 1 . 3051030 3051031 255,0,0 1 100.00 1 0 0 0 0 0 0
chr1 3051199 3051200 m 1 . 3051199 3051200 255,0,0 1 0.00 0 1 0 0 0 0 0

faulk-lab commented 1 year ago

Take a look at chr1 3050330 3050332. I confirmed a CpG exists at that position in mm39. I tried modbam2bed with a very stringent threshold (0.98), and while it doesn't call mods at most positions, it still lists all the positions, including 3050330. I tried running modkit with --no-filtering, and while it warns me, it still does not report position 3050330.
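
(For anyone else checking: a quick way to look at the reference bases, and their masking case, at that position, assuming samtools and the same mm39.fa used above; BED starts are 0-based, so add 1 to the start coordinate.)

samtools faidx ~/Desktop/genomes/mouse/mm39.fa chr1:3050331-3050332
# lowercase "cg" in the output would mean the dinucleotide is soft-masked in the reference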

What is going on? I cannot see any reason that modkit doesn't like position chr1 3050330, or many others, while modbam2bed is perfectly happy with them.

ArtRand commented 1 year ago

It's not the thresholds; there's a bug in the way that modkit is finding the CGs: it's missing the masked ones (cg in the reference, 🤦). It's a trivial fix, and I can give you a patch build in an hour or so. Alternatively, you could run modkit without the --cpg flag, so it doesn't perform any filtering on motifs. You'll get rows that are not aligned to CGs, however these can usually be identified because N_diff is high and the score is low. Sorry about the hassle.
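
(If you run without --cpg, here is one rough way to drop rows that are mostly reference mismatches; a heuristic sketch, not modkit's logic, assuming N_valid_cov is column 10 and N_diff is column 17 of the bedMethyl:)

awk '$4 == "m" && ($10 + $17) > 0 && $17 / ($10 + $17) < 0.5' calls.modmapped.modkit.bed > mostly_cg_rows.bed
# keeps positions where fewer than half of the covering reads carry a different (non-C) aligned base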

faulk-lab commented 1 year ago

Well, that makes sense. Hell, it could even be a feature if it were switchable! (--all --masked --unmasked). The results make a lot of sense with that info. We are seeing 54% 5mC across ALL sites from modbam2bed, and just 38% across UNmasked sites because of the modkit bug.

Methylation is generally higher in repeats, which modkit is excluding, so it makes sense that modkit's total is lower. Plus, the 5hmC results are totally new to science: 15% across all sites vs. 17% in unmasked regions, so there is almost no bias in 5hmC in transposons. There you go, you accidentally discovered a neat fact :)

Looking forward to the fix. Also, modkit appears to run about 2x faster than modbam2bed; I assume that's because it's written in Rust. Very nice.

cjw85 commented 1 year ago

@ArtRand I had this difference in my mental notes and forgot to point it out in review. I could have saved you both the goose chase!

modbam2bed has the --mask option to respect soft masking in reference files.

ArtRand commented 1 year ago

Ok, I've put out a new release, v0.1.3. Using the motif-bed subcommand, I tested that the missing sites on chr1 are now recognized. I also added a --mask flag that will reproduce the old behavior and ignore masked sites. I like your suggestion, so I may borrow it for the next release, but I saw it a little too late for this sprint effort. The speed difference you're seeing could be due in part to modkit doing less work because of the missing locations, but now you'll be able to see. TBH I have not done any optimization work yet. Really excited to see results from the 5hmC model!
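
If you want to double-check which CG sites modkit will see, something like the following should work (a sketch; check modkit motif-bed --help for the exact arguments):

./dist/modkit motif-bed ~/Desktop/genomes/mouse/mm39.fa CG 0 > mm39_cg.bed
grep -w 3050330 mm39_cg.bed   # the previously missing chr1 site should now show up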

faulk-lab commented 1 year ago

I've tested v0.1.3 and things are improving. With default filtering and threshold settings the total site counts are much closer now, with modkit still not reporting quite as many as modbam2bed. The percentages are closer too, probably still differing because of the roughly 1.3 million missing lines.

modbam2bed (defaults):
30.1295 CpG canonical
54.4692 5mCpG modified
15.4013 5hmCpG modified
Total lines = 13,570,243

modkit (defaults):
0.358723 CpG canonical
0.466281 5mCpG modified
0.174996 5hmCpG modified
Total lines = 12,307,062

So here's the philosophical question. Do I trust modbam2bed's calculations, assuming a static default threshold of 0.66, or should I prefer modkit's threshold because it is dynamically produced from the actual bam file samples? I am inclined to trust modkit more.

I'd like your opinions.

ArtRand commented 1 year ago

Part of the motivation for modkit was to implement filtering out the lowest confidence base modification calls based on the data in the BAM records. This is the way we perform filtering when reporting metrics. The modbam2bed default threshold is very low, and modkit will even give a warning with 0.66. In our experience the threshold is important to get the best performance. So to answer your question, I would trust modkit's calculations of the %5mC, %5hmC at the sites produced in the bedMethyl based on the default settings, dropping the 10% lowest confidence calls.
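
Purely to illustrate the idea (this is not modkit's actual code): given a hypothetical file call_probs.txt with one modification-call confidence per line, the default behavior corresponds roughly to estimating the 10th percentile and filtering calls below it.

n=$(wc -l < call_probs.txt)   # total number of calls (hypothetical input file)
sort -g call_probs.txt | awk -v n="$n" 'NR == int(0.10 * n) + 1 { print "estimated pass threshold:", $1; exit }'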

The difference in total lines is a little concerning, however. Modkit will not produce lines when there aren't any base modification counts (which could happen if all the calls were filtered or there was 0 coverage). This is a slight shift in stance from modbam2bed; the focus is on the base modification counts and ratios. You could try running modkit with --no-filtering or --filter-threshold 0.65 to determine if this is the case. If it's not, I need to look more closely into what's going on.
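
For example (one way to check, reusing the pileup command from earlier in this thread with a hypothetical output name):

./dist/modkit pileup -t 32 --cpg --ref ~/Desktop/genomes/mouse/mm39.fa --combine-strands --no-filtering calls.modmapped.bam calls.modmapped.nofilter.bed
wc -l calls.modmapped.nofilter.bed calls.modmapped.modkit.bed   # compare line counts with and without filtering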

faulk-lab commented 1 year ago

That's what I suspected. Thanks for the explanation and I believe I will use modkit going forward.

I ran with --no-filtering and get 13,036,854 lines, just 533,389 short of the modbam2bed total. So that helps. I suspect the remaining difference has to do with this being low-pass data, so many positions may have 0 reads. I think we are on solid ground now though.

I'll close the thread unless you want me to do any more checks and tests. Thanks a million for your help!

ArtRand commented 1 year ago

You bet, more than happy to help, and thanks for the suggestion about masking. Research like yours is one of the reasons I like working on this stuff!