modified_bases.5mC.bed gets very different results than running modbam2bed on mod_mappings.bam

benbfly commented 2 years ago

Megaldon/2.4.2 modbam2bed/0.4.0 (https://github.com/epi2me-labs/modbam2bed) samtools/1.14

I am running Megalodon in CG remora mode ("--remora-modified-bases dna_r9.4.1_e8 hac 0.0.0 5mc CG 0 --guppy-config dna_r9.4.1_450bps_hac.cfg").

I wanted to compare the results in the modified_bases.5mC.bed file, with the results of running the ONT modbam2bed utility on the mod_mappings.bam file. In theory, I thought that they should be very similar or identical. However, in my testing so far, they produce very different results.

I did a small test on 140,000 reads, and the following is the number of CpGs in the output: CpGs covered by modbam2bed and modified_based: 2,255 CpGs covered by modbam2bed only: 1,033 CpGs covered by modified_bases only: 1,799

The bases in only one file were not on the opposite strand or off by 1 position or anything. There were not close to a CpG covered in the other file. I get similar results when doing much larger datasets. The only processing I had to do was a "samtools sort" and "samtools index" on the mod_mappings.bam file in order for it to work with modbam2bed. Also, we are sequencing very short reads (~167bp , cfDNA), which is probably not the typical use case.

This seems like strange behavior. I will look into the differences and see if I can figure out what's going on, but I thought I'd ask the experts. @marcus1487 @cjw85 .

I could always be doing something wrong, but I've checked pretty carefully and I don't think so.

benbfly commented 2 years ago

I just read issue #246 and realized this is likely because the mod_mappings.bam output is reference-anchored and the modified_bases.bed file is basecall-anchored.

From my understanding of your comments in issue #246 , the reference-anchored generally works better? In our data (we are using R9.4 / lsk109), the reference-anchored output generally covers less CpGs (as shown above). But if it's more accurate, it's still preferable.

I'll let you confirm before closing issue.

Thanks! Ben.

benbfly commented 2 years ago

I think it would be a good idea to either discontinue or put a big warning for the creation of the basecall-anchored modified_bases.5mC.bed file. It is the only simple bed file that can be output directly from the megalodon command line, and people (like me) will tend to just grab it assuming it is the right output without understanding that it's inferior. I know this may eventually be rolled into guppy, but in the meantime people will be using megalodon.

thanks for the amazing tools!!!!

marcus1487 commented 2 years ago

The modified_bases.5mC.bed file is reference anchored and therefore should be this highest quality. These files should not be disagreeing nearly this much and I've not seen this level of disagreement before. I am working on reproducing this behavior internally and will report back.

benbfly commented 2 years ago

I just noticed the discussion in #206 which noted that the bed methyl file in megalodon uses default cutoffs of 0.2 and 0.8. I believe modmap2bed uses defaults of 0.333 and 0.667. I don't think this would give the large discrepancy that we are seeing, but I will check that.

benbfly commented 2 years ago

Just to follow up on this. It is not due to default cutoffs for canonical vs. modified. modbam2bed outputs every CG covered by a read, regardless of modification status. So it has to do with what each of the two programs think is a reference CG to output. I gave one example of a covered reference CG output by modbam2bed but not output by modified_bases.5mC.bed (chr1:596372), in issue #249. For some reason it does not have an Mm/Ml tag for this position.

I am still investigating the converse case (output in modified_bases.5mC.bed but not modbam2bed).

marcus1487 commented 2 years ago

The second case is certainly more confusing to me as well. A browser shot of the mappings.bam, mod_mappings.bam at one of these sites would be very helpful.

benbfly commented 2 years ago

Here is an example of the second case (output in modified_bases.5mC.bed but not modbam2bed).

This read from chr16:19113482-19113641 has multiple CpGs in modified_bases.5mC.bed, but not a single one in the output of modbam2bed. One of the CpGs for instance is at position 19113602.

Since this includes multiple CpGs, my best guess is that modbam2bed uses a blacklist or something, and Megalodon bed generator does not.

cjw85 commented 2 years ago

@benbfly Are you running modbam2bed with the --cpg option? This will only output lines to the bed where the reference is sequence is CG (or GC to take into account the second strand). Since the sequences above from the BAM represent read sequences, they will not necessarily lead to a line in the BED file if the reference sequence is not also CG at that point. If you are not running with --cpg then the BED will contain a line for every single reference position (regardless of reference base).

benbfly commented 2 years ago

I am running with --cpg option. But these are definitely reference CGs in the genome I am using (hg38). I believe the Megalodon modification BAM actually shows the reference sequence rather than the read sequence. But in the screenshot I grab it from a fresh version of hg38.

I don't think there's any read filtering that could cause this, since I think Megalodon sets the mapping quality to 40 for all reads and does not set any SAM flags (as you can see above). So my only thought is a blacklist of some kind, or a bug. I see a large number like this in my data.

One thing that might be different from my data than your test datasets. We are doing cfDNA so these reads are very short. Maybe you have a read length filter or something?

cjw85 commented 2 years ago

Can you provide an extra of the BAM file you are processing with modbam2bed and I will take a look.

benbfly commented 2 years ago

Attached... modbam2bed_berman.tar.gz

cjw85 commented 2 years ago

I cannot repoduce the lack of reporting sites:

$ ./modbam2bed -a 0.333 -b 0.667 ~/data/GCA_000001405.15_GRCh38_no_alt_analysis_set.fa.gz --cpg --region=chr16:19113400-19113700 -e modbam2bed_berman/mod_mappings.sorted.bam
Analysing: 5-methylcytosine (5mC, C>m)
Processing: chr16:19113399-19113700
chr16   19113481    19113482    5mC 1000    +   19113481    19113482    0,0,0   1   100.00  0   1   0
chr16   19113493    19113494    5mC 1000    +   19113493    19113494    0,0,0   1   100.00  0   1   0
chr16   19113545    19113546    5mC 1000    +   19113545    19113546    0,0,0   1   100.00  0   1   0
chr16   19113602    19113603    5mC 1000    +   19113602    19113603    0,0,0   1   100.00  0   1   0
chr16   19113625    19113626    5mC 1000    +   19113625    19113626    0,0,0   1   0.00    1   0   0
100 %
Total time: 0.421185s

IGV showing methylated bases

I note in your screenshot above that you have a warning that the index file is older than the fasta file. Is it the correct index corresponding to the fasta?

cjw85 commented 2 years ago

And the one site that isn't methylated:

benbfly commented 2 years ago

Ok I don't have time now, but I'll look at my reference genome. We use it for everything , but maybe it's corrupted somehow.

benbfly commented 2 years ago

could it have to do with lower case vs. upper case letters in reference fasta (repeatmasker)? See my fasta pull above, all the cgs are lower case. I'm not at my computer anymore, but I could look at this later.

I am using v0.4.0 of modbam2bed

cjw85 commented 2 years ago

Ah yes, I believe you are correct. The check for a CpG site is simply to check the reference character is 'C' (or 'G' for reverse strand).

I don't know what I would call the "correct" behaviour here as I believe the intention of writing lowercase bases to a reference is so programs can easily perform this sort of filtering. In your case I suppose you haven't masked the reference for the purpose of masking methylation calls. Pragmatically I will change the default to report bother upper and lowercase reference sites, with an option to ignore lowercase.

benbfly commented 2 years ago

we just use the fasta created by UCSC , and they do lower case for repeat masked regions. We certainly don't want to filter those out, and I think very few people would want to filter them out of the bed file (especially with long reads where repeats are even less of an issue).

I think this accounts for all of the discrepancy I was seeing with the megalodon created bed file (with the --edge-buffer and --mod-min-prob from issue #249 accounting for the ones in the other direction)

if you make a new release , I can compare the two on my dataset to make sure they agree.

here is where you can download the UCSC fasta for testing, if you want: https://hgdownload.soe.ucsc.edu/goldenPath/hg38/bigZips/

cjw85 commented 2 years ago

I've implemented allow reference bases to be lowercase in v0.4.4 of modbam2bed. This is available now.

benbfly commented 2 years ago

@cjw85 I downloaded v0.4.4 and it did not fix this problem. I'm checking into it now , maybe I didn't build it correctly. But please keep this issue open until I can check. Thanks - Ben.

cjw85 commented 2 years ago

I'll have a look

cjw85 commented 2 years ago

Apoligies @benbfly, I see the issue.

cjw85 commented 2 years ago

@benbfly Could you try v0.4.5?

benbfly commented 2 years ago

v0.4.5 works!

Now, every CpG covered by modified_bases.5mC.bed is covered by modbam2bed -a 0.2 -b 0.8. So the upper case lower case problem is fixed.

However, there are still a lot of CpGs that are covered by modbam2bed -a 0.2 -b 0.8 that not in modified_bases.5mC.bed , even when I run megalodon with --edge-buffer 0 --mod-min-prob 0. So maybe there are still some filters that megalodon bed is filtering out, that are not being filtered out by modbam2bed. I'm not sure yet if these have Mm/Ml tags, I will check and report back @marcus1487 . So I think keep this issue open until I can do that.

benbfly commented 2 years ago

I see what this difference in behavior is - it is something I noticed before but forgot. It's the fact that modbam2bed outputs the CpGs even if they have 0 reads with "confident" modification calls (in this case modification prob<0.2 or >0.8). In the nomenclature of modbam2bed, these are CpGs with 1 or more "filtered" reads, but no "canonical" or "modified" reads.

Megalodon modified_bases.5mC.bed does not output these at all, which to me is the more reasonable behavior.

Modbam2bed outputs these with percent modified value of 0 (column 11) . This is not good behavior in my opinion, and I would consider it a bug for modbam2bed. The modbam2bed formula for percent modified is Nmod / (Nmod + Ncanon), and in this case the denominator is 0 so it should be undefined. Outputting these as 0 percent modified is incorrect and can lead to inaccurate analyses. It would be better for these to be omitted or with percent modified set to "." (the bed file convention for NA).

I do see that the ones I'm talking about could be easily removed by the user by filtering on column 5==0. But I just think many people will just use the default behavior and treat these as 0% modification.

I think this accounts for all the differences between modified_bases.5mC.bed and modbam2bed (except for the missing Mm/Ml tag issue from #249 ). Probably this was not a big deal with some of your test data, since it usually only makes a difference when you have a single low quality read covering the CpG. If you have multiple reads covering, it usually won't make a difference. We are doing very low depth sequencing (0.3x), so most of our CpGs are only covered by a single read. Thus we get a very large number of these CpGs where Nmod+Ncanon is 0.

cjw85 commented 2 years ago

You are correct, modbam2bed calculates column 11 as (counts.c#L160):

meth = tot == 0 ? 0 : (100.0f * md) / tot;

so the description is not completely accurate. In such cases the column 5 will also be zero representing complete confusion in the value in column 11; in that sense the reported numbers are self-consistent, and the value of column 5 must be taken into account in the evaluation of column 11.

That the interpretation of column 11 is contigent on column 5 is the case in other circumstances also, its just that such facts are often ignored. For example consider the case that 90% of reads contain a substitution to a C>T, is it really correct to conclude from a 100% value in column 11 that there is 100% methylation in the sample? No it is not, however much the casual user wants column 11 to mean that. This is why the definitions in columns 5 and 10 were chosen the way they were; the "specification" referenced in the modbam2bed documentation makes a false dichotomy between modified and unmodified. The modbam2bed avoids making this erroneous assumption, which makes the output messier but rather more truthful. It is also why the extended mode reporting the verbatim counts was added so users can derived whatever summary statistics they wish from the counts (one thing that the extended output misses is that only the sum N_mod + N_del can be derived not the individual components).

Note also there's some connection here to the earlier discussion that within the htslib Mm tag specification where missing data is assume unmodified.

benbfly commented 2 years ago

What would be the downside of outputting . for column 11 rather than 0? One could still look at the extended columns if they wanted to get more information. I don't see a good reason to assume that a single read with a modification probability of 0.5 should be given a modification percentage of 0.

I think when the Mm tag is missing, you just have to follow the current spec - if the alphabet is set to '?' you treat missing Mm as N_filtered, if it is '.' , you treat missing as unmodified.

cjw85 commented 2 years ago

There's little downside to outputting . save for parsers needing to be aware that its possible and what its meaning is. We can look to implement this in a next version. Practically it should make no difference because as highlighted above column 11 should never be interpreted without reference to columns 5 and 10.

colindaven commented 2 years ago

I also support outputting a . instead of a 0 for a missing value, because a missing value or unknown is different to a value of 0, unmethylated. 0 is inherently wrong for positions with no read coverage.

Many users might naively just erroneously evaluate column 11.

I implemented the bedMethyl conversion in a recent pipeline only converting column 11 if by default a coverage of >=5 was reported in column 10.

cjw85 commented 2 years ago

I reread the (recent and admittedly post-hoc) specification for BED files: https://samtools.github.io/hts-specs/

It doesn't acknowledge . to mean a missing or NaN (which is really what we'd be meaning here). It only mentions . in the context of the name and strand fields. Any naive parser is going to trip up over a column containing ostensibly Float type data. I quickly surveyed the bedTools code and it seems to just leave everything stringly-typed. Similar concerns were raised in a comment when the specification was being drafted: https://github.com/samtools/hts-specs/pull/570#issuecomment-857475922

I've opened a issue here https://github.com/samtools/hts-specs/issues/634, to get some clarification. I don't want to make a change here without knowing its going to be the correct one (and have to make another change in the future). There is a sense in which 0 is correct: the bedMethyl description states

Percentage of reads that show methylation at this position in the genome

And it is true that 0% of reads are showing methylation, though admittedly the modbam2bed definition is more precise.

m0nib commented 2 years ago

Hi,

@cjw85 I have run into the same issue as @benbfly. The generated .bed file from modbam2bed contains more CpGs as being unmethylated.

This affects our downstream analysis and prediction of sample type from trained model.

I was wondering, while you wait to make it a permanent change or decision, is there anyway I can rectify this.

I mean by specifying a command line option or other way?

Thank you Best Monib

benbfly commented 2 years ago

I have found that if you specify the meglodon command line parameters --mod-min-prob 0, and set the modbam2bed parameters -a 0.2 -b 0.8, and update to modbam2bed version 0.4.5 or higher, that the results are quite close. By default, modbam2bed uses a lower threshold for methylation (-a 0.333 -b 0.667), so that is part of why it outputs more CpGs. Which of these settings is preferred may be application specific. For our application we have a very small number of reads so we want more CpGs (so we use the 0.333/0.667). But in general, I think I would tend to prefer the more selective 0.2/0.8 setting.

We also use the megalodon parameter --edge-buffer 0, but that's something that is mostly relevant because we are doing short read sequencing (<200bp).

cjw85 commented 2 years ago

I'm waiting for a reply on the hts-specs for BED file as currently there is no guidance on how NaNs should be recorded in a BED file. I'll give this a couple of weeks and then make a decision. Currently my thought is that Column 11 in the modbam2bed output will be recorded as nan where the total = mod. + canon = 0 has occurred. This follows the VCF specification for such things and will allow a natural parsing of the file from code; the suggestion of using . would require user code to contain special handling whereas NaN is supported in various data analysis environments.

As it stands filtering the output file based on a the values of columns 5 and 10 is the recommended approach.

A second option is to provide an option to elide such rows from the output. This will necessarily be lossy, and you will lose information about filtered canon or mod calls (or maybe more interesting substitutions).

benbfly commented 2 years ago

@cjw85 I don't think there will be any clarity on NaNs in BED format (unfortunately, BED is not a real spec). Bedtools, the most popular BED manipulation tool, will not allow numeric operations on any column that contains a non-number including NaN or . (bedtools::isNumeric). So if you want to be as compliant as possible, you would need to remove those rows (or provide a tool to make a "compliant" version). I suggested . because the original BED spec used . for other fields with missing data, but I'm not actually aware of any tool that would parse that as NaN for a numeric field. So I think you're right that NaN would be a better option if that would allow it to be parsed correctly by other packages.

@arq5x is the BED guru and might have an idea about this.

marcus1487 commented 2 years ago

Note that 2.5.0 release sets edge buffer to 0 and min mod prob to 0 to call all covered positions be default and reduce this issue for most users.

michaelmhoffman commented 2 years ago

unfortunately, BED is not a real spec

Ouch! Probably deserved though.

I would consider any custom BED fields to have use that "may be determined by private agreement among cooperating users", as Unicode says. Without any such other agreement or specification, you can't expect that NaN will—or won't—be used in any particular way. I would do what you think is best and document it. And if there's other software you use that applies heuristics to figure out data types (like BEDTools), you can always request that they add NaN to what they consider numeric.

nanoporetech / megalodon

modified_bases.5mC.bed gets very different results than running modbam2bed on mod_mappings.bam #247