Open Ge0rges opened 1 week ago
Hello @Ge0rges,
Could you send me an IGV shot of "Region 2" (contig_60201:3680-3682
)? Looks like you're getting 4mC calls there, it's possible that you have A->C transversions there, but you shouldn't get 4mC calls on adenine bases. ~Is it possible that this bedMethyl is a concatenation of more than a single run?~ Edit: re-read the issue and you mentioned that this is a concatenation. So multiple entries for the same position is expected. But having such consistent A->C is unusual. IGV should help figure it out.
Maybe email me a modBAM of the reads that align to that region so I can check the tags (modkit summary ${bam} --region contig_60201:3680-3682
would also tell you).
Oh, great idea to use IGV. I kept worrying my code to look at the sequence might be wrong. Here's a screenshot.
I'm regenerating the modems (I usually keep just the pileup) and will get back to you with the summary (and/or can email it to you if more helpful).
Edit: I will note that IGV shows the first nucleotide on position 1.
Here's the summary. Ranmodkit summary ${bam} --region contig_60201:3680-3682
on all my modbams.
summary.txt
@Ge0rges
Thanks. The summaries look fine, you're not getting 4mC on A bases in the tags. My current hypothesis is that you have a lot of reads with cytosines mapping to a reference adenine and bringing along the 4mC call.
For the IGV shot, could you add the modBAM of alignments?
Here it is. Looks like perhaps this is more of an issue with my assembly than modkit? IGV.pdf
@Ge0rges
Looks like there is a pretty consistent T->G (reverse mapped A->C) transversion there. You have a couple more in this screen shot. That's actually pretty interesting. Have you tried polishing the assembly with medaka's new --bacteria
option? My guess is there's a MTase in there producing a consistent sequencing error. Try coloring them by strand, a strand-biased error is usually a strong indicator.
I have not tried the new polisher. Not against trying it. I can't really tell if the error is strand-biased, maybe a little? IGV.pdf
Do you have any understanding of how the polishing performs in these methylated cases? I could run this one through the polisher as a test.
Yeah, I agree, it's not completely strand-biased but that doesn't mean it's not a methylation-induced error. The new medaka model is designed to fix cases where methylation causes errors, it's pretty quick - so it might be worth a shot. Plus you could look back at this position and see if the assembly changes.
Alright, I'm running it. So if I'm following you, the modkit output assuming this reference is correct?
Correct. You'll have to re-align the reads to the polished assembly (I think medaka might do this for you). Then use those alignments in modkit. But first I'd just look at this region on IGV and see if the assembly has changed.
So modkit doesn't strictly require the reference to have a compatible base then? I'm trying to understand how to interpret this output while I wait for the medaka result.
So modkit doesn't strictly require the reference to have a compatible base then?
As you've seen modkit pileup
(default) reports all genomic positions with a modified base passing the threshold, even if the reference base is not compatible with the modification (i.e when there is a mismatch). A couple things could happen, the polished assembly may have a compatible base (e.g. a C) where there was previously an A>C mismatch. The polished assembly may change enough in the local region such that the reads align differently as well. Although in the screen shot you've sent - this doesn't seem overly likely.
Got it. So if I had continue this analysis with the current assembly, a simplification would be to pass a filter over the pileup to enforce the modification/base to match. I'm interested to see how Medaka performs! It might take 48h still as I don't have a GPU to run it on. Will keep you posted here.
Sounds good. I've considered adding an argument to pileup
that would require a read base to match the reference. I've been reluctant because there could be some sharp edges. You may never have discovered this otherwise... Maybe adding a column that indicates when the reported modification is a mismatch.
Yeah I definitely don't think it should be a default. That column might be a good in-between. For the longest time I was doing analyses that weren't sequence aware, which is why I only recently found this (I started looking at methylation around start codons).
Hi @ArtRand,
Let me preface this by saying this may be a mistake on my end. However after checking, I have a suspicion that modkit is setting an incompatible methylation call/nucleotide pairing.
Here's my example.
Consider the following (concatenated, filtered) bed file of a certain example regions: Region 1:
Region 2:
Here's the sequence of this contig:
So, the region is on the negative strand, so I first slice the sequence to
seq[3259: 3262]
obtainTCT
, whose complement is thenAGA
for region 1. Region 2 I getAT
.