igvteam / igv

Integrative Genomics Viewer. Fast, efficient, scalable visualization tool for genomics data and annotations
https://igv.org
MIT License
647 stars 386 forks source link

5mC aggregated visualization with low C counts #1185

Open marcus1487 opened 2 years ago

marcus1487 commented 2 years ago

The new work on the 5mC-specific visualization mode is really great and makes the modified and canonical calls much more apparent, but I have found a weird behavior which I think may need addressing.

In the attached screenshot you can see that at a non-C reference site the methylation visualization is a bit misleading. At this reference A site, there are 26 C basecalls and 20 of those have been called 5mC (m). This makes up <1% of the total 4002 reads covering this location, yet the aggregation makes it appear as though a large majority of calls at this site are 5mC.

I'm not sure if this computation was completely spec'ed out in the last issue (https://github.com/igvteam/igv/issues/1133), but it seems that the height of the C and 5mC coloring in the aggregation bar should only represent the portion of all reads at that site. In this example, I would expect that the blue and red bars would consume <1% of the total height together, but clearly consume much more of the coverage bar (it's not clear to me why they don't consume 100% of the bar; maybe indels?). If the current logic is expected (I know this topic was a bit secondary in the last thread) I would suggest a discussion about the relative merit of the current implementation versus the alternative proposed here.

I am happy to share this data privately in order to help debug this issue, but would prefer not to post this particular example publicly. Again great work on this feature!

image
jrobinso commented 2 years ago

If you can share a small bam for this that would be helpful. You can send a private link to igv-team@broadinstitute.org.

I'm curious, how is it you have a call at a non-C reference site? This visualization is not designed for that case, clearly. I think the answer here is, as you suggest, make the height of Red+Blue in total to be proportional to the number of the base in question ("C" in this case, or "G" on - strand).

We might need to add some clarification on what this visualization is designed for, its not the general case we envision. BTW do we have enough data yet to start on the general case?

marcus1487 commented 2 years ago

I've sent the example bam file via email.

The 5mC call at a non-C reference site is due to a canonical basecall error (or SNP) where the basecalled C is aligned to a non-C in the reference. I agree that some clarification of the intended use would be good, but I think we are on the same page that showing a high level of 5mC in the aggregation bar at a non-C reference site where <1% of basecalls are C is not ideal no matter the intended use of the coloring mode.

I will try to get the general 5hmC data sent (or it may be provided from the related #1186 .

jrobinso commented 2 years ago

@marcus1487 I see the issue, its intuitively obvious. but its the consequence of a decision deliberately taken. In the previous version the height of the colored bar was propotional to

(modification call count) / (total count)

This resulted in bars ~ 1/2 height on average because negative strand reads would not, of course, have a modifiable base (e.g. G instead of C). The fix was to change the height to be proportional to

(modification call count) / (modifiable count)

In this example case there are very few modifications, but also there are very few modifiable bases, so the ratio is close to 1.

Its not immediately obvious what to do here, that is what sort of ratio should be applied to yield the intuitively right thing.

BTW I'm on vacation for ~10 days, returning Aug 24. Will pick this up, and start looking at the more general case if I get data other than 5mC when I return.

jrobinso commented 2 years ago

I think for this specific case we might have to take account of the reference base, which won't generalize well to the generic case.

jrobinso commented 2 years ago

Here's another case to consider, same dataset. This one has a "G" reference base, however there is some extreme strand bias, 2277 "Gs" on the + strand, 173 on the -. There are also 5 "+" C calls, so only 178 modifiable (for 5mC) bases out of 2,485 total. 98 of the 178 have a call. So intuitively you might expect a small bar, but I don't see any rule to apply that wouldn't negate the changes for the "normal" case, again made specifically so the bars appear full height.

Screen Shot 2022-08-12 at 10 11 57 PM

Its possible this (5mC mode) isn't the right visualization for datasets such as this. I'm slightly more hopeful we can do something reasonable for the case you raised than I am for the strand bias case.