Interpretation of base modification likelihood in presence of many no-calls

jrobinso commented 1 year ago

Opening this issue to discuss a possible edge case, but occurring frequently in one of my test datasets.

How would you asses the likelihood of a modification at a position with 2 reads modified with likelihood of 100%, and 8 reads with no calls. This situation can only arise when using the "?" modifier, but it does arise.

Possibilities are (1) very likely since all of our calls are at 100%, (2) not very likely as we have only 2 calls, or (3) unknown because most of our calls are unknown.

In IGV the interpretation will be reflected in the coverage track, although there is not currently a visual representation of "unknown".

jrobinso commented 1 year ago

@ctsa @ @mrvollger @marcus1487 Welcome your thoughts.

jrobinso commented 1 year ago

Complementary question, 2 reads with likelihood 0% and 8 reads with no call. How confident are we that the base is not modified?

marcus1487 commented 1 year ago

I think the interpretation is a bit off here. For each individual base in a read there is a likelihood that the base is modified on that read given the sequencer observations. The ground truth is that (assuming the canonical basecalled sequence is correct) the base is either canonical or one of the modeled modified bases.

But at aggregation time this is no longer a likelihood. In a human sample the "ground truth" might be that 30% of reads are methylated. Thus the aggregation should not be interpreted as a likelihood, but an estimate of a fraction. As such the missing calls (in ? mode) provides no information about the fraction of reads with a modified base at this position.

If we could separate modified base coverage from total coverage the ideal here would be to have the total bar height be 2 in both cases and the fraction be 0% or 100% in the two proposed cases. Given the modified base annotation of the coverage track, I still think that the 0% or 100% fraction are the most logical choices, but could be misleading (which I think is the core reason for this thread).

jrobinso commented 1 year ago

Yes, got it, I agree on the interpretation. So let's not talk about likelihood for sites. For discussion let's assume we have a separate modified base coverage, either as bars or heatmaps as suggested in other thread. There is still a counting ambiguity due to likelihoods. Consider these 3 reads. What is the summary count for each modification? The alignment base color is assuming the red-blue diverging scheme. This is what will be seen in the alignment pileup, there's an expectation the coverage more or less reflects this.

Read	5hmC	5mC	Not modified	Alignment base color
1.	22%,	7%,	71%	Blue.
2.	7%,	0%,	93%	Blue
3.	10%,	82%,	8%	Red
4	11%	12%	77%	Blue
5.	50%	0%.	50%	Grey. (alpha is 100% in this case)
6.	?	?	?	Grey
7	25%	25%	50%	Grey. (alpha is 100%)

I'm starting to think trying to summarize the reads is a mistake, but its apparently quite useful to give some indication in the coverage track (or perhaps a new summary track) of the general state.

This is a lot easier with bisulfite sequencing obviously as there are only 2 states, and they are absolute (no likelihood).

EDITED WITH MORE EXAMPLES

marcus1487 commented 1 year ago

I generally agree that the summary in the coverage track is useful. IMO given that the quantity being estimated is the fraction of modified bases and not the centroid of the estimated probabilities, it follows that hard thresholded calls from each read would be the best unit for aggregation. In my ideal visualization there would be a threshold under which the aggregated color would be shaded in order to display locations where the fraction of reads identified may be of lower quality. I'm happy to describe the exact rule I would envision if you think this is feasible.

I understand that this is just my opinion and requires extra dev effort, so happy with keeping the centroid of read estimates if that is decided.

Writing my response as your comment was updating, so I agree with the interpretation of the above table. If it helps an extra example of 25%, 25%, 50% would still be Grey (to my understanding).

jrobinso commented 1 year ago

I tend to agree with you that hard thresholded calls is the best unit, for one it will match what you see in the read pileup. The difference in practice for most datasets won't be that noticeable, so I think its o.k. to make this change. Dealing with the likelihood centroids is proving almost intractable and does not have a ready interpretation in any event.

RE your extra example, added, and yes it would be gray. In this case the "non-modified" likelhood wins => blue, but the 50% results in 100% alpha. So visually a 50% likelihood in the 2 color scheme is no different than "?", but in a way 50% is a quantitative value for unknown, equal chances for modifed and unmodified.

jrobinso commented 1 year ago

@marcus1487 And yes, if you can describe your exact rule for shading below threshold that would be useful.

marcus1487 commented 1 year ago

My rule for shaded values in aggregation would be to 1. pick the "winner" call (highest probability call including canonical) 2. Then if the winner probability is <50% (possible in the >2 mods case) the base would contribute a grey value. If the value is between 50% and the threshold it would contribute a value at 50% alpha and if the winner likelihood is greater than the threshold it would contribute a 100% alpha region. Then the coverage bar would be distributed equally with the colors from the above rule applied to each read with a valid modified base call at the position.

marcus1487 commented 1 year ago

So to annotate your table, with a threshold value of 80% one would get:

Read	5hmC	5mC	Not modified	Alignment base color
1	22%	7%,	71%	Blue (50% alpha)
2	7%	0%,	93%	Blue (100% alpha)
3	10%,	82%,	8%	Red (100% alpha)
4	11%	12%	77%	Blue (50% alpha)
5	50%	0%.	50%	Grey
6	?	?	?	Grey
7	25%	25%	50%	Grey

And in practice deciding which side the 50% value falls on is not relevant since there are an even number of discrete bins, so a 50% probability in the tag format is not actually possible.

jrobinso commented 1 year ago

One more case to consider, 5hmC and 5mC both at 40%. So there is no winner > 50%, but this case would mean there is 80% likelihood it is either 5hmC or 5mC. I don't know how likely this is, and that would depend on the platform and pipeline, but in your rule this base would be grey, correct?

marcus1487 commented 1 year ago

Correct. I would say that if a user wants to see the probability of "5mC or 5hmC" then they could use modkit to collapse these values into one. IGV (or any other visualization) should show it as unknown which is what the calls say. That might be just me, but feels like it gives users the options they might want now that we have the tooling in place to manipulate these files.

mrvollger commented 1 year ago

I think I agree with all the conversation here, and my case is simpler since I don't need/want the two-color scheme for m6A.

I think this is in sync with what is being suggested but for m6A I'd like the coverage track to show the fraction of reads that have an m6A likelihood above some threshold. I think a reasonable place for the threshold would be a 50% likelihood; however, I am not too attached to where that boundary is set.

Thanks! Mitchell

jrobinso commented 1 year ago

There will be 1-color and 2-color options for all modifications. I don't really think consensus is possible on all details, so I will make as much configurable as possible in the development time available. I think there is another issue open for 6mA specifically that wants a threshold of 80%.

jrobinso commented 1 year ago

I've finished a complete refactoring of base mods along the lines discussed above. The most significant change is in the coverage track, which is now based on a count of modifications exceeding a threshold, rather than a sum of likelihoods (i.e. average likelihood). The average likelihood is used to apply alpha in the same manner as the alignment bases themselves. Single vs double strand recording of modifications (C+m / G-m) is detected and accounted for. The implementation overall is much cleaner and easier to understand, but somewhat ironically perhaps the visual change will not be noticeable for most datasets.

Another significant change is in the menu. There are "monocolor" and "two color" options for all modifications. If more than 1 modification is present you can choose to see all, or a specific modification. This is a generalization of a request to see "6mA" only. The menu looks as follows for a dataset with 5mC and 6mA modifications. (This dataset records 5mC modifications on top strand only, and 6mA on both strands, so it made an interesting test case).

The new build is available on the development snapshot page: https://software.broadinstitute.org/software/igv/download_snapshot

Draft documentation can be seen here: https://igv.org/doc/desktop/#UserGuide/tracks/alignments/base_modifications/

jrobinso commented 1 year ago

Also, nearly everything is configurable in preferences @marcus1487 if you wish you can try your 4-color scheme for the unmodified bases here.

mrvollger commented 1 year ago

I am noticing that there isn't a difference between m6A one and two color modes:

Here is a bam file if you want to check it out for yourself: https://s3-us-west-2.amazonaws.com/stergachis-public1/Projects/pacbiome/HG002_WGS.haplotagged.bam And the locus: chr11:5,243,687-5,252,744

jrobinso commented 1 year ago

@mrvollger There doesn't appear to be any low-confidence 6mA calls in this dataset. No information is not the same thing as no modification.

mrvollger commented 1 year ago

I forgot about this part of our previous discussion sorry. So we'd only see something here is we had some m6a with ML tags indicating less than 50%(?) probability. Thanks!

jrobinso commented 1 year ago

@mrvollger Yes, or if you add "." to your tag, or alternatively change the default assumption for skipped bases in preferences. However, if you assume skipped bases indicate evidence of no-modification every A and T read base in your dataset without an MM/ML value will turn blue, since you have both A+ and T- calls. In practice I don't think the "." assumption is going to be useful for anyone for visualization, of course its useful for other purposes.

jrobinso commented 1 year ago

And yes 50% is the default threshold, this can be changed in preferences.

ctsa commented 1 year ago

Thanks @jrobinso this looks like a nice generalization of the basemod viewing options.

igvteam / igv

Interpretation of base modification likelihood in presence of many no-calls #1359