Improve support for long-read methylation marks beyond 5mC (C+m)

amwenger commented 2 years ago

@jrobinso @mschatz @marcus1487

Split from #1133. Related to #1153.

This issue splits from #1133, which provides nice support for visualizing 5mC methylation (C+m in MM tag), to support other marks - particularly 6mA, 5hmC, and 4mC - and both strand encoding (A+a, T-a).

Support simultaneous visualization of different marks (C+m, A+a) in the same read, with a different diverging color scheme for each mark. ** Rather than pre-assigning a color scheme to each mark, there could be a limited list of color schemes (perhaps 3 or 4), which the user could assign to marks. This would limit the number of colors shown at any one time. There would be default assignments: e.g. red<->blue for C+m, orange<->purple for A+a.
Support multiple marks at the same base (e.g. 5mC and 5hmC).

marcus1487 commented 2 years ago

There are two bits I think we should clarify here.

First is that with more than one mod against the same base (e.g. 5mC and 5hmC) IGV (or any software) will need to aggregate the ML values over all mods to compute the implied canonical base probability. This has certain implications metioned below.

A sub-issue with this first issue is in the ? tag version case; I'm not sure what the interpretation should be for example when 5mC has a likelihood but the 5hmC likelihood is missing. This would mean that the canonical probability is also missing. Nanopore software would not output this data type and I don't foresee a setting in which we would add this output, but it may arise from other software and may technically follow the tag spec. I would suggest treating these sites as missing data (in the ? tag case; obviously in the . tag case we would interpret 5hmC prob is 0 since it is missing).

The second point of clarification is that multiple modified bases don't really have separate canonical states. For example I would imagine that for a 5mC and 5hmC visualization C would remain colored blue, 5mC could remain red, and 5hmC may choose a new color (purple?; on the biological pathway to return a 5mC to C). I would suggest that the color is chosen for the most likely base (canonical or modified) and that the alpha is set as max(2 * base_prob - 1), 0) (on a 0-1 scale; note that this should match the current settings for 2 bases and logically extend to multiple bases).

To expand on this, reads could be annotated with 5mC, 5hmC and 6mA. In this case I would propose that in the modified base coloring scheme that a canonical base remain the same color no matter the canonical base identity. So canonical A or C bases (which are called as defined by the .? specifies) would be colored blue. The the mods would each have a color 5mC colored red, 5hmC colored purple and 6mA colored yellow. I could also see an argument for canonical colors matching the "standard" coloring (a=green, c=blue, g=yellow, g=red). This would force a change from the current standard as red is already the color for T, and so 5mC could not be red also. That may be a logical change in any case to visualize mods and SNPs simultaneously (currently a SNP C->T and called 5mC would both be colored red; which is not ideal).

For the "standard color scheme to share screenshots", I agree with @jrobinso that this is quite important. I would suggest that a common dictionary of colors be set for the mods found in the table for the MM/ML tags in the SAM tag format docs. It might make sense to allow new colors for unknown single letter codes, but I would support fixed colors for the "standard" modified bases to make screenshots "standardized".

This was a bit of "stream of consciousness" thoughts on the matter, but hopefully somewhat helpful for the discussion.

jrobinso commented 2 years ago

@marcus1487 Thanks for the input. Could you clarify what you mean by this, in particular what is "the implied canonical base probability". I interpret the ML value to be the probability of the modification, I thought for these purposes we were taking the base itself as a given, in any event I'm not sure what this means.

First is that with more than one mod against the same base (e.g. 5mC and 5hmC) IGV (or any software) will need to aggregate the ML values over all mods to compute the implied canonical base probability.

WRT colors, "color by base modification" is a specific mode, coloring for snps (the default) is another mode, they are not mixed. It would be too confusing to do so, handling that many colors is just not something easy to interpret. When selecting "color by modification" mismatches are shown, in a subtle way not using color, in the alignments but not summarized in the coverage track. So I'm having trouble following your discussion of coloring the "canonical base".

If there are 2 modifications of a base called. (e.g. C+mh), I had thought by default we would try mixing the color components for each modification based on relative probability, not sure how to alpha shade, i.e. what the probability of the combination is. Perhaps this is what you mean by "canonical base probability". Perhaps its just the sum, i.e. of 1 is 0.4 and the other is 0.4, its 0.8 that it is modified (by 1 or the other). The spec implies this when it says they should not sum to more than 1.

I am waiting for real data to experiment. Haven't thought about this too deeply.

marcus1487 commented 2 years ago

This may be my opinion, but in my interpretation with more than one alternative to a canonical base (e.g. 5mC and 5hmC) the probabilities need to be taken together. For example it would not be logical to have a 90% probability of both 5mC and a 90% probability of 5hmC. Given this it follows that, for example, C, 5mC and 5hmC probabilities must be considered together. For Nanopore models, we will always consider multiple alternatives to a canonical base together. Given that probabilities adding to greater than 1 are not logical, I don't think it much matters how these are interpreted by a genome browser. I hope this line of logic also makes it obvious how the canonical base probability is inferred from the alternative base probabilities associated with that canonical base.

Your explanation of avoiding snp and modbase visualization makes perfect sense. I agree visualizing both at one would be far too confusing. I do still think that matching the canonical base color with the canonical base color fit snps and color in the reference bases makes sense. I would thus support changing the 5mC red color to something other than the "standard" canonical base colors. I hope this makes some sense.

I don't think the probability of multiple modified bases should be mixed via alpha values. As mentioned before I would suggest that only a probability >50% be colored and that would thus only be a single modified base (or the canonical color in the 2-shade setting). Again I think this all fits with the idea that all modifications should be considered together.

Hopefully this helps clarify things a bit, but do let me know if there further clarifications that would be helpful.

jrobinso commented 2 years ago

The red/blue scale came from the long established (in IGV at least) bisulfite scheme. Does your suggestion that only modifications with probability > 50% get a color contradict this scheme which we have discussed at length in #1133? I thought we had reached consensus on that, for 5mC at least. Or are you suggesting this more generally for other modifications?

For methylation at least it makes some sense, because lack of methylation (or in cancer studies loss of methylation in cancer samples), is an important finding. So calling attention to a CG site that is not methylated makes sense. I have no experience with the other modifications.

jrobinso commented 2 years ago

Or, perhaps you are saying in the 2 modification case that the modification with ML > 50% be the one that is colored, other ignored for visualization purposes? That still leaves the question of what to do with a 5mC with probability 5%, in the interpretation discussed in #1133 this means a strong probability of no methylation (or no 5mC), thus a strong blue color.

It will probably be more fruitful to have this discussion when we have data, unless it impacts #1133 which will probably be released before then.

marcus1487 commented 2 years ago

Does your suggestion that only modifications with probability > 50% get a color contradict this scheme which we have discussed at length in https://github.com/igvteam/igv/issues/1133? I thought we had reached consensus on that, for 5mC at least. Or are you suggesting this more generally for other modifications?

No, I don't think my proposal changes #1133 except for thinking about the color choices in the long term (I will expand on this below). I am offering an extension of this scheme to other modifications. In the simple 2 mods case the 5mC probability implies the canonical probability. When more than 1 modified base is annotated the modified bases and the corresponding canonical base need to be taken into account together. For example a tag indicating 5% probability of 5mC and 60% probability of 5hmC implies a 35% probability of C. I don't think it would make sense to interpret the three probabilities separately as this would leave 2 different implied probabilities for C and I'm not sure how one would work with that for visualization.

For the coloring issue which may effect #1133, I am thinking long term when a read may have modified bases annotated against all canonical bases. To my mind it would be quite confusing with the two color scheme to have the canonical colors not match the canonical colors from the SNP/reference visualization. I understand that bisulfite visualization used red and blue historically, but there were no prospects of including a modified T annotation at this time so the ref color did not matter nearly as much. With the advances in technologies detecting many modified bases from the same strands is likely not too far off. So I would propose that for the 2-color schemes that we reserve the standard canonical base colors for the "high probability canonical" colors. I hope this is making some sense.

Or, perhaps you are saying in the 2 modification case that the modification with ML > 50% be the one that is colored, other ignored for visualization purposes?

Yes. This is what I am saying. And if no modified base is >50% then the base will look the same as "uncalled" since very little is known about the modified/canonical status of the base.

That still leaves the question of what to do with a 5mC with probability 5%, in the interpretation discussed in https://github.com/igvteam/igv/issues/1133 this means a strong probability of no methylation (or no 5mC), thus a strong blue color.

In the single mod case I agree. A 5% 5mC probability means 95% canonical C and should be colored strong blue. Hopefully I explained the rest of this issue above.

It will probably be more fruitful to have this discussion when we have data, unless it impacts https://github.com/igvteam/igv/issues/1133 which will probably be released before then.

I agree, I did not get to this. I will try to get some data together for the 2 color mods as this is much more simple (already in the production caller). For the duplex caller this will likely be a month or so out (summer vacations etc).

jrobinso commented 2 years ago

Ok I think I see what you're saying now. Let me try to restate. You envision a color scheme to reflect the state of a base, either unmodified, or one of several possible modification states. I think this could be another color scheme choice. The red/blue scheme is NOT trying to reflect that, rather it is reflecting the probability that the base is methylated or not methylated. Its a two state model (stating the obvious). Luckily perhaps "C" is associate with blue, but red of course is associated with T.

Your model has a lot of appeal and could be generalizable, but at some point you run out of colors. 4 colors including 3 primary colors are taken by A,T,C, and G. Worth working on but it will take some time I think. I'm inclined to leave the red/blue option as a methylation visualization for 5mc, this is a really important special case and might be the dominant case for some time. The new scheme can be a different coloring option. I don't think this will confuse anyone.

We might rename the red/blue 5mc coloring option to something more specific.

marcus1487 commented 2 years ago

You have correctly summarized the issue and my motivations for making the change now versus in the future. As long as we are on the same page (which it seems) I am okay with retaining the red-blue for historical reasons. I just wanted to make sure the point was raised as this does seem a logical point at which to start this "new standard" with less confusion. I am happy with your choice either way.

jrobinso commented 2 years ago

Yea perfectly logical, but so far we don't have a proposal for a new standard. I don't think releasing the updated red/blue scheme precludes a better standard later, especially since there is nothing new about it. This is just the methylation focused display with a different technology used to detect it. I do think we might rename the option to something more specific, not a general "base modification" option.

amwenger commented 2 years ago

To add my opinion, I think what @jrobinso and @marcus1487 are discussing makes good sense and that developing a scheme to show multiple modifications in the same reads will become important. @marcus1487 already has some models that can generate that data, and we are working on it too.

I like Jim's proposal to release what is implemented already and leave this issue to discuss the more general solution for the future. Could we have sub-settings for Base Modifications like there is for "Bisulfite mode" today? Perhaps we could add "Base modifications > General" and "Base modifications > 5mC" as the two options today. The "General" mode would be the currently released feature, the "5mC" would be the blue<->red scheme, which - in my opinion - is much better for today's PacBio data.

jrobinso commented 2 years ago

@amwenger yes something like that. We have a few weeks to think about what to name the menu, and discover any bugs or issues in the methylation-view branch. I'll be traveling and will look to release this on my return in early July.

marcus1487 commented 1 year ago

Copying this message from #1186 as it is really a discussion relevant to this thread.

In my opinion these complications are the main reason a general modified base mode is preferable to a 5mC or bisulfite-emulation or other "special case" mode. Having thought through the proposals/issues raised by @jrobinso I think I have a proposal for the general modified base case. Additionally I would propose dropping the 5mC/bisulfite special case and transferring these improvement to the primary modified base mode.

I'll separate my thoughts into two sections (1 for per-read display and 2 for the aggregated display).

Per-read modified base display a. Core idea here would be to display the most probable base (modified or canonical) more prominently and ambiguous (or less probably calls) less prominently. b. To be more specific, I would propose that any modified or canonical call with probability >50% be colored as in the 5mC mode (alpha gradient over 50-100%). For sites with no call >50% the alpha 0 (grey) would be used (e.g. C/5mC/5hmC all have 33% probability). c. One important point here is that only canonical bases for which an alternative is presented should be colored. For example, when the 5mC/m and/or 5hmC/h calls are provided only canonical C bases should be colored. If 6mA/a calls are provided canonical A calls should be colored. If 5mC/m and 6mA/a bases are provided then both canonical A and C calls should be colored. i. There may be tricky cases here given the nature of the tags. What should happen when different reads in the same region contain different tags? Scan the whole region for all mod tags? ii. Another complication (related to the next/aggregated section) is canonical (or modified bases) at non-reference sites. I would propose that for the per-read visualization basecalls be shown whether they match the reference base or not.
Aggregated visualization a. For the general purpose aggregated visualization the main issue is the "strandedness" of modified bases. In the initial visualization mode this resulted in modified bases only covering half of the "coverage" bar. In the 5mC mode the full bar is colored according to the strand with the "potential modified base" (but leads to complicated edge cases https://github.com/igvteam/igv/issues/1185 ). To my mind the best solution here is to have the aggregated bar be split into forward and reverse sections. Given that modified bases are uniquely stranded compared to most data types I think this is warranted. I do understand that this is likely a larger burden on the implementation side, but hopefully it is apparent that this is a much more natural presentation of this data type. (Bonus points for bars going up for forward strand and bars going down for reverse strand, but this may just be a personal preference) b. As noted in several issues the height of the full bar should represent the full coverage from all stranded reads (matching reference or not) and the modified/canonical bases should represent only the portion of the full bar attributed to the applicable modified or related canonical base. c. I could see merit in showing modified bases (and canonical equivalents) matching to the mapped reference bases only. I would worry that showing all modified and canonical bases might confuse the visualization mode too much (merging modified bases and SNPs in a sense). This "full modified" visualization could be triggered via a flag similar to the "show all bases" tick box for canonical basecalls. This would result in an aggregated visualization that would match the per-read proposal above, so might be a good option. Generally SNPs should be rare so this distinction likely does not effect so many cases.

Hopefully this proposal/description is helpful. Let me know if further clarification would be helpful or if others have tweaks which may improve the visualization.

Also note that this proposal requires a selection of colors for modified bases which does not clash with the canonical colors. This is of particular interest as red is currently used for canonical T (in SNP mode) and canonical C (in modified base mode to match bisulfite mode). As I've suggested before, I think that the canonical T color is more widely used and this would be a good time to set a new "standard color palette" for modified bases.

Yijun-Tian commented 12 months ago

Just a "yes" vote for @marcus1487 comments about the binary color scheme for per read modification by MM/ML tags, and the stranded modification frequency on the coverage panel. Both are very necessary.

jrobinso commented 12 months ago

@Yijun-Tian First, as mentioned above no progress can be made on this ticket until representative datasets are availabe for multiple modifications (beyhond 5mC, 5hmC). Currently I have multilple 5mC/5hmC datasets available and a single 6mA dataset, from PacBio data. So comments are speculative at this point.

A bi-color scheme for all possible modifications would seem to me to be unwieldy, requiring a large number of colors, and if multiple modifications are present in a single file I think it might be hard to interpret.

Could you elaborate on why stranded frequency? Its a bit of a mystery why so many datasets, in fact all 5mC datasets that I have seen, have C+ modifications without corresponding G-. The single 6mA dataset I have reports both A+ and T- at more or less equal frequency. In at least some cases this seems to be a matter of convention for CpG site calling, the CG site is called as a unit and the result reported only for the C+ strand. Are there other reasons you would expect some sort of strand bias?

Yijun-Tian commented 12 months ago

Hi @jrobinso , Thanks for your thoughtful consideration when parsing the base modification color scheme. For the 6mA nanopore data, I can provide a sliced of BAM for your development. Please check the attachment. The main reason bi-color scheme is necessary is that the per-read modification would ideally tell yes or no answers for the interest modification at each position, even a large proportion of base modification probabilities in the MM/ML tag falls into the grey area for some types of modification, such as 6mA. The blue and red color would be OK for all modifications when using arbitrary thresholds. Such a function to switch on the binary color mode in the browser would provide a good opportunity for users to visually optimize their threshold. For multiple modifications in one position, I would suggest the users to split the BAM file before loading to the browser. For the stranded frequencies, I think it has something to do with the users' expectations and project specific initiatives. For 5mCG in endogenous human genome or introduced by enzymatic (M.Sssl, M.CvPI), or dam introduced 6GmACT, people would expect to see the per-site combined methylation, since these modification occurs in symmetric manner. But in some situations when the enzyme's manner is unclear, the stranded summary is necessary, such as 6mA triggered by EcoGII enzyme. For the half coverage bar, I feel this will be solved after binary the base call in the alignment, since it will be calculated and colored between the modified (red) and canonical (blue) bases, and the users would be able to perceive the grey part coverage is from other strands. 6mA.zip

jrobinso commented 12 months ago

I still don't get the point you are making about strand, but perhaps with an example dataset illustrating it I might. DNA is of course double stranded.

I do get it, I think, re 2 colors -- if I understand you correctly the point is to distinguish between strong evidence that the modification is not present, versus unknown or not measured likelihood.

Thanks for the 6mA data, its very useful to have data from both platforms.

I do not have time to develop features for base modification display based on speculative situations, and don't have the biological background to do such speculation. Features will be considered when there is specific data relevant to it, after getting consensus from the broader community. This (base modification) is a very interesting and important development, but I unfortunately have finite time.

jrobinso commented 12 months ago

@Yijun-Tian And thanks again for the dataset, but it is too limited to be useful for coverage track development. The read sequences are dominated by "N", depth is very low (from 1-4 counts), in short this is not representative. It is useful for verifying basic functionality though.

I do have a question, and this is quite important actually. Why are there no "T-a" calls for the 6mA modification? There are only 'A+a" calls. This makes no sense to me, how does this situation arise?

Yijun-Tian commented 12 months ago

@jrobinso . Thanks for the quick response. I am happy to share more alignment in a dense sequenced region for development uses. For no "T-a" call question, it's been a while since I tried to parse the MM/ML tag by myself, and please correct me if I am wrong. The MM:Z tag refer the modified bases position to the as-sequenced orientation, which is different from SAM file convention to call in REF direction. Therefore you will always see C+m and A+a in the field. Below was copied from SAMtag specification:

marcus1487 commented 12 months ago

For nanopore simplex data (currently the primary data source), only the template strand is observed. Thus we only have observations on the forward strand of the read (A+a tag). Pacbio and duplex nanopore data observe both strands of the resulting read and thus G-a tags would be present. I think this is somewhat pertinent to the discussion of the stranded mod aggregation track, but not the primary point.

For "duplex" modified base called reads (A+a and G-a calls) the single read contributes to both forward and reverse strand modified base calls, but only once to general coverage. For simplex called reads modified bases are only called on one strand of the read. Thus it would be much more intuitive to have modified base aggregation tracks separated by strand as it is intrinsically a stranded entity while read coverage is most often use without concern for the strand. A duplex read contributes to the forward and reverse modified base calls, but only once to the coverage track. Thus it seems that separating coverage from aggregated modified base calls would be more logical (though does start to get a bit busy; but this is configurable).

The larger use case for the stranded aggregation track is where reads have either "A and T" or "C and G" modified base calls. This would result in a situation where the current single aggregation track would not be sufficient. Nanopore do not currently have released models for this state (though users may generate their own and produce basecalls with these modified base tags). As such this is likely a much lower priority IMO.

The more pressing feature request for me would be the diverging color scales. For models with less confident calls the visualization is currently quite misleading (especially when zoomed out). The issue is that for example a 10% modified 6mA call intuitively represents a quite confident canonical call, but with the color scale from 0-100% shows up as slightly shaded (visually indicating some chance of being modified). It would be much more intuitive to have a diverging color scale for any modified base (completely eliminating the single color scale option). This is exaggerated when zoomed out as it can quickly appear as though "everything is modified" for a sample that when analyzed (and carefully considering probabilities) shows quite low levels of modified bases.

As discussed above I understand the issue with "running out of colors". This is why I was/am opposed to the 5mC(red) -> canonical (blue) scale since this is already sharing colors claimed by canonical bases. User settable colors for mods with some reasonable defaults would be ideal, but having the diverging color scales for all modified bases would be a great improvement IMO. If an example dataset would be useful here I am happy to send one offline.

jrobinso commented 12 months ago

Thank you @marcus1487 for the explanation of G-/T- calls. This needs to be considered of course to properly estimate the likelihood of a modification at a particular site. In the "simplex" case only ~1/2 the reads at a given reference site have any possibility of reporting a modification. I had not seen a "duplex" dataset until recently, I will now need to consider this. Its easy to detect from the presence or absence of the complementary (e.g. G-) calls in the tag.

The red/blue color scale is not going away for 5mC, its too well established from bisulfite sequencing. You of course don't have to use it if you don't like it. It probably needs a better name in the menu.

I do understand the need for diverging scales for other modifications. Using the existing canonical base color for the "unmodified" state has some appeal, but you need complementary contrasting colors for all possible modifications to that base, and these colors can't be close to the color of any other canonical base. You also need to consider the effect of alpha on contrast. With only 2 or 3 modifications (m, h, and a) you can maybe do it, but its not going to scale. Alternatively, you could consider a color (blue for example) to represent any non-modified base. You then only need a set of colors for modifications that has good contrast with a single color (blue). This is easier.

Then again, I am imagining a single bam file with many different modifications present. Is that reaslistic? If the file has only a single modification, or maybe 2, it might get easier.

Time is very limited for base modification development, in fact until this thread kicked off again I had planned to wrap up today. If we can agree on a scheme, including colors, for a generalized 2-color representation of all bases I can implement that now, or else it will need to wait for some time.

BTW currently I have no indication that anyone other than a few specialists at the vendors are using IGV for anything other than 5mC, and possibly 6mA, modifications. Of course not everyone contacts us and this could be amiss.

And yes, any datasets you have for modifications other than "simplex" 5mC would be helpful. I very recently got a "duplex" 6mA dataset.

jrobinso commented 12 months ago

RE color schemes, one possibility for default would be "blue" for unmodified, and the existing colors for modifications, with user configurability. This doesn't require much thought, and doesn't upset users who might be used to the existing colors. Of course I have no evidence there are any such users, mabye that's not a big consideration.

jrobinso commented 12 months ago

@marcus1487 Are ONT pipelines specifying "." or "?" for 5mC calls at CpG sites? Or more generally, is it specifying this optional parameter at all? Currently this optional flag doesn't really have any impact on alignment viz since no-call and 0% calls look the same, but it will have a big impact for 2-color schemes. I still see datasets where this isn't set, which according to the spec should be interpreted as 0% likelihood of modification (i.e. .).

marcus1487 commented 12 months ago

All of our current callers will output the mode explicitly (either . or ?). The 5mC CG-context models will use the ? mode since only the CG-context sites will be called while the 5mC and 6mA all-context models will use the . mode.

The older files without a specifier for the CG-context calls are certainly problematic. We've started working on a tool called modkit which aims to be the sort of bedtool/samtools for modified base tags. One of the commands included there allows for the conversion of these files without a specifier (modkit adjust-mods input.bam output.bam --mode ambiguous). You might point users to this for older files.

marcus1487 commented 12 months ago

RE color schema: I think a single color for canonical (of any type) is quite reasonable. Making the canonical and the modified base colors configurable would be a bonus, but I'm in favor of anything which gets the diverging colors scale into the next release, so if there are solutions with less dev effort I would not be opposed.

marcus1487 commented 12 months ago

Then again, I am imagining a single bam file with many different modifications present. Is that reaslistic? If the file has only a single modification, or maybe 2, it might get easier.

Our aim at nanopore is to have the default basecaller output 5mC, 5hmC and 6mA in all contexts.. I don't have a great sense for how large the user base is for many more than 3 modified bases. I would guess that a solution that works for 3 modified base would be sufficient for the medium term future at least.

jrobinso commented 12 months ago

OK, thanks for the input. This is becoming clearer now.

igvteam / igv

Improve support for long-read methylation marks beyond 5mC (C+m) #1155