igvteam / igv

Integrative Genomics Viewer. Fast, efficient, scalable visualization tool for genomics data and annotations
https://igv.org
MIT License
632 stars 379 forks source link

Long read methylation display #1133

Closed mschatz closed 2 years ago

mschatz commented 2 years ago

Here are a few ideas from @amwenger on improving the display of methylation from long reads:

If it would help, I would be interested to contribute ideas on how to process and view methylation data in IGV and agree that implementation should support the standard tags regardless of sequencing platform. I had submitted https://github.com/igvteam/igv/issues/1095 to make some small changes to IGV’s implementation. After spending some more time with PacBio methylation data in IGV, my current ideas are:

jrobinso commented 2 years ago

I'm going to start work on this and related issue(s) tomorrow. So far I've only found 1 currently open, #1095.

Sometimes a significant portion of time on any given feature is spent looking for suitable test data. If you have or know of any please point me to it. It would really be cool if we had methylation data for bisulfite sequencing and the new direct base modification measurement method(s) for the same sample, but not essential.

amwenger commented 2 years ago

direct base modification measurement

Alignments to GRCh38 of HiFi reads from HG002 with the MM/ML tags are at https://downloads.pacbcloud.com/public/dataset/HG002-CpG-methylation-202202/HG002.GRCh38.haplotagged.bam

There is matching bisulfite and nanopore datasets referenced at https://genomebiology.biomedcentral.com/articles/10.1186/s13059-021-02529-2#availability-of-data-and-materials. I will look for links to those datasets but do not have it offhand.

mschatz commented 2 years ago

My postdoc @oushujun prepared a tarball with a 1MB sample from maize B73 with HiFi reads aligned in a bam, fasta, gff, and bed information here: https://data.cyverse.org/dav-anon/iplant/home/sigma/primrose_example.tar.gz

Let us know if you have any questions!

Mike

jrobinso commented 2 years ago

Thanks for the data and data pointers.

jrobinso commented 2 years ago

@mschatz @amwenger I'm trying to interpret your mockup figure above. What does the blue represent? I ask this because I just reviewed IGVs treatment of bisulfite sequencing. This was implemented many years ago and not by me, but by Ben Berman, so I was rusty on the details. In any event, bisulfite mode in IGV has some special characteristics. First, only "CG" dinucleotide sites are considered, actually a few other "contexts" are optional but in every case only certain sites are in play, so to speak. At these sites the color blue is used for sties deemed not-methylated, determined by looking for C->T (+) or G->A (-) transitions. If there is no change at these sites they are deemed methylated and the color red is assigned. So in bisulfite sequencing red ==> "no signal".

For base modification we only color bases "with signal", and 5mc == red, actually any "m" is red. This is consistent, however we don't color bases with no signal. For a view equivalent to the bisulfite sequencing mode a "C" on the original top strand, or a "G" on the original bottom strand, at a CG site with no 5mc base modification would be blue. Thus the question on meaning of blue in your mockup. Such a view might well be useful, especially for comparison with bisulfite sequencing, but might be a special mode such as "CG methylation view".

jrobinso commented 2 years ago

@mschatz @amwenger I'm going to expose my ignorance here and not for the first time. With Illumina data it is a bit complex and not always possible to determine the "original strand" of the molecule sequenced from a read. It depends on the protocol. I'm sure you're aware of this, are there any analogous issues with common Pacbio and ONT experiments, or can one take the read strand to be the original strand in all cases?

mschatz commented 2 years ago

PacBio HiFi reads are "circular consensus sequencing" so both strands are read and the final sequence is reported with an arbitrary orientation. So for HiFi the strand of the read is not meaningful. For ONT, it is a single pass reading of a molecule so the strand is relevant.

Does that make sense?

Mike

On Tue, May 31, 2022 at 8:41 PM Jim Robinson @.***> wrote:

@mschatz https://github.com/mschatz @amwenger https://github.com/amwenger I'm going to expose my ignorance here and not for the first time. With Illumina data it is a bit complex and not always possible to determine the "original strand" of the molecule sequenced from a read. It depends on the protocol. I'm sure you're aware of this, are there any analogous issues with common Pacbio and ONT experiments, or can one take the read strand to be the original strand in all cases?

— Reply to this email directly, view it on GitHub https://github.com/igvteam/igv/issues/1133#issuecomment-1142889401, or unsubscribe https://github.com/notifications/unsubscribe-auth/AABP344KXE7KRWME56NHZGLVM2WTZANCNFSM5TFA6P3Q . You are receiving this because you were mentioned.Message ID: @.***>

jrobinso commented 2 years ago

@mschatz Yes, that makes sense, I think I knew that at one time. What's behind the question is the previous one, the figure that started this thread has a red/blue scheme that looks analogous to bisulfite mode coloring, where red would indicate methylation at a CpG site, and blue the lack of methylation, as evidenced by the conversion of a cytosine (or not). Only cytosines are informative so we have to know the strand to complement or not.

This scheme could be used for alignments with base modification tags, I suppose, by checking for the presence of a modified base (5mc, others?) at each CpG site. If present the base gets red, if absent the base gets blue. Perhaps the CG dinucleotide gets treated as a unit. Anyway is this the idea behind the red/blue screen mockup at the top of this thread?

amwenger commented 2 years ago

For base modification we only color bases "with signal", and 5mc == red, actually any "m" is red. This is consistent, however we don't color bases with no signal. For a view equivalent to the bisulfite sequencing mode a "C" on the original top strand, or a "G" on the original bottom strand, at a CG site with no 5mc base modification would be blue. Perhaps the CG dinucleotide gets treated as a unit. Anyway is this the idea behind the red/blue screen mockup at the top of this thread?

The PacBio methylation calling does (currently) treat CpG sites in the reads as a unit assuming identical methylation status of the Cs on both strands of the CpG. A MM/ML entry is output for every CpG site and only in the read orientation (i.e. "+" strand in MM nomenclature). There is no "with signal" (red) and "no signal" (blue) binary split. Instead, there is a probability of methylation: high likelihood (i.e. high ML value) or low likelihood (i.e. low ML value). In the diagram, I was trying to suggest that high likelihood (>50%) would be show in red and low likelihood (<50%) would be shown in blue.

jrobinso commented 2 years ago

@amwenger Thanks Aaron, that makes sense, I will dig into the PacBio data you sent me a link to and it will hopefully be clearer to me (e.g. what is the difference between gray and blue). BTW I found the fastqs for the HG002 bisulfite data you referenced, but no alignments (BAM) files. It would be interesting to compare with but not critical.

amwenger commented 2 years ago

e.g. what is the difference between gray and blue

In the illustration, gray is the background color which is applied to non-CpG positions (e.g. to the A/G/T bases or to Cs that are not in CpG). With the diverging color scale, it also ends up as the color for CpG with exactly 50% chance of being methylated since the color fades from full blue (0% probability of methylation = 100% probability of not being methylation) to glue (50%/50%) to full red (100% probability of methylation).

This personal branch (https://github.com/igvteam/igv/compare/master...amwenger:amw-methylation-view) implements something like what this ticket suggests and might give some ideas.

BTW I found the fastqs for the HG002 bisulfite data you referenced, but no alignments (BAM) files. It would be interesting to compare with but not critical.

If you point me to the FASTQ, then I can try to help with alignment. @wac - Do you have any HG002 bisulfite alignments to GRCh38 already?

jrobinso commented 2 years ago

@amwenger See the screenshot below, colored by the current scheme. The lighter reds have lower probability, and would be blue in your scheme. However there a number of reads with no call at all here, which I would interpret to mean no detectable 5mc modification. Should this be interpreted as "0" probability, and be the deepest blue? This leads to the natural question, should we only do this for CpG sites? Otherwise there is blue everywhere. Another question, which I hinted at earlier, should IGV treat CpG sites as a unit when in this mode? i.e the mode would be "CpG methylation", or something like that, and both bases painted the same. In general there can be modifications anywhere, so this wouldn't be a hardcoded option.

Well your latest message just came in, I'll look at your branch and maybe I can answer some of my own questions. Will send fastq pointers in a follow on, I found them at SRA.

Screen Shot 2022-05-31 at 9 09 42 PM
wac commented 2 years ago

Sorry, we don't have any WGBS alignments for the HG002 sample on hand, although I could grab the fastq and run the alignment if that is helpful for troubleshooting.

jrobinso commented 2 years ago

Possible fastq link of interest?

https://trace.ncbi.nlm.nih.gov/Traces/sra/?run=SRR13051250

jrobinso commented 2 years ago

@amwenger I created a PR of your branch just because the github API makes it easy to see diffs, I will check it out locally but am not treating that as a real PR (unless you want me too).

amwenger commented 2 years ago

number of reads with no call at all here, which I would interpret to mean no detectable 5mc modification. Should this be interpreted as "0" probability, and be the deepest blue?

No. For HiFi reads, absent MM/ML annotation is "missing information" not 0 probability of methylation. There are two reasons that a HiFi read would lack MM/ML:

Since the MM/ML tags are explicit representation - different than bisfulfite where the information is implicit in the base sequence - I would treat unannotated positions as "no information" (i.e. not color them and not count them towards numerator or denominator in coverage track) rather than "no methylation".

Should IGV treat CpG sites as a unit when in this mode?

This is right for today's HiFi data where CpG sites are treated as a unit, but I agree it should be an option since the MM/ML tags support strand-specific annotation (i.e. MM entries on + & -) and both PacBio and ONT can in concept detect different methylation signals on the two strands.

amwenger commented 2 years ago

@amwenger I created a PR of your branch just because the github API makes it easy to see diffs, I will check it out locally but am not treating that as a real PR (unless you want me too).

Sounds good. I did not intend it as a real PR but am happy to explain changes and have you merge what is useful.

jrobinso commented 2 years ago

No. For HiFi reads, absent MM/ML annotation is "missing information" not 0 probability of methylation. There are two reasons that a HiFi read would lack MM/ML:

  • The dataset was not processed with the methylation calling software (i.e. it is not a methylation dataset)
  • The HiFi read is derived from fewer than 4 subreads (i.e. not enough information to call methylation)

Ah, o.k., well that certainly simplifies things. If I can summarize then all bases at a CG site should in principal have base modification scores, if they don't its missing information. I would just point out this is the consequence of a particular calling pipeline, it is not and cannot be derived from the SAM spec. I don't know at the moment if this is true of ONT software or not. So it will definitely be some optional mode of coloring, not generic "color by base modification". Only task now is to name it. "Color by ????"

wac commented 2 years ago

I hesitate to suggest since it likely makes things more complicated, but you could have a "color by SAM tag" mode that could have a preset for pacbio methylation, but is a general use values from tag X and color bases from Y color to Z color over a certain numeric range.

jrobinso commented 2 years ago

@wac that might be useful but not applicable here, the MM/ML tags are very special beasts.

jrobinso commented 2 years ago

Comment from @a-slide from an earlier thread (#945):

I think the Red and Blue codes for bisulfite made sense, as it was quite intuitive and bisulfite seq gives a clear binary choice. With this new spec, there is now a modification likelyhood score for each bases, so instead of representing as red or blue, I would probably suggest to use a light to dark color scale for each modification. I am not aware of a color coding convention for DNA/RNA mods

I think the way to handle this in general is to just let the user define a color scale for each modification, which some pre-defined scales available such as red/blue. Perhaps it sufficient to narrow the choice to 2-color heatmap or monoscale from light->dark using alpha transparency. This would be set in preferences.

The other aspect is treating CpG sites, and possibly other contexts (e.g. CHG), as units. This was raised in an earlier issue but I can't find it. This is mostly important in computing % modification at a site, whether represented as coverage or a heatmap. For example, a CpG site that is 100% methylated will appear only ~50% with the current scheme, as only the strand with "C" for each pair will have a modification.

amwenger commented 2 years ago

I think the way to handle this in general is to just let the user define a color scale for each modification, which some pre-defined scales available such as red/blue. Perhaps it sufficient to narrow the choice to 2-color heatmap or monoscale from light->dark using alpha transparency. This would be set in preferences.

I like that solution. Since there are so many different possible modifications, one alternative would be to flip that idea: provide 3 or 4 different 2-color scales and let the user specify which modification (e.g. "Cm") to use for a given scale.

The other aspect is treating CpG sites, and possibly other contexts (e.g. CHG), as units. This was raised in an earlier issue but I can't find it. This is mostly important in computing % modification at a site, whether represented as coverage or a heatmap. For example, a CpG site that is 100% methylated will appear only ~50% with the current scheme, as only the strand with "C" for each component will have a modification.

In my opinion, it is okay for IGV not to treat CpG sites as units. Since the MM/ML tag allows annotation of both read and complement strands, we could have the PacBio caller output the probability calls twice (once for each strand) since that is how our model treats things.

To fix the issue of calculating the % modification, could you make the denominator be only positions with data for a given modification? In other words, rather than dividing by total coverage, divide by reads that have the modification at a given position.

jrobinso commented 2 years ago

To fix the issue of calculating the % modification, could you make the denominator be only positions with data for a given modification? In other words, rather than dividing by total coverage, divide by reads that have the modification at a given position.

Yes I suppose. One thing I had not appreciated is there is a difference between no call and a call of 0% probability. I had imagined some unbiased analysis of the entire read for evidence of a modification, and where found it would be annotated. So the lack of annotation would mean its not found. However I guess that's not how it works, rather specific sites (such as CpG) are analyzed and a probability assigned to each read in the pileup at that location for the presence of the modification. At genomic locations with no MM/ML tags we can't say anything about the presence or absence of the modification. Would that be a fair summary? It can't be the whole story, as the ML tag is optional.

amwenger commented 2 years ago

However I guess that's not how it works, rather specific sites (such as CpG) are analyzed and a probability assigned to each read in the pileup at that location for the presence of the modification. At genomic locations with no MM/ML tags we can't say anything about the presence or absence of the modification. Would that be a fair summary? It can't be the whole story, as the ML tag is optional.

Yes, that is a good summary of how it works for the PacBio HiFi reads.

I did not realize the ML tag was optional. @jkbonfield how should a record with MM but not ML be interpreted?

ctsa commented 2 years ago

One thing I had not appreciated is there is a difference between no call and a call of 0% probability

One of the most useful aspects of the proposed change will be to visually distinguish these two states, right now it is hard to see sites with a low probability of methylation because they blend into the 'no call' background. The red-blue spectrum would make it much easier to see things like allele specific methylation and hypomethylated regions.

oushujun commented 2 years ago

One of the most useful aspects of the proposed change will be to visually distinguish these two states, right now it is hard to see sites with a low probability of methylation because they blend into the 'no call' background. The red-blue spectrum would make it much easier to see things like allele specific methylation and hypomethylated regions.

I would appreciate the no-call sites/reads being colored grey or making it optional for the user. Since IGV is a (great) visualization tool, providing a visual impression of how the actual data look like is very important.

jrobinso commented 2 years ago

@amwenger There appears to be changes to drawing of insertions in your branch, but I can't easily tell if they are real changes or just code rearrangement. Any important changes wrt that?

amwenger commented 2 years ago

I moved the insertion code so it drew after the methylation marks / SNVs. Otherwise the methylation rendering was obscuring the insertions.

On Thu Jun 2, 2022, 07:26 PM GMT, Jim Robinson @.***> wrote:

@amwenger https://github.com/amwenger There appears to be changes to drawing of insertions in your branch, but I can't easily tell if they are real changes or just code rearrangement. Any important changes wrt that?

— Reply to this email directly, view it on GitHub https://github.com/igvteam/igv/issues/1133#issuecomment-1145245091, or unsubscribe https://github.com/notifications/unsubscribe-auth/ABWS3JO555OMA624TSIJPHDVNEDHPANCNFSM5TFA6P3Q . You are receiving this because you were mentioned.Message ID: @.***>

jrobinso commented 2 years ago

I did not realize the ML tag was optional. @jkbonfield how should a record with MM but not ML be interpreted?

@amwenger @jkbonfield this is an important question. In the absence of an ML tag I am currently interpreting MM as a binary presence/absence maker. That is, the modification is present or its absent. So there is no difference between a "no call" and "not there". In the HIFI scheme described by Aaron there is a difference in an MM with ML of 0 => modification is not there, and the absence of an MM tag => nothing can be said about it.

Currently IGV intreprets an MM tag without a likelhood (ML) as modficiation is present, this seems to be the only way to read the spec. For visualization it is treated as a modification with likelihood 100%.

wac commented 2 years ago

I'd forgotten about the ability to specify arbitrary number of modifications per base so the MM/ML data really does need to treated in tandem!

In terms of representing the information stored completely/unambiguously it might be useful to have a distinct color for "MM tag present but no ML value" as a corrolary to the "no tags present" - I think this covers all the possible cases in (valid) data:

Grey = base not tagged in ML or MM (for the base mod requested) "Black?" = MM tag defined but no ML tag for this read Red through Blue = ML value matching MM tag detected and represented

If the colors are user selectable then users could just change "MM only" to Red to still be able to get the other behaviour if desirable. They could also switch Blue with Grey if they wanted to get a monocolor scheme as before.

jrobinso commented 2 years ago

@wac all things are possible but its not established yet that anyone is producing files without ML tags, or files with multiple modifications per base.

jrobinso commented 2 years ago

@marcus1487 @fidibidi @a-slide this is a long thread to wade through, but it would be helpful to get your input on the proposal here. I will try to summarize. For DNA methylation, at least, moving to a blue-red heatmap representation analogous to treatment of bisulfite experiments. Any 5mc base modification with probability < 50% would get a shade of blue, > 50% a shade of red. Exactly 50% would actually be invisible, equivalent to bases with no modification, maybe that's o.k. maybe not. I had previously assumed that MM tags indicated that some evidence of a modification was present, with probability ML, but my understanding now is that at least with PacBio pipelines that is not the case, an MM tag with ML of zero is possible, and anything < 50% should be interpreted as the modification is probably not present. No MM tag at all means the base was not analyzed, or had a small read count, and nothing can be said about presence or absence of the modification. A different viewpoint that has consequences for visualization. Fine as far as it goes, but there is now the case of how to interpret experiments with MM but no ML tags. Perhaps that's a theoretical point unless there is software producing such files. I'm afraid my summary is getting as long as the thread.

jrobinso commented 2 years ago

I've pushed a prototype to the "methylation-view" branch. This prototye adopts the alignment base color scheme from @amwenger 's branch, specifically blue for ML < 50%, red for ML > 50%, alpha-shaded by ML. For coverage a different approach is used, as calls are not categorical no attempt is made to count bases. Rather, color is used to reflect the average ML value at the location. For this average bases with no MM/ML are ignored. The color scale is continuous from blue for ML = 0, to red for ML == 100%, so a heterozygous sight will be purple.

There are no user options here as this is a prototype. For user options I am considering presenting 2 choices, a single color with alpha shading (the current released behavior), or 2 color (this prototype), with user selectable colors for each modification type. This is a fairly simple scheme that can be implemented in the near term.

To try this branch you can either pull and build or download pre-built packages from the links below.
For Macs use the "all platform" option, unzip, and start by right-clicking "igv.command" and selecting "open with > terminal". You'll get a security warning as the snapshot builds are not notarized.

There are some example IGV sessions in https://github.com/igvteam/igv/tree/methylation-view/test/sessions/base_mods

mschatz commented 2 years ago

Awesome! Thanks for posting this!

Mike

On Mon, Jun 6, 2022 at 12:11 PM Jim Robinson @.***> wrote:

I've pushed a prototype to the "methylation-view" branch. This prototye adopts the alignment base color scheme from @amwenger https://github.com/amwenger 's branch, specifically blue for ML < 50%, red for ML > 50%, alpha-shaded by ML. For coverage a different approach is used, as calls are not categorical no attempt is made to count bases. Rather, color is used to reflect the average ML value at the location. For this average bases with no MM/ML are ignored. The color scale is continuous from blue for ML = 0, to red for ML == 100%, so a heterozygous sight will be purple.

There are no user options here as this is a prototype. For user options I am considering presenting 2 choices, a single color with alpha shading (the current released behavior), or 2 color (this prototype), with user selectable colors for each modification type. This is a fairly simple scheme that can be implemented in the near term.

To try this branch you can either pull and build or download pre-built packages from the links below. For Macs use the "all platform" option, unzip, and start by right-clicking "igv.command" and selecting "open with > terminal". You'll get a security warning and the snapshot builds are not notarized.

There are some example IGV sessions in https://github.com/igvteam/igv/tree/methylation-view/test/sessions/base_mods

— Reply to this email directly, view it on GitHub https://github.com/igvteam/igv/issues/1133#issuecomment-1147624148, or unsubscribe https://github.com/notifications/unsubscribe-auth/AABP343HFSLPARUSDHYHIZTVNYPJTANCNFSM5TFA6P3Q . You are receiving this because you were mentioned.Message ID: @.***>

amwenger commented 2 years ago

prototype to the "methylation-view" branch

Nice! The blue<->red scheme in the reads looks good to me.

I do not notice the alpha transparency very much with the linear relationship between alpha and probability. I wonder if it would be better to try a quadratic scaling. Something like: alpha = l*l/64 - 4l + 256 (so alpha=256 at l=0 and l=256; alpha=0 at l=128) instead of alpha = abs(127-l)

The color scale is continuous from blue for ML = 0, to red for ML == 100%, so a heterozygous site will be purple.

This is hard for me to distinguish when zoomed out. Could we perhaps use height in the coverage track in addition to the color change (e.g. 50% of height for ML=50%)?

jrobinso commented 2 years ago

I'd rather not use height for anything other than counts, its inconsistent with coverage track generally. We could try a different centroid color, for example copy number heatmaps use. red <-> white <-> blue. I don't think white would work but could do something else. That's more complex for the user though, now there are 3 colors to select if customizing.

jrobinso commented 2 years ago

@amwenger Or maybe I mis-interpret you. We could apportion red and blue bars according to average ML, without changing the overall coverage height. Is this what you mean?

amwenger commented 2 years ago

@amwenger Or maybe I mis-interpret you. We could apportion red and blue bars according to average ML, without changing the overall coverage height. Is this what you mean?

You are giving me too much credit there :-) I did mean to conflate with counts, though I like your suggestion to apportion the height between red and blue based on ML (ML% red 100-ML% blue).

The idea from the first message in this thread was to use a heatmap summary separate from the coverage track. I like that for multiple marks:

image
jrobinso commented 2 years ago

Yes the heat maps are nice, but I have limited time for this at the moment and am looking for incremental improvement. Also, are you producing files with multiple marks, and in particular multiple modifications per base. This is theoretically possible but not handled by IGV right now.

amwenger commented 2 years ago

We are not producing files with multiple marks yet but that is in the plans.

marcus1487 commented 2 years ago

I just got through this thread and can try to add some comments from the nanopore side.

For the discussion around trimmed probabilities (near 0%) vs missing data, nanopore reads originally applied a threshold and only output calls at CG sites. This created the ambiguity that lead to an update in the SAM MM tag spec. With the new MM spec nanopore calls currently use the ? version (missing calls should be interpreted as missing data), though we may explore using the . version (missing calls interpreted as canonical/0% probability of modification) in order to reduce file size once all-context models are released more widely (currently in research release). In any case the tag specifier (? or .) should specify how to handle missing data and old-style data can default to one or the other (I'd prefer ? as the default).

For the sliding scales from 50%->100% and 50%->0% one issue I see with this is the multiple modified base outputs. We have a 5mC+5hmC model for nanopore data (released in December), so this is available now (I'll try to track down some example data for testing). It is possible for a site to have 33% C prob, 33% 5mC prob and 34% 5hmC call. It may not be a bad thing to have these sites revert to the 50% color. This may get a bit more extreme with more modified bases (C/5mC/5hmC/5fC/5caC, though in practice it will often be a non-issue as more samples have one predominant probability).

On the point about context specific data, I would be in favor of letting the MM tag specify the context and not the genome browser. I suspect this is more of a historical artifact in the bisulfite visualization mode and adds unnecessary complexity for the visualization to add this extra filter. I would be in favor of any call from the MM tags being included in the browser. Third party tools may arise to filter MM tags to those at specific contexts for that purpose should the need arise (not too hard to implement a slow version with pysam today).

On the strand question, nanopore duplex data can provide modified base calls from both the forward and reverse strands independently for a single read (nanopore software makes all modified base calls on the single pass signal data and aggregates only from these per-read per-site probabilities). We do not currently have the implementation ready to output modified base call to both strands at the same time, but are working on this now and as mentioned above this is supported by the MM tag spec. So no example data at the moment, but we can ping with this once we have implemented this feature.

For the aggregated visualization I'm a bit conflicted about how to best visualize this. I understand the push not to use height as it suggests coverage, but I think that the height indicating the total number of modified base calls (one way or another) is useful but I also understand that this "alternative coverage" may confuse users. I think that a binary threshold for the coloring in the aggregated data would be good (defined as the highest likelihood for the multi-mod case). This would look very familiar to users looking at SNP data in IGV. I'll have a look at the provided implementation and comment further in a bit.

jrobinso commented 2 years ago

Thanks for the feedback @marcus1487 . Test data is always welcome, I'm starting a collection.

I would prefer to just present modifications as they are found without reference to any context, and that's what it currently does. However the whole issue of "skipped bases / missing calls", ?, and ., muddies the water. How do you determine when a modification call is expected (and therefore skipped) without reference to a context or other expectation of a call? This might not be a problem in practice but it makes for very confusing reading.

Its looking that there will be one set of viz conventions for Pacio data, and an alternative one for ONT, although they won't be called that. This is a bit unfortunate but is workable.

marcus1487 commented 2 years ago

I'm not sure I follow why a different default is required for nanopore and pacbio data. It seems like both data types treat missing calls as missing data using current software. I agree that fully supporting the implications of the ? and . tag specifiers might be a bit of a burden, but I think all current callers use (either implicitly or explicitly) the ? version of the MM tag.

As for the question of where to expect a call, my understanding is that a call for any MM tag is made at the canonical base specified at the beginning of the MM tag. Any non-zero value in the MM tag indicates a skipped base (e.g. C for 5mC or A for 6mA) in the SEQ field. So for the . version of the tag, using 5mC calls as an example, my understanding is that any skipped call (C base) should be interpreted as canonical (there may be some higher threshold, but I think assuming 0% is reasonable). I may also be missing a complication here, but hopefully this explanation helps clear thing up a bit or at least the very least move the collective understanding in a useful direction.

jrobinso commented 2 years ago

@marcus1487 Sorry if I did not express myself clearly, but I think you answered my question. So a 5mC MM tag with "." and no explicit calls (yes an extreme case) would mean that all "C" bases (accounting for strand) in the alignment should be considered called with ML of 0%. So in the blue/red scheme we are discussing for 5mc those bases would be colored blue, i.e. every "C" (or "G" on negative strand) would be blue. In the "?" case those bases would be considered uncalled. IGV is currently implicitly treating all skipped bases as "?".

There doesn't have to be different defaults for different platforms if the users of those platforms agree on visual representation of course. At the moment we are converging on a 2-color scheme for the pacbio data, with ML of 50% being the dividing line for color choice, and ML controlling alpha which approaches 100% (transparent) around 50%. Representation of coverage track is still in flux. In the alternative scheme, which I assume your users are currently using, the base modifications use a single color (red for 5mc) with alpha shading reflecting ML. There is something a bit unsatisfying about the 2 color scheme to represent a continuous value (effectively representing ML), but it does reflect a 2 state interpretation (methylated or not methylated) nicely. Since in reality the base either has the modification or doesn't maybe this is o.k.

jrobinso commented 2 years ago

@marcus1487 @amwenger I think I've got a handle on this, finally, and am implementing support for ? and . However, I want to point out that according to the spec "." should be assumed if neither are specified. See below

When this flag is ‘?’ there is no information about the modification status of the skipped bases provided. When this flag is not present, or it is ‘.’, these bases should be assumed to have low probability of modification.

This is not what IGV is currently doing, it is assuming "?" if not specified, i.e. there is no information about skipped bases). If I change this to be in spec, as I read it at least, if neither "?" or "." are used it will be assumed that skipped bases are equivalent to a modification with low ML (which IGV will set as 0%). The HiFi example data I currently have does not set ?/.

jrobinso commented 2 years ago

And for the coverage track, I think ultimately the heatmaps as mocked up at the very start of this thread is a better solution for summarization. In the meantime I'm just going to try to do something reasonable with coverage until that can be implemented.

marcus1487 commented 2 years ago

At the moment we are converging on a 2-color scheme for the pacbio data, with ML of 50% being the dividing line for color choice, and ML controlling alpha which approaches 100% (transparent) around 50%.

I think the 2-color scheme makes sense for a single modified base and I would support this for any modbam visualization (pacbio or nanopore). The slight complication comes with more modified bases (5hmC etc). I think that an (num_mods + 1)-color scheme is likely the best way to go for this feature defining this visualization over any number of modified bases. Any sites where the maximum probability (canonical or modified probability) is <50% could be alpha=100%. to keep the coloring consistent when more mods are added. It's a bit hacky for multiple mods, but in practice most often the highest probability modified base has quite a high probability for nanopore data at least.

jrobinso commented 2 years ago

To keep things tractable I think the multiple modification visualization should be a separate topic, with good example data.

At the moment IGV doesn't support multiple modifications for a single base at all. Another reason for a separate ticket.

jrobinso commented 2 years ago

@marcus1487 @amwenger Please note the apparent discrepancy in the spec wrt ?/. and current practice. I'm incline to continue assuming "?" by default, despite what the spec says, because this seems to be the practice for both pacbio and ont. Some revision of the spec is in order?

jrobinso commented 2 years ago

@marcus1487 @amwenger Thanks for your patience, without your input nothing would happen on this topic. I am missing test data for the following situations, if you have any let me know. If you don't I'll assume these are only theoretical possibilities and we'll deal with them later. Cases in no particular order

All of the example data I have, as far as I can tell, is. 'C+m' modifications, so really that's all I can confidently say is supported at this time.