igvteam / igv

Integrative Genomics Viewer. Fast, efficient, scalable visualization tool for genomics data and annotations

https://igv.org

MIT License

Summarizing mutations at positions using color coding - feature request #336

Open SchragaSchwartz opened 7 years ago

SchragaSchwartz commented 7 years ago

Hello, I am interested in generating a track that summarizes (using a color code) the number of nucleotides aligning at each position. In essence it would be identical to the track that is automatically displayed on top of bam files summarizing the frequency of A, C, T and G at each position, but the idea would be to create a much 'lighter' track which would not require loading entire bam files.

I envision two ways of visualizing this track: One possibility - similar to how this track is displayed in bam files - would be to have the position be presented in gray if >90% (a user defined threshold) of the nucleotides match the genomically encoded one, and only otherwise color-code the nucleotides that are present. The other visualization mode would simply color code the frequency of each nucleotide at each position.

Again, as is currently performed for bam files, it would be great if the total size of the bar corresponded to the depth at each position (assuming it is provided by the user), and otherwise set to 1. Would it be possible to generate such a feature?
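The two display modes and the bar-height rule described above can be sketched in a few lines of Python. This is only an illustration of the request, not IGV code: the function names, the dictionary-of-counts input, and the default threshold of 0.9 are assumptions made for the example.

```python
from typing import Dict, List, Optional


def bases_to_color(counts: Dict[str, int], ref_base: str,
                   threshold: float = 0.9) -> List[str]:
    """Return the bases to draw in color at one position.

    An empty list means the whole position is drawn in gray.
    (Hypothetical helper; 'threshold' is the user-defined cutoff.)
    """
    total = sum(counts.values())
    if total == 0:
        return []
    ref_fraction = counts.get(ref_base, 0) / total
    if ref_fraction > threshold:
        return []  # reference-like position: gray
    return [base for base in "ACGT" if counts.get(base, 0) > 0]


def bar_height(depth: Optional[int]) -> float:
    """Use the depth when the user supplied it, otherwise a fixed height of 1."""
    return float(depth) if depth is not None else 1.0
```

Calling `bases_to_color` with `threshold=1.0` never triggers the gray branch, which corresponds to the second mode (always color by nucleotide frequency).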

I think it would be of quite broad use, regardless of whether people are looking at endogenous mutations in DNA or RNA, or at experimental protocols that end up inducing mutations in DNA or RNA (e.g. bisulfite sequencing).

Thank you very much, Schraga Schwartz

jrobinso commented 7 years ago

Got it. BTW is the application for this single-cell sequencing?

mgarber commented 7 years ago

Would the ewig file format you created for the pi conservation vector work for this?

jrobinso commented 7 years ago

Hi Manuel! The format might be the right one, but I think Schragi wants to display lots of these tracks together in a sort of flat, heatmap style (not a wiggle bar chart). @SchragaSchwartz, to see what Manuel is referring to, load this track by URL with genome hg19: http://www.broadinstitute.org/igvdata/hg19/pi.ewig.tdf

SchragaSchwartz commented 7 years ago

Sorry for only commenting now. @jrobinso The application is quite general, and would be useful for identification of new mutations in DNA sequencing data (you often want to inspect the region in which you found the new mutation, to ensure that you are not finding a bunch of other 'mutations' around the place, typically the hallmark of misalignments), for looking at editing / modified sites in RNA-sequencing data (where distinct types of modification have distinct mutation patterns), for bisulfite sequencing (where you want to look at ratios of conversion) and all of this could be done based either on bulk sequencing or for single cell sequencing.

And indeed, a flat, heatmap style that presents all four nucleotides (and does so for multiple samples) would be ideal.

jrobinso commented 7 years ago

OK. But to clarify, you would want to do this with multiple samples, correct?

How would you assign color? For example a cell with 37 "c", 6 "t", and 2 deletions?

If you are looking for mutation patterns it might be best to only color based on mismatch to reference.


SchragaSchwartz commented 7 years ago

Yes, I would be interested in being able to compare such mutational heatmaps from multiple files.

It would be good to leave some flexibility to the user in terms of what is presented. In many cases, users may not care that much about insertions and deletions, and mostly care to emphasize the nucleotide composition; but in other cases they may care about them. So a checkbox providing control over this would be great. And I fully agree that a mode in which nucleotide composition is shown only when it diverges substantially (over a given user-defined threshold) from the reference would be great.


jrobinso commented 7 years ago

Hi Schragi, I perhaps didn't state my question well, or mixed too many in one email. For any position there is no single nucleotide "call", but a composition. How would you assign color when there is a mix of nucleotides at a position?


SchragaSchwartz commented 7 years ago

I was thinking of displaying things very much along the same lines as it is currently done for bam files. If there are 40 As, 20 Cs, 20 Ts and 20 Gs aligning to a certain position, then there would be 4 colors, with the color corresponding to 'A' taking up 40% of the height, and each of the other 3 nucleotides taking up 20% of the height. The total height of the bar could either correspond to 1 (a set value) or correspond to the total coverage at that position (as is performed for the display of bam files). If the user is interested in visualizing insertions and deletions, then they, too, would get assigned a color, and be displayed based on their frequency.

Given that there is a reference base in the genome for that position, I would still retain the option (which is already implemented for bam files) that if >95% (or any user-specified value) of the bases at a given position correspond to the reference, the entire base be visualized in grey.
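The stacked-bar layout described above (each nucleotide taking a share of the bar proportional to its frequency, with insertions and deletions optionally included) could be computed roughly as follows. This is a sketch under stated assumptions: the function name, the `"INS"`/`"DEL"` keys, and the fixed A, C, G, T stacking order are invented for illustration and do not describe how IGV draws its coverage track.

```python
from typing import Dict, List, Tuple


def stack_segments(counts: Dict[str, int], total_height: float = 1.0,
                   include_indels: bool = False) -> List[Tuple[str, float]]:
    """Split one bar into colored segments proportional to the counts.

    'total_height' is 1.0 for a fixed-height bar, or the coverage at the
    position when the bar should scale with depth.
    """
    keys = ["A", "C", "G", "T"]
    if include_indels:
        keys += ["INS", "DEL"]  # hypothetical keys for indel counts
    shown = {key: counts.get(key, 0) for key in keys}
    total = sum(shown.values())
    if total == 0:
        return []
    # Segments with a zero count are omitted; the rest keep the key order.
    return [(key, total_height * n / total) for key, n in shown.items() if n > 0]


# The 40/20/20/20 example from the comment above, with a fixed height of 1:
segments = stack_segments({"A": 40, "C": 20, "T": 20, "G": 20})
```

With the earlier example of 37 "c", 6 "t" and 2 deletions, passing `include_indels=True` and `total_height=45` gives segments of height 37, 6 and 2.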

Finally, it would be nice to be able to have strand-specific control over this track, such that if you are looking at the 'forward' strand it will show the bases (and mutations) based on that, whereas if you are looking at the 'reverse' strand (by clicking on the arrow by the sequence track) it will complement the bases and the mutations associated with them.
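The strand toggle amounts to complementing both the reference base and the per-base counts when the reverse strand is displayed. A minimal sketch, assuming counts are stored per forward-strand base (the function names are hypothetical):

```python
from typing import Dict, Tuple

COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def complement_counts(counts: Dict[str, int]) -> Dict[str, int]:
    """Relabel each count with the complementary base; other keys are kept."""
    return {COMPLEMENT.get(base, base): n for base, n in counts.items()}


def counts_for_strand(counts: Dict[str, int], ref_base: str,
                      reverse: bool) -> Tuple[Dict[str, int], str]:
    """Return the counts and reference base as seen on the displayed strand."""
    if not reverse:
        return counts, ref_base
    return complement_counts(counts), COMPLEMENT.get(ref_base, ref_base)
```

For example, a C-to-T conversion seen on the forward strand would be reported as G-to-A when the reverse strand is selected.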

Does this answer your question? I hope we are not responding in parallel to each other. I'd be happy to coordinate a phone call if you think that can help clarify.

Best, Schragi


SchragaSchwartz commented 5 years ago

Dear Jim, I recently received this notification that the thread was closed, and I am hoping that this may imply that this feature has been implemented (though I realize the other possibility is that you have decided not to implement it). If it has been implemented, I'd love to try it out! We have a lot of need for this. Thank you very much, Schragi


jrobinso commented 5 years ago

Hi @SchragaSchwartz, I had something like 300 open issues, plus ~100 in igv.js, and a few dozen in juicebox. Recognizing that I will never be able to do all of these, stale (old) issues were closed. The thinking was that if they are important they will be reopened. So the system has worked. I will re-open this one.

I will point out that this is not an IGV request per se, or exclusively, but a request for an off-line bam processing tool + IGV support for the resulting track.
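To make the off-line processing step concrete, here is a deliberately simplified Python sketch of what such a tool would compute: a table of per-position base counts written as tab-delimited rows. It assumes reads are already available as (0-based start, sequence) pairs with ungapped alignments; a real tool would read the bam file with a library such as pysam or htsjdk and handle CIGAR operations, clipping and indels. The function names and the output column layout are assumptions, not an existing IGV format.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def tally_bases(reads: Iterable[Tuple[int, str]]) -> Dict[int, Dict[str, int]]:
    """Count A/C/G/T at every reference position covered by the reads.

    Each read is (start, sequence) with a 0-based start and no gaps.
    """
    table = defaultdict(lambda: {"A": 0, "C": 0, "G": 0, "T": 0})
    for start, seq in reads:
        for offset, base in enumerate(seq.upper()):
            if base in "ACGT":  # skip N and other ambiguity codes
                table[start + offset][base] += 1
    return dict(table)


def to_rows(chrom: str, table: Dict[int, Dict[str, int]]) -> List[str]:
    """Format the table as 'chrom, position, A, C, G, T' tab-delimited rows."""
    rows = []
    for pos in sorted(table):
        c = table[pos]
        rows.append(f"{chrom}\t{pos}\t{c['A']}\t{c['C']}\t{c['G']}\t{c['T']}")
    return rows


rows = to_rows("chr1", tally_bases([(100, "ACGT"), (101, "CGTA"), (102, "GG")]))
```

The resulting table is much smaller than the alignments it was derived from, which is the point of the 'lighter' track requested at the top of the thread.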

I think @mgarber might be correct on the eWig track, let's start with that anyway. Start IGV, select genome hg19, then File > Load from Server > Annotations > Comparative Genomics > siphy pi. Play around with that track and imagine it displaying sequence counts from a bam file. Is this what you envision?

[image: screen shot 2018-11-16 at 9 07 15 am] https://user-images.githubusercontent.com/933148/48636188-13f8f000-e97f-11e8-8f80-ce8509ee2fab.png

SchragaSchwartz commented 5 years ago

This track looks perfect! For RNA-seq (or DNA-seq) data it would typically look a lot less colorful, and in that sense it could be nice to implement the kind of option you already have in your coverage tracks that if over X% (where X is defined by the user) of the reads correspond to WT, everything is plotted in grey. But visually these tracks combine all the features I would want - they give a representation of both sequence composition and coverage in a single track.


SchragaSchwartz commented 5 years ago

Though the one additional possibility it would be nice to have is an ability to hover over the track, and to get summaries of read counts (again, as you have in your track summarizing coverage).
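A hover summary of this kind is straightforward to derive from the same per-position counts. The sketch below is illustrative only; the function name and the text layout are assumptions and do not reproduce IGV's actual coverage popup.

```python
from typing import Dict


def popup_text(chrom: str, pos: int, counts: Dict[str, int]) -> str:
    """Build a short text summary of read counts at one position."""
    total = sum(counts.values())
    lines = [f"{chrom}:{pos}", f"Total count: {total}"]
    for base in "ACGT":
        n = counts.get(base, 0)
        pct = 100.0 * n / total if total else 0.0
        lines.append(f"{base}: {n} ({pct:.0f}%)")
    return "\n".join(lines)
```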


jrobinso commented 5 years ago

@SchragaSchwartz There's an important design consideration that will affect how this is implemented. Do you envision the need to see data zoomed out past the level at which you can distinguish individual bases? Using colored bars instead of letters, this corresponds to a ~1kb window. If yes, we need to think about how positions sharing a pixel are combined. It also affects the choice of file formats.
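To illustrate what "combining positions sharing a pixel" could mean, the sketch below sums the per-base counts of all positions that map to the same pixel. Summation is just one possible rule, chosen here for simplicity; it can dilute a single mismatching position among many reference-like neighbors, so a real implementation might prefer something else (for example, showing the position with the highest mismatch fraction). The function name and the half-open `[start, end)` window are assumptions.

```python
from typing import Dict, List


def bin_counts(table: Dict[int, Dict[str, int]], start: int, end: int,
               n_pixels: int) -> List[Dict[str, int]]:
    """Sum per-base counts of all positions that fall on the same pixel.

    'table' maps position -> base counts; the view covers [start, end).
    """
    span = end - start
    bins = [{"A": 0, "C": 0, "G": 0, "T": 0} for _ in range(n_pixels)]
    for pos, counts in table.items():
        if start <= pos < end:
            pixel = (pos - start) * n_pixels // span
            for base in "ACGT":
                bins[pixel][base] += counts.get(base, 0)
    return bins
```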

SchragaSchwartz commented 5 years ago

I don't anticipate such a need. Typically what we want to do is compare the mutational/SNP profile in RNA/DNA, and for that we zoom in on the specific position or a region of several hundred bases surrounding it.
