igvteam / igv.js

Embeddable genomic visualization component based on the Integrative Genomics Viewer
MIT License

New tracktype: multi-scale heatmap #1595

Open biork opened 1 year ago

biork commented 1 year ago

A proposal for a multi-scale Heatmap Track

[Image: multi-scale-heatmaps]

Motivation

This track type was motivated by multiscale genomic analyses such as https://pubmed.ncbi.nlm.nih.gov/24727652/

A heatmap typically maps continuous data to one of two types of color palettes, depending on the distribution of the data to be visualized:

  1. a "sequential" palette, dark to light or muted to saturated, for data from one-sided distributions (e.g. count data)
  2. a "diverging" palette with contrasting colors at the upper and lower extremes and a muted color in the center, for data from two-sided distributions (e.g. Gaussian)

Implementation Overview

My current implementation tries to balance:

  1. minimizing invasive changes to existing IGV code
  2. minimizing impact on performance
  3. maximizing generality of the new capability (add capability, not policy)
  4. minimizing new configuration requirements on the user

Implementation rationale

Because a heatmap can be thought of as nothing more than multiple rows of densely-packed annotation features (typically without any additional decoration like labels), the FeatureTrack that supports general annotations already has all facilities required to support heatmaps. In particular,

  1. All the behavior concerning feature visibility is used as-is.
  2. Its ability to render multiple rows within one track is used.
  3. Only some small modifications to its behavior were necessary:
    • Input data, not packFeatures, assigns Features to rows.
    • No decoration of heatmap cells is supported.
    • Data is mapped at runtime to colors used to fill heatmap cells (without borders).
    • A default mapping function and several palettes are provided, but all can be overridden in config.

These changes have almost no impact on performance, and what little performance impact there is could be mitigated with slightly more invasive changes.

To keep the IGV implementation as simple (and fast) as possible, the data is expected to be fully preprocessed for display; all IGV does at runtime is map numeric values to colors using a colormap function and a palette.

Concretely, data should be in [0,1]. Values below and above this range are by default clamped to 0 and 1 respectively and thus mapped to the palette's edge colors. Also, two discrete "outlier" colors can optionally be provided in the track config to highlight outlying data instead of just using the palette's "edges" (a very good idea I first saw in matplotlib).
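The clamping and outlier behavior described above can be sketched roughly as follows. This is only an illustration: the function name linearColormap, its signature, and the outliers.below / outliers.above option names are assumptions made here, not the actual exports of multiscalehm.js.

```javascript
// Hypothetical sketch: names and signature are illustrative only.
// Maps a value expected in [0,1] to one entry of a palette (an array of CSS colors).
function linearColormap(value, palette, outliers = {}) {
    // Out-of-range values get a discrete outlier color if one was configured,
    // otherwise they fall back to the palette's edge colors (i.e. clamping).
    if (value < 0) {
        return outliers.below ?? palette[0];
    }
    if (value > 1) {
        return outliers.above ?? palette[palette.length - 1];
    }
    // Scale [0,1] onto palette indices; value === 1 lands on the last entry.
    const index = Math.min(palette.length - 1, Math.floor(value * palette.length));
    return palette[index];
}

// A tiny 4-color sequential palette, for demonstration only.
const grays = ["#000000", "#555555", "#aaaaaa", "#ffffff"];
const inRange = linearColormap(0.5, grays);
const flagged = linearColormap(1.5, grays, { above: "#ff0000" });
```

With no outlier colors configured, values outside [0,1] simply take the first or last palette entry.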

Data is delivered in BED files

Given the preceding characterization of heatmaps, it is natural to deliver heatmap data as BED files with a very minor abuse of the format: the 4th (name) column contains a 0-based row assignment. The name field in BED files can be thought of as naming the scale of the data (corresponding to a row). Since genome coordinate ranges in heatmap data would not typically be associated with other names, this is not much of an abuse of the BED format. The 5th (score) column is used for its intended purpose: a score.

This arrangement also allows additional runtime optimizations:

  1. The muted color corresponding to the most "uninteresting" data range is provided by config.altColor. Thus, no cells that would be mapped to this value need to be included in the input data! The data is thereby minified.
  2. The range of values mapped to this color can be adjusted at data preparation time to effectively compress the data with minimal loss of information (similar to what is done when preparing a JPEG image).
  3. Similarly, adjacent cells that are not "too" different could be optionally merged during data preparation to further reduce size.

These data preparation optimizations are, of course, optional but advisable in the interest of performance.
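As a rough sketch of what such a preparation step might look like, the hypothetical helper below drops cells whose score falls in an "uninteresting" band and merges adjacent, similar cells in the same row. The function name, the lo / hi / tolerance parameters, and the length-weighted averaging of merged scores are all assumptions made for illustration; none of this is part of the proposed IGV changes.

```javascript
// Hypothetical data-preparation helper, run before writing the BED file.
// Each cell is { chr, start, end, row, score }.
function minifyCells(cells, { lo, hi, tolerance }) {
    // Cells inside [lo, hi] are omitted; altColor shows through in their place.
    const kept = cells.filter((c) => c.score < lo || c.score > hi);
    // Order by chromosome, row, then start so mergeable neighbors are consecutive.
    kept.sort((a, b) =>
        a.chr.localeCompare(b.chr) || a.row - b.row || a.start - b.start);

    const merged = [];
    for (const cell of kept) {
        const prev = merged[merged.length - 1];
        const mergeable = prev !== undefined &&
            prev.chr === cell.chr &&
            prev.row === cell.row &&
            prev.end === cell.start &&
            Math.abs(prev.score - cell.score) <= tolerance;
        if (mergeable) {
            // Assumption: the merged cell takes the length-weighted mean score.
            const prevLen = prev.end - prev.start;
            const len = cell.end - cell.start;
            prev.score = (prev.score * prevLen + cell.score * len) / (prevLen + len);
            prev.end = cell.end;
        } else {
            merged.push({ ...cell });
        }
    }
    return merged;
}

const exampleCells = [
    { chr: "chr1", start: 0, end: 10, row: 0, score: 0.9 },
    { chr: "chr1", start: 10, end: 20, row: 0, score: 0.9 },
    { chr: "chr1", start: 20, end: 30, row: 0, score: 0.5 },
    { chr: "chr1", start: 30, end: 40, row: 0, score: 0.1 },
];
const minified = minifyCells(exampleCells, { lo: 0.4, hi: 0.6, tolerance: 0.05 });
```

In this example the two 0.9 cells are merged into one spanning 0 to 20, the 0.5 cell is dropped, and the 0.1 cell is kept as is.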

New files

Only one new JavaScript file, multiscalehm.js, is added providing:

  1. a default, linear colormap function,
  2. a small selection of 64-color palettes (some adapted from Matplotlib and others generated by the colorspace package from R), and
  3. a renderCell function called by FeatureTrack.draw.

Only the renderCell function is strictly necessary. The colormap function and palettes could be left to the user to define in the config, but since a suitable palette and colormap function are always needed and a linear map is the most common choice, providing these as defaults reduces work for the user. Importantly, both can still be defined entirely in the config, maximizing generality.

IGV code changes

With the above considerations only a few edits to IGV were necessary:

  1. A new config type is added: "multiscaleheat"
  2. Exports from multiscalehm.js are exposed in index.js to make the default colormap and palettes available to the user in their config.
  3. A one-line change to trackFactory.js mapping "multiscaleheat" to "feature".
  4. A conditional in TextFeatureSource.loadFeatures that parses the 4th column of a BED file as an integer and assigns the resulting number to the (already-existing) row attribute of Feature (pre-empting the call to packFeatures).
  5. Only a few changes in FeatureTrack:
    1. setting renderCell as the FeatureTrack.render method
    2. setting background color
    3. bypassing the code that "Ensure[s] a visible gap between features"

User requirements

The following should be set in the track config:

  1. maxRows - the number of rows in the heatmap
  2. height - the pixel height of the heatmap track. This is used verbatim. In particular, displayMode and its related variables are not used by heatmap tracks.
  3. color - a function taking a Feature as its sole argument.
  4. altColor - used as the "background", which supports sparser input data. This color should typically correspond to the most "uninteresting" color in the heatmap's palette.

Defaults are provided for everything, which ensures that something is displayed, though it will certainly not be ideal without user configuration, and it won't even be correct if maxRows is unset.
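A track config along these lines might look like the sketch below. The type, maxRows, height, color and altColor keys are the ones listed above; the url, the name, the three-color palette, and the inline color function are placeholders invented for illustration.

```javascript
// Illustrative three-color diverging palette (placeholder values).
const divergingPalette = ["#2166ac", "#f7f7f7", "#b2182b"];

const heatmapTrackConfig = {
    type: "multiscaleheat",
    format: "bed",
    name: "Example multi-scale heatmap",          // placeholder
    url: "https://example.org/multiscale.bed",    // placeholder URL
    maxRows: 8,            // number of rows (scales) in the heatmap
    height: 160,           // used verbatim: 160 / 8 = 20 pixels per row
    altColor: "#f7f7f7",   // background; matches the palette's "uninteresting" center
    // color receives a Feature and returns a CSS color string.
    color: (feature) => {
        const clamped = Math.min(1, Math.max(0, feature.score));
        const index = Math.min(
            divergingPalette.length - 1,
            Math.floor(clamped * divergingPalette.length));
        return divergingPalette[index];
    },
};
```

Because altColor equals the palette's center color here, cells with mid-range scores could be left out of the input file without changing the rendered result.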

As is, the implementation simply makes full use of the configured space, so each heatmap row is config.height / config.maxRows pixels tall. In particular, squishedRowHeight and related config variables are ignored, and no runtime adjustment of track height should occur.
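The geometry this implies can be sketched as follows. The helper name cellRect and its bpStart / bpPerPixel parameters (the genomic position of the left edge of the view and the number of base pairs per pixel) are assumptions made for illustration, not the actual renderCell code.

```javascript
// Hypothetical helper: computes the pixel rectangle for one heatmap cell.
function cellRect(feature, config, bpStart, bpPerPixel) {
    // Every row gets an equal share of the configured track height.
    const rowHeight = config.height / config.maxRows;
    const x = (feature.start - bpStart) / bpPerPixel;
    // Assumption: keep very short cells at least one pixel wide so they stay visible.
    const width = Math.max(1, (feature.end - feature.start) / bpPerPixel);
    return { x, y: feature.row * rowHeight, width, height: rowHeight };
}

const rect = cellRect(
    { start: 1000, end: 1100, row: 2 },
    { height: 100, maxRows: 4 },
    900,   // bpStart
    10);   // bpPerPixel
```

A renderer would then fill this rectangle with the color returned by the configured color function, without drawing any border.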

The maxRows config element could be made optional, since the largest row index can be inferred at runtime, but requiring maxRows simplifies the implementation because the row count is then known before the data is parsed. It may also be worth offering scaleCount as a more meaningful alias.

Input must come from BED files with:

  1. 0-based row number as the first of possibly multiple semi-colon-delimited subfields of the 4th column, and
  2. score in [0,1] in the 5th column
  3. The 6th (strand) column is ignored
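A line of such input could be parsed roughly as in the sketch below. The function name and the shape of the returned object are hypothetical; the actual change described above is a conditional inside TextFeatureSource.loadFeatures, which is not reproduced here.

```javascript
// Hypothetical parser for one line of heatmap BED input.
// Returns undefined for lines that do not carry a usable row and score.
function parseHeatmapBedLine(line) {
    const tokens = line.trim().split(/\s+/);
    if (tokens.length < 5) {
        return undefined;
    }
    // The row is the first semicolon-delimited subfield of the 4th (name) column.
    const row = Number.parseInt(tokens[3].split(";")[0], 10);
    const score = Number.parseFloat(tokens[4]);
    if (Number.isNaN(row) || row < 0 || Number.isNaN(score)) {
        return undefined;
    }
    // Any 6th (strand) column is deliberately ignored.
    return {
        chr: tokens[0],
        start: Number.parseInt(tokens[1], 10),
        end: Number.parseInt(tokens[2], 10),
        row,
        score,
    };
}

// Made-up example line: row 2, one extra subfield, score 0.75.
const parsedCell = parseHeatmapBedLine("chr1\t100\t200\t2;extra\t0.75\t+");
```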

As with my previous stacked bar graph, I'll submit a pull request if this is of interest to the group. Thanks, roger kramer, bioinformatician, University of Eastern Finland

jrobinso commented 1 year ago

The comment on #1594 would apply here as well. Overall there are too many changes to igv.js here to accommodate a track and file convention without a user community. Again, perhaps this illustrates the need for a "contrib" plugin capability. In this case you would need to supply the track and a parser, as you are in effect creating a new file format. So it's perhaps more difficult than #1594.

One meta comment: igv.js already has a heatmap track and format, "seg", for segmented copy number. It's possible this track and format ("seg" is a widely used standard format for copy number) would make a better basis than a BED track, with fewer special cases.