nextstrain / auspice

Web app for visualizing pathogen evolution
https://docs.nextstrain.org/projects/auspice/
GNU Affero General Public License v3.0

Proposal: Multi-colorby panel for titer visualization #1334

Closed trvrb closed 2 years ago

trvrb commented 3 years ago

This represents a strawman proposal for a generic approach to titer visualization that would be expressive enough to capture use cases for influenza ferret HI data as well as SARS-CoV-2 human neutralization data. I also wanted something that could extend beyond titers and work more generally for exploring arrays of "traits" attached to tips in the tree.

My basic proposal is to add another panel to Auspice that would be called something like "Traits" that would provide a grid plot of trait values (rows) against tips (columns) with a continuous color ramp to denote trait value. See here for an example for influenza with grid taken from a recent WHO flu report, so that rows are different reference sera and columns are different clades of H3N2:

titer-viz-1

This is one of the primary plots we use for flu reports and it seems worthwhile to target it here.

Here, I've added a dropdown to the sidebar to select the particular "Display by" for this panel. This would be separate from defined "color by" options as each of these would require a vector of traits for each tip. But other options in this particular case could include things like "FRA distance (sub model)" or "HI distance (tree model)" or "Log2 HI titer".

I've added a toggle for whether to "group" tips (columns). The grouping variable would always be the currently selected tree coloring. This would keep panels linked and would allow the structure of columns in the Traits panel to be related back to the tree. I.e., I wouldn't make "Group by" a separate dropdown in the sidebar.

In this example, if Color By was changed to "region" tips would be re-grouped according to region and average value for each region x serum combination would be shown instead:

titer-viz-2

This is mainly an example of how the visualization could work, but it would still be interesting to see if particular regions have antigenically different viruses circulating.

Furthermore, in this example, if the "Group by Clade" was toggled off, each tip would become a column (sorted by tree order):

titer-viz-3

I don't think this is so useful in this circumstance (with ~1500 tips), but you can see how this could be filtered to particular tips to see how these tips behave against the serum panel.


More generally, you could imagine encoding a variety of data in this format. Just as a few examples that immediately came to mind:

  1. "Mutational profile" where we collect some of these interesting S1 mutations vs S2 mutations vs NTD mutations vs RBD mutations that we're proposing as SARS-CoV-2 colorbys. If we grouped by clade it would make it clear that the clades 20H, 20I and 20J have significantly more S1 mutations in both NTD and RBD. Along these lines, you could also have a list of specific mutations (484K, 501Y, etc...) and construct a grid for these mutations.

titer-viz-4

  2. "DMS antibody escape" where we'd be plotting data from the Bloom Lab experiments shown here: https://jbloomlab.github.io/SARS2_RBD_Ab_escape_maps/. One set of data is escape from monoclonals in use ("antibodies" in the Bloom Lab site). It would be useful to know if particular clades have escaped particular antibodies. This could be accomplished via:

titer-viz-5

Similarly, you could want to see if mutations at 484 and 501 result in antibody escape which you could accomplish with a simple genotype coloring:

titer-viz-6

This approach could readily extend to individual human sera as measured through DMS or to synthetically pooled sera (convalescent sera vs vaccinee sera, etc...).


I think this viz approach would need the Auspice JSON to be supplemented with vectors of traits for each option in the Traits panel. For example, in the "mutational profile" scenario, each tip would need an additional "trait" that would be things like [3, 4, 2, ... and then we'd define row names via a new meta field of something like traits.
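
To make the data requirement concrete, here is a hedged sketch of what such a supplemented dataset could look like, written as a Python dictionary that serializes to JSON. The layout (`meta.traits` for row names, a per-tip `traits` entry for the vector), the key `mutational_profile`, and all values are assumptions for illustration, not the actual Auspice JSON schema.

```python
import json

# Hypothetical layout, NOT the real Auspice schema: row names live in a
# meta-level field and each tip carries a vector of the same length.
dataset = {
    "meta": {
        "traits": {
            "mutational_profile": ["S1", "S2", "NTD", "RBD"],
        }
    },
    "tips": [
        {"name": "tipA", "traits": {"mutational_profile": [3, 4, 2, 1]}},
        {"name": "tipB", "traits": {"mutational_profile": [0, 1, 0, 0]}},
    ],
}


def mismatched_tips(dataset):
    """Return names of tips whose vector length differs from the row names."""
    rows = dataset["meta"]["traits"]
    bad = []
    for tip in dataset["tips"]:
        for key, vector in tip["traits"].items():
            if key not in rows or len(vector) != len(rows[key]):
                bad.append(tip["name"])
                break
    return bad


serialized = json.dumps(dataset)
```

A check like `mismatched_tips` is the kind of validation that would be needed, since a vector is only interpretable alongside row names of the same length.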

("Traits" may be the wrong word here, given that augur traits is already taken)

The Sketch document used to generate these figures is available on GDrive as titer-viz-multi-colorby-v1.sketch.

jameshadfield commented 3 years ago

Thanks! Looking forward to more discussion on this. Here are my thoughts, especially how this could tie into the larger scale tree viz ideas we’ve been talking about. However I’m acutely aware that others here are more familiar with titer data and, just as importantly, how it’s being used by people outside the lab.

I’m going to use the term trait vector, rather than trait values, as the latter is easily confused with the trait values we currently have in auspice! It’s pretty easy to extend the JSON schema to accommodate something like this, and one could imagine being able to create such a vector out of existing node traits from within auspice itself. (I haven’t thought about a UI for this, but a first pass would be to allow the singular “Display” selectbox to cascade, like the dataset selectors.)

Switch the axes

I’m suggesting this for a couple of reasons. Firstly, in most cases the number of tips >> the length of the trait vector, so large panels could overflow vertically. Secondly, this would allow a “grid” view whereby tips line up with heat map rows, which is essentially multi-colour-bys. Would this be super confusing for people familiar with the current layout of the reports?

image

Grouping variables

Grouping variables would work similarly to your proposal when in “full” mode, but when in grid mode we open up the (future) possibilities to explore some tree manipulations we’ve been talking about, such as exploding trees or violin plots.

image

And then even more left-field ideas we’ve discussed such as specifying the violin-plot information for a single “tip” in the JSON, thus lifting the calculations to the JSON creator, and allowing auspice to display information which represents a much much larger number of tips than currently.

A note on variables

Collapsing non-continuous variables is going to be hard & I suggest restricting the first iteration to avoid this. There are interesting things we can do given enough pixels, so this isn’t an unachievable aim. Even for continuous variables, there’s a question of how we should collapse (e.g. mean, median, 95% spread). I like the fact that in your renderings the colour-scale of the heat-map is clearly different from that used for the colour-by to quickly convey that they are rendering different information.
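
The collapsing options mentioned for continuous variables can be sketched with the Python standard library. This is only an illustration: the function name `collapse` and the `how` labels are hypothetical, and reading "95% spread" as the 2.5th to 97.5th percentile range is an assumption.

```python
from statistics import mean, median, quantiles


def collapse(values, how="mean"):
    """Collapse the continuous values of a group of tips into one summary."""
    if how == "mean":
        return mean(values)
    if how == "median":
        return median(values)
    if how == "spread95":
        # n=40 yields cut points every 2.5%, so the first and last cut
        # points bound the central 95% of the values (needs >= 2 values).
        cuts = quantiles(values, n=40, method="inclusive")
        return (cuts[0], cuts[-1])
    raise ValueError(f"unknown collapse method: {how}")
```

The three summaries answer different questions, so the choice would probably need to be visible in the UI rather than fixed silently.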

trvrb commented 3 years ago

Thanks so much for the feedback James! 🌟

I hadn't considered the side-by-side tree/traits panel idea and I really like how you've co-opted this into the commonly used Phandango-esque way that multiple traits are generally displayed for a phylogeny (particularly in the bacterial genomics field). Very clever. I'd fully agree with the row/column swap. I picked tips as columns because I thought that compressing ~4000 columns into a bunch of thin vertical lines would look better than compressing ~4000 rows into a bunch of thin horizontal lines (given the usual panel aspect ratio). However, lining up with the tree is a strong argument for tips as rows.

The "group by" suggestion is also super interesting. This would suggest elevating "Group by" to be a top level component (perhaps under Filter by) given that it would affect multiple panels. (But could start with just grouping Traits panel if necessary).

And yes, I was implicitly thinking to limit "group by" to categorical and ordinal variables.

I also like the term trait vector, as in there would be "scalar traits" and "vector traits".

jbloom commented 3 years ago

I like this idea. One of my first impressions is to agree with @jameshadfield about making the viruses the rows, not the columns to make it parallel the tree arrangement.

As far as things like the RBD deep mutational scanning data, I like this idea. I do have some other ideas for specific traits that could be plotted in addition to the antibody-level ones you have there (for instance, we are working on generating more "summary statistics" that aggregate across lots of antibodies in a meaningful way), but I think that is probably a downstream question to implementing this.

huddlej commented 3 years ago

Internally, we've discussed some other alternative views of these types of data that I want to mention as part of this conversation. The views I'll show below strongly advocate for a spatial encoding of quantitative data (log2 titer distances) and color encoding of categorical data like clade, geography, etc. based on the idea that we make more effective interpretations of quantitative data with these encodings.

These views focus first on answering the specific question of "Which reference strain's serum covers the most strains in circulating clades?" without considering implementation details for Auspice or a standalone web visualization tool. That specific question can also be considered in two parts including a) which clade do we pick? and b) given that we want a vaccine strain for one of these few candidate clades, which serum best covers circulating strains?

These views also assume that a) we do not necessarily need the representation of titer distances to be connected with a phylogeny and b) we want these representations to function both as a static figure (as in a PDF report) and an interactive visualization (as in nextstrain.org).

Plot mean antigenic distance by color

image

For comparison with the views below that show log2 titer distance as a spatial encoding on the x-axis, this view encodes titer distance by both color and text and uses the x and y axes to represent test strain clades and reference strains, respectively. Reference strains on the y-axis are ordered by increasing mean log2 titer distance from all strains such that upper rows show the reference strains whose sera best cover all strains.

Ordering of the y-axis allows us to rapidly identify the reference strain that best covers the circulating clades. The color encoding of antigenic distance allows us to qualitatively determine how well the best reference covers each circulating clade and also how antigenically drifted each clade is from the reference strain’s serum. The text encoding of antigenic distance allows us to make quantitative comparisons between cells even when the color is ambiguous. The encoding of the test clade on the x-axis allows us to identify which clades do not have measurements against specific reference strains.
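
The y-axis ordering described above (reference strains sorted by increasing mean log2 titer distance across all test strains) can be sketched as follows. The tuple layout `(reference, clade, distance)` and the sample values are assumptions for illustration only.

```python
from statistics import mean

# Hypothetical measurements: (reference strain, test clade, log2 titer distance).
measurements = [
    ("ref_A", "clade_1", 0.5),
    ("ref_A", "clade_2", 1.5),
    ("ref_B", "clade_1", 3.0),
    ("ref_B", "clade_2", 2.0),
    ("ref_C", "clade_1", 0.2),
]


def order_references(measurements):
    """Sort reference strains by increasing mean distance to all test strains."""
    by_ref = {}
    for ref, _clade, distance in measurements:
        by_ref.setdefault(ref, []).append(distance)
    return sorted(by_ref, key=lambda ref: mean(by_ref[ref]))


row_order = order_references(measurements)
```

Note that in this toy data `ref_C` sorts first on the strength of a single measurement, which is the kind of case where the number of measurements behind each mean matters.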

This view does not allow us to view the uncertainty or variability of titer distances associated with each mean. Nor does it allow us to view the number of measurements associated with each mean. These secondary questions address our confidence in making the choice of one reference strain over another.

Below is a checklist of the different sub-use cases associated with the primary use case. I’ve checked the boxes that the heatmap view above addresses and will reuse this for the other plots below.

Plot mean antigenic distance (and CIs) on x-axis

image

Reference strains on the y-axis are annotated by their clade and ordered by increasing mean log2 titer distance from all strains such that upper rows show the reference strains whose sera best cover all strains. Within rows, diamonds and bars show the mean and 89% confidence interval (CI) by clade. The vertical gray line represents the traditional 4-fold titer drop (>2 log2 distance) used to identify strains that are antigenically distinct. Reference strains whose means and CIs are well to the left of this line more clearly cover circulating strains than those near or to the right of the line.
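
For readers who want to reproduce the summary statistics, here is one possible way to compute a mean with an 89% interval and compare it to the 2 log2 threshold. The thread does not say how the intervals in the figure were computed, so the percentile bootstrap used here, the function names, and the parameters `n_boot` and `seed` are all assumptions.

```python
import random
from statistics import mean


def mean_with_interval(values, level=0.89, n_boot=2000, seed=0):
    """Mean plus a percentile-bootstrap interval (an assumed method)."""
    rng = random.Random(seed)
    # Resample with replacement and record the mean of each resample.
    boots = sorted(
        mean(rng.choices(values, k=len(values))) for _ in range(n_boot)
    )
    tail = (1 - level) / 2
    lower = boots[int(tail * n_boot)]
    upper = boots[int((1 - tail) * n_boot) - 1]
    return mean(values), lower, upper


def clearly_covered(upper, threshold=2.0):
    """True when the whole interval sits left of the 4-fold (2 log2) line."""
    return upper < threshold
```

With very few measurements per reference/clade group, a bootstrap interval will be unreliable, which ties back to the concern about showing the number of measurements.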

Implementation in Auspice would require adding support for categorical (and, specifically, ordinal) values on either axis and support for plotting summary statistics like mean and CI.

Plot mean antigenic distance (and CIs) on x-axis with color by clade

image

This view expands on the simpler view above by plotting the mean and 89% CI antigenic distance between each reference strain and each clade of circulating test strains. Clades are ordered from top to bottom in decreasing global frequency such that the highest frequency clade (A1b/135K, here) appears first.

This view is effectively identical to the heatmap view above in that it shows mean distances by reference and test clade. The primary differences here are the encoding of distance on the x-axis instead of by color and the ability to represent variability of these distances on the x-axis with CIs. These encodings allow us to compare distances between clades for a given reference and compare distances for specific clades across references.

This view somewhat obscures the overall difference between reference strains, since each reference has multiple clade entries, although we retain that information in the ordering of the y-axis. This information could be explicitly encoded by the addition of an “all clades” entry to the clade color-by. We can’t easily determine which clades are missing measurements, but we could guess that the points without error bars lack enough data for error bars. We can also identify references that are missing specific clades, but identifying which clades are missing data requires extra work.
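
Identifying which reference/clade pairs have no measurements is cheap to compute even if it is hard to read off the plot. A minimal sketch, assuming measurements are `(reference, clade, distance)` tuples (a made-up layout for this example):

```python
from itertools import product


def missing_pairs(measurements, references, clades):
    """List (reference, clade) pairs that have no measurement at all."""
    seen = {(ref, clade) for ref, clade, _distance in measurements}
    return [pair for pair in product(references, clades) if pair not in seen]
```

The resulting list could drive an explicit "no data" marker in the plot, so that readers do not have to infer missing data from absent error bars.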

Implementation in Auspice would require adding support for categorical (and, specifically, ordinal) values on either axis, support for plotting summary statistics like mean and CI, and the ability to “dodge” color-by groups within a categorical variable’s row.

Plot all antigenic distances (with mean and CIs) on x-axis with color by clade

image

This view expands on the view above by plotting the pairwise log2 titer distances between each reference strain and test strain as points along with the original mean +/- 89% CIs from previous plots.

By showing all data points along with summary statistics, this view shows how many measurements are available in each group and allows us to address all* use cases below.

This view suffers from occlusion of specific data points, as many titer measurements within a reference/clade group overlap. Similarly, the mean and error bar markers partially obscure some data points. Although we can qualitatively determine which group has few measurements, we still have to make an extra effort to identify which clades are missing measurements.

Implementation in Auspice would require adding support for categorical (and, specifically, ordinal) values on either axis, support for plotting summary statistics like mean and CI, and the ability to “dodge” color-by groups within a categorical variable’s row.

Plot all antigenic distances on x-axis with color by clade

image

This view simplifies the previous views by removing summary statistics and vertical grouping (“dodging”) of clades within each reference strain’s row. This view shows the raw distribution of antigenic distances per reference strain with some information about which clade each data point represents.

As a static visualization, this view suffers even more from occlusion of data points and does not allow us to easily compare distances between clades for a given reference strain or across reference strains.

However, as an interactive visualization (e.g., within nextstrain.org), this view could be useful for comparison of distances for 1-3 clades within and between references. For example, the following alternate view could be produced in Auspice by filtering the data to only the clades A1b/135K and 3c3.A.

image

This filtered view allows us to compare distances between clades for a given reference strain and more easily identify which clades have too few measurements. If we only consider this filtered view (as in the case where we know which few clades should be considered as vaccine targets), we can mostly address the use cases below.

Implementation in Auspice would only require adding support for categorical (and, specifically, ordinal) values on either axis.

Takeaways

  1. Support for ordering reference strains by their mean distance to all test strains enables easy identification of “the best reference strain” based on a summary statistic. This approach applies equally well to heatmap and scatterplot views.
  2. Heatmap views cannot represent uncertainty or number of data points. The importance of these two data features in our decision-making should guide our choice of visualizations.
  3. None of the views above adequately allow quantitative comparisons of number of measurements for specific reference/clade groups. If presence/absence information from heatmaps or qualitative comparisons from scatterplots do not address our use cases, we should develop a separate view specifically for these count data.
  4. Adding support to Auspice for scatterplots by categorical variables could enable an effective interactive comparison of antigenic distances for 1-3 clades with minimal other development work.

rneher commented 3 years ago

Thanks a lot for these explicit examples, @huddlej!

One way in which number of data points could be incorporated into a heatmap type display would be to replace each square by a little disk whose size encodes number of data points.

My sense is that the categorical color encoding works pretty well for <6 categories, but becomes rather difficult beyond that.

huddlej commented 3 years ago

Here is a quick attempt at a heatmap punch card where color encodes log2 titer distance and the radius of the circle encodes the number of measurements for a given reference strain/test clade pair.

image
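
The data behind a punch card like this reduces to a mean (for color) and a count (for radius) per reference strain/test clade pair. The sketch below is an assumption-laden illustration: the tuple layout, the `max_radius` parameter, and the square-root scaling (chosen so that circle area, not radius, is proportional to the count) are not taken from the figure.

```python
from math import sqrt
from statistics import mean


def punch_card_cells(measurements, max_radius=10.0):
    """Summarize (reference, clade, distance) tuples into punch card cells."""
    groups = {}
    for ref, clade, distance in measurements:
        groups.setdefault((ref, clade), []).append(distance)
    largest = max(len(values) for values in groups.values())
    return {
        key: {
            "mean": mean(values),  # drives the color ramp
            "n": len(values),  # number of measurements
            # sqrt scaling keeps circle *area* proportional to n
            "radius": max_radius * sqrt(len(values) / largest),
        }
        for key, values in groups.items()
    }
```

Pairs with no measurements simply do not appear in the result, so they would render as empty cells.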

It would be worth revisiting some of the figures from our last report with these approaches, too, to get a better sense of real world comparisons we make (for example, how many categories do we actually compare in a season?)

jameshadfield commented 3 years ago

A few updates:

I've implemented a prototype of the multi-color-bys proposed at the start of this thread -- I consider this viz to be the "worst case" view, as it's representing all the (unrelated) colourings which are available. I don't think it disappoints - it's suitably ugly!

image

Were we to expose a group of colourings and force each to use the same scale it would both look nicer and be informative. An example of this would be to define different colourings for each reference sera (e.g. 10 colourings), as well as a meta-colouring which links to these 10. This would mean that a single scale is used. There will inevitably be times where we don't want to reuse the scale - e.g. metaColouring = [division, country, region].
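
A rough sketch of the shared-scale idea: compute one domain across every colouring in the group and normalise all values against it. The data layout (colouring name mapped to per-tip values) and the function names are hypothetical and do not reflect how Auspice stores colourings.

```python
def shared_domain(colourings, members):
    """Min and max across every colouring listed in a meta-colouring."""
    values = [
        value
        for name in members
        for value in colourings[name].values()
        if value is not None
    ]
    return (min(values), max(values))


def normalise(value, domain):
    """Map a value to [0, 1] on the shared scale."""
    low, high = domain
    return 0.0 if high == low else (value - low) / (high - low)


# Hypothetical per-serum colourings (tip name -> titer value, None = missing).
colourings = {
    "serum_A": {"tip1": 0.0, "tip2": 4.0},
    "serum_B": {"tip1": 2.0, "tip2": None},
}
domain = shared_domain(colourings, ["serum_A", "serum_B"])
```

For a group like [division, country, region] there is no meaningful shared domain, which is the case where each colouring would keep its own scale.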

I really liked some of @huddlej's examples above. While they are broadly similar to the current scatterplot functionality, they have one big difference. Implementing them would require auspice / phylotree to be able to render multiple values for a single node in the tree, and breaks the 1-1 mapping of tree nodes to DOM elements which we use throughout Auspice. One way this could be done is by extending the scatterplots to use a meta-colouring as described above. For instance, the "Plot all antigenic distances on x-axis with colour by clade" plot is a scatterplot representing the "per sera titer measurements" meta-color-by. I think this would only be possible / make sense if each colour-by in the meta-colour-by shares a scale.

The heatmap punch card is different again, as it's now encoding titer (the values of the selected scatterplot variables) as the colour and the number of measurements as the radius, thus freeing up the y-axis to group by clade. Possible, but getting into a pretty complicated UI here...

huddlej commented 3 years ago

@jameshadfield In the prototype figure above, does color correspond to titer measurement or clade membership? If they are titer measurements, are these mocked up or real data? I'd expect real data to look very sparse since most test viruses will only be measured against a specific subset of sera.

> While they are broadly similar to the current scatterplot functionality, they have one big difference. Implementing them would require auspice / phylotree to be able to render multiple values for a single node in the tree, and breaks the 1-1 mapping of tree nodes to DOM elements which we use throughout Auspice.

This is a tricky point, but it brings up the question that @trvrb originally asked when Alli and I presented our mockup from data viz class, which is whether a standalone titer viz app might be more appropriate than trying to shoehorn the most appropriate visualization into Auspice.

That was before you'd implemented scatterplots, though. Now, you could imagine treating the "tree view" as a subset of the broader scope of visualizations that a generic scatterplot could represent (e.g., multiple points in the plot for the same virus that appears once in the tree). Imagine if the Auspice JSON stored data in a tidy data frame format. Then you could make any kind of plot you like in the scatterplot interface. If you had a standard representation of how data points linked to each other in the tree (for example, each node stores a link to its parent and its own y-axis position in tree view), you could plot a tree, too. Tidy data frames (as opposed to the current recursive tree structure) provide other opportunities, too, like simplifying the creation of tabular displays of data and allowing others to consume JSONs directly for downstream analyses including basic exploration of the data with pandas, R's tidyverse, Mathematica, etc.
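
To illustrate the tidy representation, here is a minimal sketch that flattens a recursive tree into one row per node, each with a parent link and a y-axis position. The node keys (`name`, `children`) and the layout rule (tips get consecutive integers, internal nodes the mean of their children) are assumptions for the example, not the Auspice format or its layout algorithm.

```python
def tree_to_rows(root):
    """Flatten a nested {"name", "children"} tree into tidy rows."""
    rows = []
    next_y = 0

    def visit(node, parent_name):
        nonlocal next_y
        children = node.get("children", [])
        if children:
            child_ys = [visit(child, node["name"]) for child in children]
            y = sum(child_ys) / len(child_ys)  # assumed: midpoint of children
        else:
            y = float(next_y)  # tips are stacked in traversal order
            next_y += 1
        rows.append({"name": node["name"], "parent": parent_name, "y": y})
        return y

    visit(root, None)
    return rows


# Hypothetical three-tip tree.
tree = {
    "name": "root",
    "children": [
        {"name": "A"},
        {"name": "inner", "children": [{"name": "B"}, {"name": "C"}]},
    ],
}
rows = tree_to_rows(tree)
```

Rows of this shape can be loaded directly into a data frame, and a separate table of titer measurements keyed on `name` could then hold several rows for the same virus without touching the tree rows.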

Regarding the prototypes I posted above, the view with just the means +/- CIs split by clade got the most positive feedback recently from stakeholders. Even these static views could be really helpful moving forward, if we didn't want to commit to a full titer viz app.

joverlee521 commented 2 years ago

Implemented in https://github.com/nextstrain/auspice/pull/1452.

Further improvements and feature requests can be added to https://github.com/nextstrain/auspice/issues/1463.