nextstrain / auspice

Web app for visualizing pathogen evolution
https://docs.nextstrain.org/projects/auspice/
GNU Affero General Public License v3.0

Convey uncertainty via tip colors #1796

Closed jameshadfield closed 5 days ago

jameshadfield commented 1 week ago

The previous code conveyed uncertainty in node attrs for branches by making them appear greyer, but we never implemented this for tips, most likely because we didn't have a dataset with such data when this was built.

Here we use the same approach for tips as for branches, but with a slightly different parameterisation of the interpolation. The mapping of the entropy value into [0,1] (tipOpacityFunction) was chosen so that tips with no (or very little) uncertainty look unchanged from previous Auspice versions, and uncertainty makes them appear more similar to the branch colour (for an equivalent uncertainty).
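For readers unfamiliar with the approach, here is a minimal sketch of the idea in Python (Auspice itself is JavaScript, and this is not the PR's code). The function names, the grey value, the entropy range and the `ceiling` parameter are all assumptions chosen for illustration; the actual `tipOpacityFunction` parameterisation may differ.

```python
# Illustrative sketch only: blend a tip's colour toward grey as entropy rises.
# GREY, max_entropy and ceiling are assumed values, not Auspice's.
GREY = (187, 187, 187)  # assumed neutral grey (#bbbbbb)


def grey_weight(entropy, max_entropy=2.0, ceiling=0.8):
    """Map an entropy value onto [0, ceiling], clamping out-of-range input.

    Hypothetical stand-in for a function like tipOpacityFunction.
    """
    t = min(max(entropy / max_entropy, 0.0), 1.0)
    return t * ceiling


def blend_toward_grey(hex_colour, entropy):
    """Linearly interpolate each RGB channel between the colour and GREY."""
    rgb = tuple(int(hex_colour[i:i + 2], 16) for i in (1, 3, 5))
    w = grey_weight(entropy)
    mixed = tuple(round((1 - w) * c + w * g) for c, g in zip(rgb, GREY))
    return "#{:02x}{:02x}{:02x}".format(*mixed)
```

With zero entropy the weight is 0 and the colour comes back unchanged, which is the "no visible change without uncertainty" property described above. Capping the weight below 1 is one way to keep even very uncertain tips faintly tinted rather than fully grey.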

There should be no visible changes for views without any uncertainty (genotype is a good one to use to test this), as well as traits where there is uncertainty in the dataset but not in the tips (e.g. ebola country / division reconstruction). Here's a side-by-side with the h5n1-cattle-flu dataset from https://github.com/nextstrain/avian-flu/pull/66, which identified this issue in Auspice (this PR LHS, current Auspice RHS):

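[Image: side-by-side comparison of the h5n1-cattle-flu dataset, this PR on the left and current Auspice on the right]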

jameshadfield commented 1 week ago

Some URLs to compare this PR on nextstrain.org vs released Auspice on nextstrain.org:

Cattle-flu new & old

Zika (country) new & old. Note that this does have uncertainty for tips, which is kind of strange, but that's how augur traits currently works.

H3N2 (genotype view) new & old - no uncertainty here.

trvrb commented 1 week ago

Awesome! This behavior looks spot on to me. Here's the current H5N1 cattle outbreak

Screenshot 2024-06-28 at 11 42 31 AM

Note that recent SRA tips are not completely gray. For example, the top clade descends from viruses sampled from South Dakota. These are appropriately colored a gray/green indicating potential South Dakota, but with little certainty.

Screenshot 2024-06-28 at 11 45 19 AM
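As a rough illustration of why such tips end up part-grey: the entropy that drives the greying is a summary of how spread out a node's inferred location probabilities are. The sketch below computes Shannon entropy from a confidence mapping. The use of natural logarithms and the function name are assumptions, and it is not the code augur or Auspice runs.

```python
import math
from typing import Mapping


def shannon_entropy(confidence: Mapping[str, float]) -> float:
    """Shannon entropy (natural log, an assumption) of a confidence mapping.

    Values are normalised first, so they need not sum exactly to 1.
    """
    total = sum(confidence.values())
    if total <= 0:
        raise ValueError("confidence values must sum to a positive number")
    entropy = 0.0
    for value in confidence.values():
        p = value / total
        if p > 0:  # zero-probability states contribute nothing
            entropy -= p * math.log(p)
    return entropy
```

A tip whose location is known outright has a single state with probability 1 and entropy 0, so it keeps its full colour. A tip with probability split across several states has higher entropy and is drawn greyer, while keeping the hue of its most likely state.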

The interpolation between SRA tips close to known Ohio viruses in blue to the Michigan human case in lime also seems very appropriate.

Screenshot 2024-06-28 at 11 47 02 AM

I think Auspice is now doing exactly what it should be doing. However, we still need a more data-informed way to decide how to set --sampling-bias-correction. We could do leave-10%-out cross-validation, as Gytis did in the 2019 BMC Evol Biol paper.
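To make the cross-validation idea concrete, here is a hedged sketch of one way it could be structured. Everything in it is hypothetical: `infer` stands in for whatever would rerun the trait reconstruction with some tip locations masked, and the function names, scoring rule and parameters are assumptions, not an existing augur or Auspice API.

```python
import random
from typing import Callable, Dict, Sequence

# Hypothetical signature: takes the still-known tip locations plus a candidate
# bias-correction value, returns a predicted location for every tip.
Infer = Callable[[Dict[str, str], float], Dict[str, str]]


def holdout_accuracy(tip_locations: Dict[str, str], infer: Infer,
                     bias_correction: float,
                     holdout_fraction: float = 0.1, seed: int = 0) -> float:
    """Mask a fraction of tips, re-infer them, and score the predictions."""
    rng = random.Random(seed)
    tips = sorted(tip_locations)
    n_held = max(1, round(len(tips) * holdout_fraction))
    held = set(rng.sample(tips, n_held))
    training = {t: loc for t, loc in tip_locations.items() if t not in held}
    predicted = infer(training, bias_correction)
    correct = sum(predicted.get(t) == tip_locations[t] for t in held)
    return correct / n_held


def pick_bias_correction(tip_locations: Dict[str, str], infer: Infer,
                         candidates: Sequence[float]) -> float:
    """Return the candidate value with the best held-out accuracy."""
    return max(candidates,
               key=lambda c: holdout_accuracy(tip_locations, infer, c))
```

In practice one would probably repeat this over several random holdouts and compare a probabilistic score rather than plain accuracy, but the loop would have the same shape.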

trvrb commented 1 week ago

> How would one be able to differentiate the grey scale for uncertainty vs grey scale for unprovided colorings? For example, imagine if zika's region had uncertainty, it would be mixed in with the "Asia" grey colorings.

This is a really good point @joverlee521. The issue is that currently we use gray to mean either:

  1. Unknown or uncertain
  2. Uninteresting

The "uninteresting" usage can be seen at https://nextstrain.org/ncov/gisaid/north-america/6m@2020-05-01, for example. This felt semantically appropriate to distinguish focal samples from background samples.

I think this is okay however... This example does DTA on samples with a focal vs contextual color ramp so that uncertain nodes and contextual nodes are both gray. This feels okay and appropriate (perhaps not ideal, but not broken). It highlights clades that are more certain to be in a focal region.

That said, we should be fixing colorings like the Zika example so that random location is not gray. In the Zika example, "Asia" should be blue, like it is for country.

jameshadfield commented 5 days ago

> How would one be able to differentiate the grey scale for uncertainty vs grey scale for unprovided colorings? For example, imagine if zika's region had uncertainty, it would be mixed in with the "Asia" grey colorings.

Very difficult at the moment! Let's continue discussion in "[maybe] differentiate between nodes with uncertainty vs nodes missing from colour scale".