Closed by jameshadfield 5 days ago
Awesome! This behavior looks spot on to me. Here's the current H5N1 cattle outbreak
Note that recent SRA tips are not completely gray. For example, the top clade descends from viruses sampled from South Dakota. These are appropriately colored a gray/green indicating potential South Dakota, but with little certainty.
The interpolation between SRA tips close to known Ohio viruses in blue and the Michigan human case in lime also seems very appropriate.
I think Auspice is now doing exactly what it should be doing. However, we should still have a more data-informed way to decide how to set --sampling-bias-correction. We could do leave-10%-out cross-validation, as Gytis did in the 2019 BMC Evol Biol paper.
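As a rough illustration of the leave-10%-out idea, here is a minimal Python sketch. It is not the procedure from the paper and not an augur API: `reconstruct` and `make_reconstruct` are hypothetical callbacks that would wrap a real DTA run for a given --sampling-bias-correction value and return a predicted state for each masked tip.

```python
import random


def holdout_accuracy(tip_states, reconstruct, fraction=0.1, seed=0):
    """Mask a fraction of tip states, reconstruct them, and score the result.

    tip_states:  {tip_name: true_state}
    reconstruct: hypothetical callback (observed_states, masked_names) ->
                 {tip_name: predicted_state}; it would wrap the real inference.
    """
    rng = random.Random(seed)
    names = sorted(tip_states)
    n_masked = max(1, round(len(names) * fraction))
    masked = set(rng.sample(names, n_masked))
    # The inference only sees the states of tips that were not held out.
    observed = {n: s for n, s in tip_states.items() if n not in masked}
    predicted = reconstruct(observed, masked)
    correct = sum(predicted[n] == tip_states[n] for n in masked)
    return correct / n_masked


def pick_bias_correction(tip_states, candidates, make_reconstruct):
    """Score each candidate value and return (best_value, all_scores).

    make_reconstruct: hypothetical factory, candidate value -> reconstruct callback.
    """
    scores = {
        c: holdout_accuracy(tip_states, make_reconstruct(c)) for c in candidates
    }
    return max(scores, key=scores.get), scores
```

In practice one would probably repeat this over several random seeds and average the scores, since a single 10% draw can be noisy.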
How would one be able to differentiate the grey scale for uncertainty vs grey scale for unprovided colorings? For example, imagine if zika's region had uncertainty, it would be mixed in with the "Asia" grey colorings.
This is a really good point @joverlee521. The issue is that currently we use gray to mean either:
This uninteresting take can be seen here https://nextstrain.org/ncov/gisaid/north-america/6m@2020-05-01 for example. This felt semantically appropriate to distinguish focal samples from background samples.
I think this is okay however... This example does DTA on samples with a focal vs contextual color ramp so that uncertain nodes and contextual nodes are both gray. This feels okay and appropriate (perhaps not ideal, but not broken). It highlights clades that are more certain to be in a focal region.
That said, we should be fixing colorings like the Zika example so that a random location is not gray. In the Zika example, "Asia" should be blue, like it is for country.
How would one be able to differentiate the grey scale for uncertainty vs grey scale for unprovided colorings? For example, imagine if zika's region had uncertainty, it would be mixed in with the "Asia" grey colorings.
Very difficult at the moment! Let's continue discussion in [maybe] differentiate between nodes with uncertainty vs nodes missing from colour scale
The previous code conveyed uncertainty in node attrs for branches by making them appear grey-er, but we never implemented this for tips; most likely because we never had a dataset with such data when this was built.
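To make "appear grey-er" concrete, here is a small Python sketch of the general idea: compute the entropy of a node's confidence values and blend its colour towards grey in proportion. This is an illustration, not Auspice's code. The grey value, the normalisation by log(n), and all function names are assumptions.

```python
import math

GREY = (187, 187, 187)  # assumed neutral grey (#BBBBBB) for illustration


def shannon_entropy(confidence):
    """Entropy in nats of a {state: probability} mapping."""
    return -sum(p * math.log(p) for p in confidence.values() if p > 0)


def blend_towards_grey(rgb, weight):
    """Linearly interpolate rgb towards GREY.

    weight 0 keeps the colour unchanged, weight 1 returns GREY.
    """
    w = min(max(weight, 0.0), 1.0)
    return tuple(round(c + (g - c) * w) for c, g in zip(rgb, GREY))


def uncertainty_colour(rgb, confidence):
    """Grey out a colour according to normalised entropy (an assumed choice)."""
    n = len(confidence)
    max_entropy = math.log(n) if n > 1 else 1.0
    return blend_towards_grey(rgb, shannon_entropy(confidence) / max_entropy)
```

A node that is fully confident in one state keeps its colour, while a node split evenly between states comes out fully grey; intermediate confidences give the gray/green kind of mix described earlier in the thread.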
Here we use the same approach for tips as for branches, but with a slightly different parameterisation of the interpolation. The mapping of the entropy value into [0,1] (tipOpacityFunction) was chosen so that tips with no (or very little) uncertainty look unchanged from previous Auspice versions, and uncertainty makes them appear more similar to the branch colour (for an equivalent uncertainty).
There should be no visible changes for views without any uncertainty (genotype is a good one to use to test this), as well as traits where there is uncertainty in the dataset but not in the tips (e.g. ebola country / division reconstruction). Here's a side-by-side with the h5n1-cattle-flu dataset from https://github.com/nextstrain/avian-flu/pull/66, which identified this issue in Auspice (this PR LHS, current Auspice RHS):
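Returning to the entropy mapping described above, the sketch below shows one way a tip mapping and a branch mapping could differ only in their output range. It is a Python illustration under assumed parameters. The exponent, domain, and ranges are hypothetical and are not taken from Auspice's tipOpacityFunction.

```python
def make_pow_scale(exponent, domain, out_range):
    """Clamped power scale: maps x in domain to out_range via t ** exponent."""
    d0, d1 = domain
    r0, r1 = out_range

    def scale(x):
        # Clamp to the domain, normalise to [0, 1], then apply the power curve.
        t = (min(max(x, d0), d1) - d0) / (d1 - d0)
        return r0 + (r1 - r0) * t ** exponent

    return scale


# Hypothetical parameters: the returned value is the weight given to grey.
# Branches always carry some grey; tips start at zero so that tips with no
# uncertainty are drawn exactly as before.
branch_grey_weight = make_pow_scale(0.6, (0.0, 2.0), (0.4, 1.0))
tip_grey_weight = make_pow_scale(0.6, (0.0, 2.0), (0.0, 1.0))
```

With these assumed ranges, a tip with zero entropy gets a grey weight of 0, and the tip and branch weights converge as entropy approaches the top of the domain, which matches the intent that uncertain tips look more like their branches.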