nextstrain / augur

Pipeline components for real-time phylodynamic analysis
https://docs.nextstrain.org/projects/augur/
GNU Affero General Public License v3.0

[export] format confidence values to save bytes #1505

Open jameshadfield opened 3 months ago

jameshadfield commented 3 months ago

Currently, confidence and entropy values appear in the JSON with more significant figures than are useful for Auspice, and we could reduce the JSON size by using fewer. Three should be enough. For example:

{
    "value": "Idaho",
    "confidence": {
        "Idaho": 0.1796449506405896,
        "Kansas": 0.09376349763398112,
        "Michigan": 0.1021510102718335,
        "Texas": 0.10847767046483789
    },
    "entropy": 2.26939563620878
}
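As a rough illustration of the rounding proposed here, a minimal sketch follows. The helper names `round_sig` and `format_confidence` are hypothetical and are not part of augur; where such a step would sit in the export code is left open.

```python
from math import floor, isfinite, log10


def round_sig(x: float, sig: int = 3) -> float:
    """Round x to `sig` significant figures (hypothetical helper, not augur API)."""
    # Zero, NaN and infinities have no meaningful order of magnitude; pass them through.
    if x == 0 or not isfinite(x):
        return x
    # Number of decimal places needed so that `sig` significant digits survive.
    decimals = sig - 1 - floor(log10(abs(x)))
    return round(x, decimals)


def format_confidence(confidence: dict, sig: int = 3) -> dict:
    """Return a copy of a confidence mapping with every value rounded."""
    return {state: round_sig(p, sig) for state, p in confidence.items()}


example = {
    "Idaho": 0.1796449506405896,
    "Kansas": 0.09376349763398112,
    "Michigan": 0.1021510102718335,
    "Texas": 0.10847767046483789,
}
rounded = format_confidence(example)
rounded_entropy = round_sig(2.26939563620878)
```

Rounding to significant figures rather than to a fixed number of decimal places keeps small probabilities such as 0.0938 from collapsing to 0.09 or 0.0.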

Additionally, augur traits will report uncertainty for tips where the value is known, and we should drop this during export, e.g.:

            "confidence": {
              "Cambodia": 1.0
            },
            "entropy": -1.000088900581841e-12,
            "value": "Cambodia"
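A sketch of how that drop might look, assuming that "known" can be detected as a single confidence entry that matches the value and is within a small tolerance of 1.0. Both the criterion and the name `strip_certain` are assumptions for illustration, not existing augur behaviour.

```python
def strip_certain(attr: dict, tol: float = 1e-6) -> dict:
    """Drop confidence/entropy when the trait value is effectively certain.

    Hypothetical helper: treats a single-entry confidence mapping whose only
    key is the reported value, with probability within `tol` of 1.0, as known.
    """
    conf = attr.get("confidence")
    is_certain = (
        isinstance(conf, dict)
        and len(conf) == 1
        and attr.get("value") in conf
        and abs(conf[attr["value"]] - 1.0) <= tol
    )
    if not is_certain:
        return attr
    # Keep everything except the uninformative uncertainty fields.
    return {k: v for k, v in attr.items() if k not in ("confidence", "entropy")}


known_tip = {
    "confidence": {"Cambodia": 1.0},
    "entropy": -1.000088900581841e-12,
    "value": "Cambodia",
}
stripped = strip_certain(known_tip)
```

Checking the confidence mapping rather than the entropy sidesteps the tiny negative entropy values that floating-point error produces for known tips.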

The confidence values are transferred from the node-data JSON via 👇, however I'm not sure whether any post-processing is done on them later on:

https://github.com/nextstrain/augur/blob/f6ee377336ec3813468fe641fa7910c14e54ced3/augur/export_v2.py#L798-L799

https://github.com/nextstrain/augur/blob/f6ee377336ec3813468fe641fa7910c14e54ced3/augur/export_v2.py#L779-L780

Within augur traits we do some pruning of the data:

https://github.com/nextstrain/augur/blob/f6ee377336ec3813468fe641fa7910c14e54ced3/augur/traits.py#L91