nextstrain / augur

Pipeline components for real-time phylodynamic analysis
https://docs.nextstrain.org/projects/augur/
GNU Affero General Public License v3.0
268 stars 129 forks source link

[export] reduce numeric precision to reduce dataset size by ~30% #1512

Closed jameshadfield closed 2 months ago

jameshadfield commented 3 months ago

Description of proposed changes

Reduces the precision of numerical values in the Auspice JSON (node attrs, confidence & entropy values) whilst keeping enough precision so that the rendering / display in Auspice is unchanged.

Motivated by this slack thread

It'd be good to get a few different eyes on this PR as it affects every single dataset Augur produces.

Related work

The output of augur traits doesn't currently distinguish between a (terminal) node with a provided value vs an inferred one. They both look something like

{
    "value": "foo",
    "confidence": {"foo": 1.0},
    "entropy": -1.000088900581841e-12
}

We could reduce the JSON size a bit more if we were able to know the value wasn't inferred and thus drop the confidence & entropy fields from the export. (This may require some small Auspice changes as well.)

Checklist

codecov[bot] commented 3 months ago

Codecov Report

Attention: Patch coverage is 93.47826% with 3 lines in your changes missing coverage. Please review.

Project coverage is 70.09%. Comparing base (eeb6237) to head (5f1a538). Report is 1 commits behind head on master.

Files Patch % Lines
augur/export_v2.py 95.55% 2 Missing :warning:
augur/traits.py 0.00% 1 Missing :warning:
Additional details and impacted files ```diff @@ Coverage Diff @@ ## master #1512 +/- ## ========================================== + Coverage 69.90% 70.09% +0.19% ========================================== Files 75 75 Lines 7888 7923 +35 Branches 1933 1938 +5 ========================================== + Hits 5514 5554 +40 + Misses 2088 2085 -3 + Partials 286 284 -2 ```

:umbrella: View full report in Codecov by Sentry.
:loudspeaker: Have feedback on the report? Share it here.

jameshadfield commented 3 months ago

Updated the code a bit and added unit tests

jameshadfield commented 2 months ago

Docs CI failures due to https://github.com/sphinx-doc/sphinx-argparse/issues/56, but all tests pass.

Merging this PR now as per this advice on Slack:

we added the docs check not too long ago and now it's giving a false positive. I'd just ignore the failure for [this] PR. We should probably add some temporary pin but that can be a separate PR.