nextstrain / augur

Pipeline components for real-time phylodynamic analysis
https://docs.nextstrain.org/projects/augur/
GNU Affero General Public License v3.0
268 stars 128 forks source link

measurements validation accepts inputs that cannot be parsed by JavaScript parser in Auspice #1105

Open huddlej opened 1 year ago

huddlej commented 1 year ago

Current Behavior

It is possible to create a measurements JSON where one or more data values are NaN, causing Auspice to fail to load the measurements JSON with the following errors:

image

However, the corresponding measurements JSON validates with Augur without any errors:

$ augur validate measurements auspice/h1n1pdm_2y_ha_measurements.json
Validating schema of 'auspice/h1n1pdm_2y_ha_measurements.json'...

$

The Augur validate passes because the Python JSON parser accepts and parses the NaN values.

Expected behavior

Ideally, augur validate measurements would throw an error when it encounters syntax that is not supported by the JS JSON parser in Auspice.

How to reproduce

Create a measurements table like the following (from tests/functional/measurements_export/collection.tsv) where the third row has a missing value in the value column:

strain  value   field_1 field_2 field_3
strain_1    1.0 value_1 value_1 value_1
strain_2    2.0 value_2 value_2 value_2
strain_3        value_3 value_3 value_3

Run a minimal measurements export as with this functional test, using the TSV above:

  $ ${AUGUR} measurements export \
  >   --collection measurements_export/collection.tsv \
  >   --grouping-column field_1 \
  >   --output-json "$TMP/minimal_measurements.json" &>/dev/null

This command should produce an error when it encounters the missing value that gets rendered as NaN in the exported JSON.

Possible solution

There are a couple of possible solutions to consider:

  1. Render missing data in a JSON-compatible format instead of the default NaN. Since we use the pandas DataFrame.to_dict method when we export the measurements themselves, we could fill missing values with an empty string or something more appropriate prior to calling that method.
  2. Change validation logic to throw an error when validating the exported JSON with NaN values. This logic could maybe be implemented at the JSON schema level or at the load_json step of the validation by calling json.load with more restrictive settings.

As a user, I would prefer to have validation throw an error when it finds NaN values instead of having measurements quietly replace the NaNs with empty strings, since this validation step would let me catch a data issue earlier.

jameshadfield commented 1 year ago

As a user, I would prefer to have validation throw an error when it finds NaN values instead of having measurements quietly replace the NaNs with empty strings, since this validation step would let me catch a data issue earlier.

Me too! NaN is not valid JSON (technically) so we should avoid it for that reason too. Looks like if we use json.dumps(... ignore_nan=False) then this example would throw a ValueError (which we can catch...)

joverlee521 commented 1 year ago

Ah, maybe we should standardize how we handle JSONs with the new augur.io.json module added in the augur curate work.