nextstrain / augur

Pipeline components for real-time phylodynamic analysis
https://docs.nextstrain.org/projects/augur/
GNU Affero General Public License v3.0
268 stars 128 forks source link

Improve validation output to identify problematic nodes / properties #1426

Open jameshadfield opened 9 months ago

jameshadfield commented 9 months ago

Validation against schemas is currently extremely useful at identifying when a dataset doesn't conform but often doesn't help pinpoint the part of the dataset that needs fixing.

As an example, this Auspice dataset ash_10samps.json (from https://github.com/nextstrain/auspice/issues/1756) failed validation in two places:

$ augur validate export-v2 ash_10samps.json
Validating schema of 'ash_10samps.json'...
  .tree {"children": [{"branch_attrs": {"labels": {"id":…} failed oneOf validation for [{"$ref": "#/$defs/tree"}, {"type": "array", "minItems": 1, "items": {"$ref": "#/$defs/tree"}}]
    validation for arm 0: {"$ref": "#/$defs/tree"}
      .tree {"children": [{"branch_attrs": {"labels": {"id":…} failed required validation for ["name", "node_attrs"]
    validation for arm 1: {"type": "array", "minItems": 1, "items": {"$ref": "#/$defs/tree"}}
      .tree {"children": [{"branch_attrs": {"labels": {"id":…} failed type validation for "array"
  .meta {"colorings": [{"key": "country", "title": "Coun…} failed required validation for ["updated", "panels"]
FATAL ERROR: Validation of 'ash_10samps.json' failed.

In instead we could change the error reporting to something like the following then debugging would be a whole lot easier, both for us internally and for external users:

- .tree {"children": [{"branch_attrs": {"labels": {"id":…} failed required validation for ["name", "node_attrs"]
+ .tree node name "wrapper" is missing the required "node_attrs" object
- .meta {"colorings": [{"key": "country", "title": "Coun…} failed required validation for ["updated", "panels"]
+ .meta object is missing the required "updated" property
jameshadfield commented 2 days ago

Brief notes running into this in a different context, our validation error code printed

.ancestral {"inference": "joint", "root_seq": false, "genom…} failed additionalProperties validation for false

Which isn't helpful. The first error.args value was helpful (in this case):

Additional properties are not allowed ('genome_root_seq' was unexpected)

Note that in this case error.validator_value = False which is where the "validation for false" phrasing comes from.