microbiomedata / nmdc-schema

National Microbiome Data Collaborative (NMDC) unified data model
https://microbiomedata.github.io/nmdc-schema/
Creative Commons Zero v1.0 Universal

Schema contributors should perform routine maintenance on their invalid example files #1594

Closed turbomam closed 1 month ago

turbomam commented 7 months ago

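It's obvious when a valid example file starts to fail, because make test won't complete.

However, we don't have any mechanism for checking whether an invalid example file becomes "more invalid", i.e. picks up faults beyond the one it was written to demonstrate. These rules would help: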

  1. the first part of an example data file's name, up to the first hyphen, must be an existing class name, except when an invalid data file is demonstrating that a class is undefined
  2. all invalid example data files should have a single fixable fault
  3. the single fault should be clearly and succinctly described in the part of the filename that follows the first hyphen
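
The naming half of these rules can be checked without running the validator. The following is a minimal sketch, not part of the repository: the function names and the KNOWN_CLASSES set are hypothetical, and a real check would read the class names from the schema rather than hard-coding them.

```python
from pathlib import Path

# Hypothetical subset of class names, hard-coded for illustration only.
# A real check would load the class names from the schema.
KNOWN_CLASSES = {"Database", "Study", "Biosample"}


def split_example_name(path):
    """Split 'Class-fault-description.yaml' into ('Class', 'fault-description')."""
    stem = Path(path).stem  # file name without directory or extension
    class_part, _, fault_part = stem.partition("-")  # split at the FIRST hyphen
    return class_part, fault_part


def check_example_name(path, known_classes=KNOWN_CLASSES):
    """Return a list of naming problems (empty if the name follows rules 1 and 3)."""
    class_part, fault_part = split_example_name(path)
    problems = []
    if class_part not in known_classes:
        # Rule 1 allows this when the file demonstrates an undefined class,
        # so a human still has to review anything flagged here.
        problems.append(f"{class_part!r} is not a known class name")
    if not fault_part:
        problems.append("no fault description after the first hyphen")
    return problems
```

For the file used in the command below, split_example_name returns the class name Database and the fault description studies-undefined-doi-slot. Whether that description is "clear and succinct" remains a judgment call for the reviewer.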

Rules 2 and 3 can be checked with linkml-validate, like this:

poetry run linkml-validate \
  --schema nmdc_schema/nmdc_materialized_patterns.yaml \
  --target-class Database src/data/invalid/Database-studies-undefined-doi-slot.yaml

Only one ERROR should be reported, and it should agree with the portion of the filename after the first hyphen:

INFO:root:Using SchemaView with im=None
[ERROR] [src/data/invalid/Database-studies-undefined-doi-slot.yaml/0] 
 Additional properties are not allowed ('doi' was unexpected) in /study_set/0
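
One way to automate the "only one ERROR" check is to count the error lines in the validator's output. The sketch below assumes that every violation is reported on a line beginning with "[ERROR]", as in the output above; that is an observation from this one example, not a documented guarantee. The helper names are hypothetical, and run_validator simply wraps the command shown earlier.

```python
import subprocess


def count_errors(validator_output):
    """Count lines that begin with '[ERROR]' (assumed to be one per violation)."""
    return sum(
        1
        for line in validator_output.splitlines()
        if line.lstrip().startswith("[ERROR]")
    )


def has_single_fault(validator_output):
    """True if exactly one ERROR line was reported, as rule 2 requires."""
    return count_errors(validator_output) == 1


def run_validator(schema, target_class, data_file):
    """Run the linkml-validate command shown above and return its combined output."""
    result = subprocess.run(
        ["poetry", "run", "linkml-validate",
         "--schema", schema,
         "--target-class", target_class,
         data_file],
        capture_output=True,
        text=True,
        check=False,  # a non-zero exit is expected for invalid example files
    )
    return result.stdout + result.stderr
```

Looping has_single_fault(run_validator(...)) over every file in src/data/invalid would flag any example that has drifted to zero or to several errors. It would not confirm that the one remaining error is the fault named in the filename.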

eecavanna commented 6 months ago

As an alternative to maintaining a set of invalid examples over time, maybe this repository could contain only valid examples, and the invalid ones could be generated on demand from those valid ones + the latest schema + a script that breaks slots according to some rules (e.g. if the schema specifies that a slot contains a string, store the number 1 in it; if the schema specifies that a slot is required, delete it; etc.). In that case, the test procedure would be:

  1. Run script to generate invalid examples
  2. Run tests
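
A rough sketch of what step 1 could look like, under heavy assumptions: SLOT_RULES is a hypothetical, simplified stand-in for what a real script would read from the schema, and the fault labels are made up for illustration. Each generated record is broken in exactly one way, which matches the single-fault rule proposed earlier in this issue.

```python
import copy

# Hypothetical, simplified slot descriptions for illustration only.
# A real script would derive these from the latest schema.
SLOT_RULES = {
    "id": {"range": "string", "required": True},
    "name": {"range": "string", "required": False},
}


def generate_invalid_variants(valid_record, slot_rules=SLOT_RULES):
    """Yield (fault_label, record) pairs; each record has exactly one broken slot."""
    for slot, rule in slot_rules.items():
        if slot not in valid_record:
            continue  # nothing to break if the valid example doesn't use the slot
        if rule.get("range") == "string":
            broken = copy.deepcopy(valid_record)  # never mutate the valid example
            broken[slot] = 1  # store a number where a string is expected
            yield f"{slot}-not-a-string", broken
        if rule.get("required"):
            broken = copy.deepcopy(valid_record)
            del broken[slot]  # drop a required slot
            yield f"{slot}-missing-required", broken
```

For a valid record containing both slots, this yields three variants: id-not-a-string, id-missing-required, and name-not-a-string. The labels could double as the fault description part of a generated filename.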

Generating test data programmatically (i.e. what I am describing here) does have some "code smell" to me.


Is there a way to get the validator to count the violations in a file?

eecavanna commented 6 months ago

Here's an open source tool (happens to be web-based) that people can use to generate data that is valid with respect to a given JSON Schema.

https://json-schema-faker.js.org/ (GitHub repo)

This tool generates valid data, but this GitHub Issue is about invalid data.

A tool that generates data that is valid, except for one slot, could be built upon this tool (e.g. do what this tool does, then target a single slot and change its value to be something that the schema says is not allowed there).

kheal commented 1 month ago

I suggest making this part of the PR template (https://github.com/microbiomedata/nmdc-schema/issues/1995)