microbiomedata / nmdc-schema

National Microbiome Data Collaborative (NMDC) unified data model
https://microbiomedata.github.io/nmdc-schema/
Creative Commons Zero v1.0 Universal

get `src/data/problem/valid/Database-img_mg_annotation_objects.yaml` to be clearly valid or clearly invalid, or delete it #1586

Open turbomam opened 8 months ago

turbomam commented 8 months ago

Hi @mbthornton-lbl and @pkalita-lbl

I could use some of your help with an example file that has been passing in nmdc-schema for the past 6 months but is failing with an ambiguous message in berkeley-schema-fy24.

I have been working on the nmdc:type slot in berkeley-schema-fy24 and am going to merge my work. In my branch's current state, everything builds, the valid examples pass, and the invalid files fail as expected.

This also succeeds:

poetry run linkml-validate \
  --schema src/schema/nmdc.yaml \
  --target-class Database src/data/problem/valid/Database-img_mg_annotation_objects.yaml

but if I

cp src/data/problem/valid/Database-img_mg_annotation_objects.yaml src/data/valid/Database-img_mg_annotation_objects.yaml
make squeaky-clean all test

I get a report like this for each

ValueError: Example src/data/valid/Database-img_mg_annotation_objects.yaml failed validation:
{'has_input': ['nmdc:b4b798cc9e7e9253ae8256a8237fd371'], 'git_url': 'https://img.jgi.doe.gov', 'name': 'MetagenomeAnnotation activity for gold:Gp0153825', 'has_output': ['nmdc:a1f2c190aa6d470f2eea681126e0470e', 'nmdc:7d69d28f4abec72a7ad66411312c37fb', 'nmdc:c3ea4b3caf0c86e27118b3ffd51014b8', 'nmdc:a79973ef9a0c96d13fa19b2725b21d17', 'nmdc:1055b8fab0f63a1e56312813f47897ec'], 'started_at_time': '2021-01-12T00:00:00+00:00', 'execution_resource': 'NERSC-Cori', 'part_of': 'nmdc:wfch-11-ab', 'type': 'nmdc:MetagenomeAnnotation', 'id': 'nmdc:wf-99-v7tNhU', 'ended_at_time': '2021-01-12T00:00:00+00:00'} is not valid under any of the given schemas in $.workflow_execution_set[0]

@pkalita-lbl can you please help us think of situations in which a data file would pass linkml-validate but fail in linkml-run-examples? Are they using two different generations of the validation code?

pkalita-lbl commented 8 months ago

can you please help us think of situations in which a data file would pass linkml-validate but fail in linkml-run-examples

Starting with LinkML 1.6.6 both of those CLIs use the same code under the hood. linkml-validate offers a lot more flexibility in terms of choosing loading and validating strategies, but by default both it and linkml-run-examples use the same strategy.

When I do a fresh checkout of the berkeley-schema-fy24 repo, run poetry install, and then run the linkml-validate command above, it does not succeed. It says essentially the same thing as what you reported for linkml-run-examples:

[ERROR] [src/data/problem/valid/Database-img_mg_annotation_objects.yaml/0] Additional properties are not allowed ('type' was unexpected) in /
[ERROR] [src/data/problem/valid/Database-img_mg_annotation_objects.yaml/0] {'has_input': ['nmdc:b4b798cc9e7e9253ae8256a8237fd371'], 'git_url': 'https://img.jgi.doe.gov', 'name': 'MetagenomeAnnotation activity for gold:Gp0153825', 'has_output': ['nmdc:a1f2c190aa6d470f2eea681126e0470e', 'nmdc:7d69d28f4abec72a7ad66411312c37fb', 'nmdc:c3ea4b3caf0c86e27118b3ffd51014b8', 'nmdc:a79973ef9a0c96d13fa19b2725b21d17', 'nmdc:1055b8fab0f63a1e56312813f47897ec'], 'started_at_time': '2021-01-12T00:00:00+00:00', 'execution_resource': 'NERSC-Cori', 'part_of': ['nmdc:wfch-11-ab'], 'type': 'nmdc:MetagenomeAnnotation', 'id': 'nmdc:wf-99-v7tNhU', 'ended_at_time': '2021-01-12T00:00:00+00:00'} is not valid under any of the given schemas in /workflow_execution_set/0

Admittedly the failure message here isn't great but it is saying "This object, starting with the has_input key, is not valid under any of the given schemas in /workflow_execution_set/0". Okay, so what is it expecting in /workflow_execution_set/0? Look at the JSON Schema and it has:

        "workflow_execution_set": {
            "description": "This property links a database object to the set of workflow activities.",
            "items": {
                "anyOf": [
                    {
                        "$ref": "#/$defs/WorkflowExecution"
                    },
                    {
                        "$ref": "#/$defs/MetagenomeAssembly"
                    },
                    {
                        "$ref": "#/$defs/MetatranscriptomeAssembly"
                    },
                    {
                        "$ref": "#/$defs/MetagenomeAnnotation"
                    },
                    {
                        "$ref": "#/$defs/MetatranscriptomeAnnotation"
                    },
                    {
                        "$ref": "#/$defs/MetatranscriptomeAnalysis"
                    },
                    {
                        "$ref": "#/$defs/MagsAnalysis"
                    },
                    {
                        "$ref": "#/$defs/MetagenomeSequencing"
                    },
                    {
                        "$ref": "#/$defs/ReadQcAnalysis"
                    },
                    {
                        "$ref": "#/$defs/ReadBasedTaxonomyAnalysis"
                    },
                    {
                        "$ref": "#/$defs/MetabolomicsAnalysis"
                    },
                    {
                        "$ref": "#/$defs/MetaproteomicsAnalysis"
                    },
                    {
                        "$ref": "#/$defs/NomAnalysis"
                    }
                ]
            },
            "type": "array"
        }

Okay so it's supposed to be an array of objects and each one needs to match one of any number of different $ref subschemas. Since the data has type: nmdc:MetagenomeAnnotation and that's one of the allowed subschemas let's look there first:
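The "look at the subschema named by type first" step can be sketched in Python. This is a hand-rolled illustration, not anything LinkML ships: the helper name candidate_ref is hypothetical, the anyOf list is abridged from the JSON Schema above, and it assumes the class name is whatever follows the nmdc: prefix in the type value.

```python
# Abridged copy of the workflow_execution_set "items" from the JSON Schema above.
ITEMS_SCHEMA = {
    "anyOf": [
        {"$ref": "#/$defs/WorkflowExecution"},
        {"$ref": "#/$defs/MetagenomeAssembly"},
        {"$ref": "#/$defs/MetagenomeAnnotation"},
        {"$ref": "#/$defs/ReadQcAnalysis"},
    ]
}


def candidate_ref(instance, items_schema):
    """Return the $ref whose class name matches the instance's `type`, or None.

    Hypothetical helper: assumes `type` is a CURIE such as
    'nmdc:MetagenomeAnnotation' and that $defs keys are bare class names.
    """
    type_value = instance.get("type", "")
    class_name = type_value.split(":", 1)[-1]  # drop the 'nmdc:' prefix
    wanted = "#/$defs/" + class_name
    for option in items_schema.get("anyOf", []):
        if option.get("$ref") == wanted:
            return wanted
    return None


record = {"id": "nmdc:wf-99-v7tNhU", "type": "nmdc:MetagenomeAnnotation"}
print(candidate_ref(record, ITEMS_SCHEMA))
```

If the helper returns None, the type value does not name any of the allowed subschemas, which would be a different problem from the one in this issue.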

        "MetagenomeAnnotation": {
            "additionalProperties": false,
            "description": "A workflow execution activity that provides functional and structural annotation of assembled metagenome contigs",
            "properties": {
                "alternative_identifiers": {
                    "description": "A list of alternative identifiers for the entity.",
                    "items": {
                        "pattern": "^[a-zA-Z0-9][a-zA-Z0-9_\\.]+:[a-zA-Z0-9_][a-zA-Z0-9_\\-\\/\\.,]*$",
                        "type": "string"
                    },
                    "type": "array"
                },
                "description": {
                    "description": "a human-readable description of a thing",
                    "type": "string"
                },
                "end_date": {
                    "description": "The date on which any process or activity was ended",
                    "type": "string"
                },
                "ended_at_time": {
                    "pattern": "^([\\+-]?\\d{4}(?!\\d{2}\\b))((-?)((0[1-9]|1[0-2])(\\3([12]\\d|0[1-9]|3[01]))?|W([0-4]\\d|5[0-2])(-?[1-7])?|(00[1-9]|0[1-9]\\d|[12]\\d{2}|3([0-5]\\d|6[1-6])))([T\\s]((([01]\\d|2[0-3])((:?)[0-5]\\d)?|24\\:?00)([\\.,]\\d+(?!:))?)?(\\17[0-5]\\d([\\.,]\\d+)?)?([zZ]|([\\+-])([01]\\d|2[0-3]):?([0-5]\\d)?)?)?)?$",
                    "type": "string"
                },
                "execution_resource": {
                    "$ref": "#/$defs/ExecutionResourceEnum"
                },
                "git_url": {
                    "type": "string"
                },
                "gold_analysis_project_identifiers": {
                    "description": "identifiers for corresponding analysis project in GOLD",
                    "items": {
                        "pattern": "^gold:Ga[0-9]+$",
                        "type": "string"
                    },
                    "type": "array"
                },
                "has_input": {
                    "description": "An input to a process.",
                    "items": {
                        "type": "string"
                    },
                    "type": "array"
                },
                "has_output": {
                    "description": "An output from a process.",
                    "items": {
                        "type": "string"
                    },
                    "type": "array"
                },
                "id": {
                    "description": "A unique identifier for a thing. Must be either a CURIE shorthand for a URI or a complete URI",
                    "pattern": "^[a-zA-Z0-9][a-zA-Z0-9_\\.]+:[a-zA-Z0-9_][a-zA-Z0-9_\\-\\/\\.,]*$",
                    "type": "string"
                },
                "instrument_used": {
                    "description": "What instrument was used during DataGeneration or MaterialProcessing.",
                    "items": {
                        "type": "string"
                    },
                    "type": "array"
                },
                "name": {
                    "description": "A human readable label for an entity",
                    "type": "string"
                },
                "part_of": {
                    "description": "The WorkflowChain that this WorkflowExecution is part of",
                    "type": "string"
                },
                "processing_institution": {
                    "$ref": "#/$defs/ProcessingInstitutionEnum",
                    "description": "The organization that processed the sample."
                },
                "protocol_link": {
                    "$ref": "#/$defs/Protocol"
                },
                "start_date": {
                    "description": "The date on which any process or activity was started",
                    "type": "string"
                },
                "started_at_time": {
                    "pattern": "^([\\+-]?\\d{4}(?!\\d{2}\\b))((-?)((0[1-9]|1[0-2])(\\3([12]\\d|0[1-9]|3[01]))?|W([0-4]\\d|5[0-2])(-?[1-7])?|(00[1-9]|0[1-9]\\d|[12]\\d{2}|3([0-5]\\d|6[1-6])))([T\\s]((([01]\\d|2[0-3])((:?)[0-5]\\d)?|24\\:?00)([\\.,]\\d+(?!:))?)?(\\17[0-5]\\d([\\.,]\\d+)?)?([zZ]|([\\+-])([01]\\d|2[0-3]):?([0-5]\\d)?)?)?)?$",
                    "type": "string"
                },
                "type": {
                    "description": "the id of the class that is instantiated by some data",
                    "enum": [
                        "nmdc:MetagenomeAnnotation"
                    ],
                    "type": "string"
                },
                "version": {
                    "type": "string"
                }
            },
            "required": [
                "ended_at_time",
                "execution_resource",
                "git_url",
                "part_of",
                "started_at_time",
                "has_input",
                "has_output",
                "id",
                "type"
            ],
            "title": "MetagenomeAnnotation",
            "type": "object"
        },

Is the data valid under that subschema? Nope, that says part_of is a string but we gave it an array.
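To make that check concrete, here is a minimal sketch that compares the failing record against an abridged copy of the MetagenomeAnnotation subschema. It is not a JSON Schema validator and not what linkml-validate runs internally; it only looks at required, additionalProperties, and top-level type, and the function name explain_failures is made up for illustration. Patterns, enums, and $ref targets are deliberately ignored.

```python
# Abridged MetagenomeAnnotation subschema: only the keywords this sketch checks.
SUBSCHEMA = {
    "additionalProperties": False,
    "properties": {
        "id": {"type": "string"},
        "type": {"type": "string"},
        "name": {"type": "string"},
        "git_url": {"type": "string"},
        "execution_resource": {},  # a $ref to an enum in the real schema; not checked here
        "started_at_time": {"type": "string"},
        "ended_at_time": {"type": "string"},
        "has_input": {"type": "array"},
        "has_output": {"type": "array"},
        "part_of": {"type": "string"},
    },
    "required": ["ended_at_time", "execution_resource", "git_url", "part_of",
                 "started_at_time", "has_input", "has_output", "id", "type"],
}

# Map JSON Schema type names to the Python types a parsed JSON/YAML value would have.
JSON_TYPES = {"string": str, "array": list, "object": dict}


def explain_failures(instance, subschema):
    """List the reasons `instance` does not fit `subschema` (hypothetical helper)."""
    problems = []
    props = subschema.get("properties", {})
    for key in subschema.get("required", []):
        if key not in instance:
            problems.append(f"missing required property '{key}'")
    for key, value in instance.items():
        if key not in props:
            if subschema.get("additionalProperties") is False:
                problems.append(f"unexpected property '{key}'")
            continue
        expected = props[key].get("type")
        if expected in JSON_TYPES and not isinstance(value, JSON_TYPES[expected]):
            problems.append(f"'{key}' should be {expected}, got {type(value).__name__}")
    return problems


# The record from the error message above, with has_output shortened.
record = {
    "has_input": ["nmdc:b4b798cc9e7e9253ae8256a8237fd371"],
    "git_url": "https://img.jgi.doe.gov",
    "name": "MetagenomeAnnotation activity for gold:Gp0153825",
    "has_output": ["nmdc:a1f2c190aa6d470f2eea681126e0470e"],
    "started_at_time": "2021-01-12T00:00:00+00:00",
    "execution_resource": "NERSC-Cori",
    "part_of": ["nmdc:wfch-11-ab"],
    "type": "nmdc:MetagenomeAnnotation",
    "id": "nmdc:wf-99-v7tNhU",
    "ended_at_time": "2021-01-12T00:00:00+00:00",
}

for problem in explain_failures(record, SUBSCHEMA):
    print(problem)
```

Running the same check against each subschema in the anyOf list is roughly what a more verbose validator reports for nested failures like this one.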

You can also copy the entire JSON Schema (gen-json-schema src/schema/nmdc.yaml | pbcopy) into https://www.jsonschemavalidator.net/ along with a JSON-ified version of your data instance. It gives much more verbose information for these types of nested subschema failures.