microbiomedata / nmdc-schema

National Microbiome Data Collaborative (NMDC) unified data model
https://microbiomedata.github.io/nmdc-schema/
Creative Commons Zero v1.0 Universal

Add pattern validation of NMDC `id` mentions (in addition to `id` assertions) #1207

Closed turbomam closed 1 month ago

turbomam commented 8 months ago

cc @aclum

turbomam commented 8 months ago

start with any inbound or outbound relationship including

turbomam commented 8 months ago

I don't think LinkML does anything like this by default.

turbomam commented 8 months ago

@aclum has a report of inter-class relationships that we worked on together, but I just whipped this up too:

PREFIX nmdc: <https://w3id.org/nmdc/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
select
?p ?ot (count(?b) as ?bcount)
where {
    graph <mongodb://mongo-loadbalancer.nmdc.production.svc.spin.nersc.gov:27017> {
        ?b a nmdc:Biosample ;
           ?p ?o .
        ?o a ?ot .
    } minus 
    {
        ?o a ?vt .
        ?vt rdfs:subClassOf* nmdc:AttributeValue
    }
}
group by ?p ?ot

turbomam commented 8 months ago

There are some patterns like this in the data, from an RDF perspective:

turbomam commented 8 months ago

After adding most of the pattern constraints on slots that mention things with ids, but before updating any of the example data files, the example runner fails:

poetry run linkml-run-examples \
        --schema project/nmdc_schema_generated.yaml \
        --input-directory src/data/valid \
        --counter-example-input-directory src/data/invalid \
        --output-directory examples/output > examples/output/README.md

INFO:root:Using SchemaView with im=None
Traceback (most recent call last):
  File "/home/mark/.cache/pypoetry/virtualenvs/nmdc-schema-w12NqEaO-py3.9/lib/python3.9/site-packages/linkml/workspaces/example_runner.py", line 186, in process_examples_from_list
    validator.validate_dict(input_dict, tc, closed=True)
  File "/home/mark/.cache/pypoetry/virtualenvs/nmdc-schema-w12NqEaO-py3.9/lib/python3.9/site-packages/linkml/validators/jsonschemavalidator.py", line 97, in validate_dict
    raise JsonSchemaDataValidatorError(results)
linkml.validators.jsonschemavalidator.JsonSchemaDataValidatorError: 'gold:Gs0110115' does not match '^nmdc:stdy-[A-Z]{4}-[0-9]{4}-[0-9]{4}$' in $.biosample_set[0].part_of[0]
'gold:Gs0110115' does not match '^nmdc:stdy-[A-Z]{4}-[0-9]{4}-[0-9]{4}$' in $.biosample_set[1].part_of[0]
'gold:Gs0110115' does not match '^nmdc:stdy-[A-Z]{4}-[0-9]{4}-[0-9]{4}$' in $.biosample_set[2].part_of[0]
'gold:Gs0110115' does not match '^nmdc:stdy-[A-Z]{4}-[0-9]{4}-[0-9]{4}$' in $.biosample_set[3].part_of[0]

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/mark/.cache/pypoetry/virtualenvs/nmdc-schema-w12NqEaO-py3.9/bin/linkml-run-examples", line 8, in <module>
    sys.exit(cli())
  File "/home/mark/.cache/pypoetry/virtualenvs/nmdc-schema-w12NqEaO-py3.9/lib/python3.9/site-packages/click/core.py", line 1157, in __call__
    return self.main(*args, **kwargs)
  File "/home/mark/.cache/pypoetry/virtualenvs/nmdc-schema-w12NqEaO-py3.9/lib/python3.9/site-packages/click/core.py", line 1078, in main
    rv = self.invoke(ctx)
  File "/home/mark/.cache/pypoetry/virtualenvs/nmdc-schema-w12NqEaO-py3.9/lib/python3.9/site-packages/click/core.py", line 1434, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/home/mark/.cache/pypoetry/virtualenvs/nmdc-schema-w12NqEaO-py3.9/lib/python3.9/site-packages/click/core.py", line 783, in invoke
    return __callback(*args, **kwargs)
  File "/home/mark/.cache/pypoetry/virtualenvs/nmdc-schema-w12NqEaO-py3.9/lib/python3.9/site-packages/linkml/workspaces/example_runner.py", line 319, in cli
    runner.process_examples()
  File "/home/mark/.cache/pypoetry/virtualenvs/nmdc-schema-w12NqEaO-py3.9/lib/python3.9/site-packages/linkml/workspaces/example_runner.py", line 139, in process_examples
    self.process_examples_from_list(input_examples, fmt, False)
  File "/home/mark/.cache/pypoetry/virtualenvs/nmdc-schema-w12NqEaO-py3.9/lib/python3.9/site-packages/linkml/workspaces/example_runner.py", line 192, in process_examples_from_list
    raise ValueError(f"Example {input_example} failed validation:\n{e}")
ValueError: Example src/data/valid/Database-biosamples-1.yaml failed validation:
'gold:Gs0110115' does not match '^nmdc:stdy-[A-Z]{4}-[0-9]{4}-[0-9]{4}$' in $.biosample_set[0].part_of[0]
'gold:Gs0110115' does not match '^nmdc:stdy-[A-Z]{4}-[0-9]{4}-[0-9]{4}$' in $.biosample_set[1].part_of[0]
'gold:Gs0110115' does not match '^nmdc:stdy-[A-Z]{4}-[0-9]{4}-[0-9]{4}$' in $.biosample_set[2].part_of[0]
'gold:Gs0110115' does not match '^nmdc:stdy-[A-Z]{4}-[0-9]{4}-[0-9]{4}$' in $.biosample_set[3].part_of[0]
make: *** [project.Makefile:276: examples/output] Error 1
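
For readers who want to see why these four values are rejected, here is a minimal sketch in plain Python that applies the pattern quoted in the error to `part_of` mentions. It is not how LinkML or its JSON Schema validator works internally. The helper name `failing_mentions`, the record ids, and the matching id `nmdc:stdy-ABCD-0000-0000` are all made up for illustration; only the regular expression and `gold:Gs0110115` come from the log above.

```python
import re

# Pattern copied verbatim from the validation error above.
STUDY_ID_PATTERN = re.compile(r"^nmdc:stdy-[A-Z]{4}-[0-9]{4}-[0-9]{4}$")


def failing_mentions(biosamples, slot="part_of"):
    """Return (location, value) pairs for mentioned ids that miss the pattern.

    Hypothetical helper; the location string only imitates the JSON-path
    style used in the error message.
    """
    failures = []
    for i, sample in enumerate(biosamples):
        for j, value in enumerate(sample.get(slot, [])):
            if not STUDY_ID_PATTERN.match(value):
                failures.append((f"$.biosample_set[{i}].{slot}[{j}]", value))
    return failures


# Hypothetical records: the first mentions the GOLD id from the log,
# the second mentions an invented id in the shape the pattern expects.
biosample_set = [
    {"id": "nmdc:example-1", "part_of": ["gold:Gs0110115"]},
    {"id": "nmdc:example-2", "part_of": ["nmdc:stdy-ABCD-0000-0000"]},
]

print(failing_mentions(biosample_set))
```

The point is that the new constraint applies to the value of a referring slot (`part_of`), not to the Study's own `id`, so example files that still mention GOLD ids fail until they are updated.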

aclum commented 8 months ago

@turbomam I don't understand your 'probably unintentional' comment about the gold slots. Do you mean that the values in id and gold_study_identifiers can be the same? This will be resolved with re-iding.

turbomam commented 8 months ago

> Do you mean that the values in id and gold_study_identifiers can be the same? This will be resolved with re-iding.

Yes, that's what I meant. I'll update those annotations.

turbomam commented 8 months ago

Study's slot_usage on structured_pattern.syntax for id:

where

so correct mentioned id pattern validation to

turbomam commented 8 months ago

aclum commented 8 months ago

@turbomam do you want to work off of this ticket or #1212? They are redundant as far as I can tell.

aclum commented 2 months ago

We need pattern constraints on was_informed_by, has_calibration, was_generated_by, has_input, and has_output.

turbomam commented 2 months ago

OK, I will come up with a uniform way of doing this.

sierra-moxon commented 2 months ago

@turbomam, this is planned for discussion at today's metadata call. The idea was a structured_pattern in slot_usage for ids (plus support for this construct in the LinkML JSON Schema generator) and range constraints. The other "fix" needed to make this the uniform approach is to change the OWL generator so that it uses the range constraint and not the structured_pattern.
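
As a rough illustration of what a uniform check on mentioned ids could look like outside the schema, the sketch below applies one regular expression per referring slot to a record. It is an assumption-laden stand-in, not the approach adopted in the schema or in LinkML. The slot names come from this thread; the placeholder pattern `NMDC_ID`, the helper `check_mentions`, and every id in the example record are hypothetical.

```python
import re

# Placeholder pattern (assumption): any CURIE with the nmdc prefix.
# The real per-class id syntax is defined in the schema, not here.
NMDC_ID = r"^nmdc:[A-Za-z0-9._-]+$"

# One compiled pattern per referring slot named in this thread.
MENTION_PATTERNS = {
    slot: re.compile(NMDC_ID)
    for slot in (
        "was_informed_by",
        "has_calibration",
        "was_generated_by",
        "has_input",
        "has_output",
    )
}


def check_mentions(record):
    """Return (slot, value) pairs whose mentioned id misses the slot's pattern."""
    problems = []
    for slot, pattern in MENTION_PATTERNS.items():
        if slot not in record:
            continue
        values = record[slot]
        # Treat single-valued and multivalued slots the same way.
        if not isinstance(values, list):
            values = [values]
        for value in values:
            if not pattern.match(str(value)):
                problems.append((slot, value))
    return problems


# Hypothetical record with two mentions that are not nmdc-prefixed ids.
record = {
    "id": "nmdc:example-activity",
    "was_informed_by": "gold:Gp0000000",
    "has_input": ["nmdc:example-input"],
    "has_output": ["nmdc:example-output", "not-a-curie"],
}

print(check_mentions(record))
```

Keeping the slot-to-pattern mapping in one table is what makes the check uniform here; in the schema itself the equivalent information would live in slot_usage, as described above.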

aclum commented 1 month ago

This was resolved by https://github.com/microbiomedata/nmdc-schema/pull/1994