nmdc:wfmp-11-emfy6143.1 passes validation but fails conversion to RDF due to input pattern

turbomam commented 2 months ago

I ran make squeaky-clean all test make-rdf in berkeley-schema-fy24. I haven't done that in a while, and I added some collections that I may never have run through make-rdf before.

poetry run linkml-validate \
    --schema nmdc_schema/nmdc_materialized_patterns.yaml local/mongo_as_nmdc_database_rdf_safe.yaml

passes, but

poetry run linkml-convert \
    --output local/mongo_as_nmdc_database.ttl \
    --schema nmdc_schema/nmdc_materialized_patterns.yaml local/mongo_as_nmdc_database_rdf_safe.yaml

emits

Failed validating 'pattern' in schema[6]['properties']['has_input']['items']:
    {'pattern': '^(nmdc):(bsm|procsm)-([0-9][a-z]{0,6}[0-9])-([A-Za-z0-9]{1,})$',
     'type': 'string'}

On instance['has_input'][0]:
    'nmdc:dobj-11-agsd2f41'

corresponding to this fragment:

- id: nmdc:wfmp-11-emfy6143.1
  name: Metaproteomics Analysis Activity for nmdc:wfmp-11-emfy6143.1
  started_at_time: '2024-08-14T00:07:16+00:00'
  ended_at_time: '2024-08-14T04:37:30+00:00'
  was_informed_by: nmdc:omprc-11-5svnja50
  execution_resource: EMSL
  git_url: https://github.com/microbiomedata/metaPro/releases/tag/v1.2.1
  has_input:
  - nmdc:dobj-11-agsd2f41
  - nmdc:dobj-11-2f3gzn94
  - nmdc:dobj-11-8yvaz057
  - nmdc:dobj-11-h9637w90
  - nmdc:dobj-11-hfx93f93
  - nmdc:dobj-11-sprrem27
  has_output:
  - nmdc:dobj-11-sx7cyr58
  - nmdc:dobj-11-p2c98g23
  - nmdc:dobj-11-gmv0d626
  - nmdc:dobj-11-hfjbht29
  type: nmdc:MetaproteomicsAnalysis
  version: v1.2.1

Traceback

  File "/home/mark/.cache/pypoetry/virtualenvs/nmdc-schema-gXr5ogK9-py3.10/bin/linkml-convert", line 8, in <module>
    sys.exit(cli())
  File "/home/mark/.cache/pypoetry/virtualenvs/nmdc-schema-gXr5ogK9-py3.10/lib/python3.10/site-packages/click/core.py", line 1157, in __call__
    return self.main(*args, **kwargs)
  File "/home/mark/.cache/pypoetry/virtualenvs/nmdc-schema-gXr5ogK9-py3.10/lib/python3.10/site-packages/click/core.py", line 1078, in main
    rv = self.invoke(ctx)
  File "/home/mark/.cache/pypoetry/virtualenvs/nmdc-schema-gXr5ogK9-py3.10/lib/python3.10/site-packages/click/core.py", line 1434, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/home/mark/.cache/pypoetry/virtualenvs/nmdc-schema-gXr5ogK9-py3.10/lib/python3.10/site-packages/click/core.py", line 783, in invoke
    return __callback(*args, **kwargs)
  File "/home/mark/.cache/pypoetry/virtualenvs/nmdc-schema-gXr5ogK9-py3.10/lib/python3.10/site-packages/linkml/utils/converter.py", line 153, in cli
    validation.validate_object(obj, schema)
  File "/home/mark/.cache/pypoetry/virtualenvs/nmdc-schema-gXr5ogK9-py3.10/lib/python3.10/site-packages/linkml/utils/validation.py", line 46, in validate_object
    return jsonschema.validate(
  File "/home/mark/.cache/pypoetry/virtualenvs/nmdc-schema-gXr5ogK9-py3.10/lib/python3.10/site-packages/jsonschema/validators.py", line 1332, in validate
    raise error
jsonschema.exceptions.ValidationError: 'nmdc:dobj-11-agsd2f41' does not match '^(nmdc):(bsm|procsm)-([0-9][a-z]{0,6}[0-9])-([A-Za-z0-9]{1,})$'

turbomam commented 2 months ago

The structured_pattern in https://microbiomedata.github.io/berkeley-schema-fy24/MetaproteomicsAnalysis/#induced

seems to imply that '{id_nmdc_prefix}:(dobj)-{id_shoulder}-{id_blade}$' is expected

turbomam commented 2 months ago

now that we are aggregating all workflows into the workflow_execution_set, I can't exclude MetaproteomicsAnalysis instances!

aclum commented 2 months ago

does rdf not use the structured_pattern?

turbomam commented 2 months ago

does rdf not use the structured_pattern?

Good question. For the record, nothing uses structured_pattern directly at this point in time. To benefit from a structured_pattern, one has to re-generate the schema with something like gen-linkml --materialize-patterns, which is the kind of process that generates nmdc_schema/nmdc_materialized_patterns.yaml. It's the resulting materialized patterns that are actually used.

I still think that this problem may be due to LinkML tooling rather than the nmdc-schema, though.
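
For readers unfamiliar with materialization, here is a minimal sketch of the idea in plain Python. It is not the LinkML implementation: the materialize helper is hypothetical, and the settings values are assumptions inferred by comparing the structured_pattern quoted above with the materialized pattern in the error message.

```python
import re

# Assumed settings, inferred from the materialized pattern in the error
# message; the real values live in the schema's settings block.
settings = {
    "id_nmdc_prefix": "^(nmdc)",
    "id_shoulder": "([0-9][a-z]{0,6}[0-9])",
    "id_blade": "([A-Za-z0-9]{1,})",
}


def materialize(structured_pattern: str, settings: dict[str, str]) -> str:
    """Hypothetical helper: replace each {name} placeholder with its setting."""
    return re.sub(r"\{(\w+)\}", lambda m: settings[m.group(1)], structured_pattern)


# Pattern that MetaproteomicsAnalysis.has_input appears to expect.
dobj_pattern = materialize("{id_nmdc_prefix}:(dobj)-{id_shoulder}-{id_blade}$", settings)
# Pattern that the error message actually reported.
bsm_pattern = materialize("{id_nmdc_prefix}:(bsm|procsm)-{id_shoulder}-{id_blade}$", settings)

value = "nmdc:dobj-11-agsd2f41"
print(bool(re.match(dobj_pattern, value)))  # the id fits the dobj pattern
print(bool(re.match(bsm_pattern, value)))   # but not the bsm|procsm pattern
```

Under these assumptions the failing id is valid for the pattern the class documents, which is consistent with the reported pattern belonging to a different class.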

aclum commented 2 months ago

If it helps debug, the only classes that have a pattern matching (bsm|procsm) are DataGeneration and its subclasses. Not sure how or where it is confusing a WorkflowExecution subclass for a DataGeneration subclass.

turbomam commented 2 months ago

Thanks @aclum ! @pkalita-lbl and I just worked through this issue. It turns out that the reported root cause error comes from a JsonSchema heuristic that tried to guess the most relevant error. In this case it is just wrong.

It appears that the real error is that the has_peptide_quantifications portion of nmdc:wfmp-11-emfy6143.1 is being converted from a list to a dict before the converter's validator is run.
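
The situation can be reproduced in miniature with the jsonschema package. This is a sketch with a made-up two-branch schema, not the generated nmdc JSON Schema: branch 0 stands in for a DataGeneration-like class and branch 1 for a MetaproteomicsAnalysis-like class. Listing every sub-error under the anyOf, instead of relying on the single error that best_match selects, shows both the misleading pattern failure and the type failure on has_peptide_quantifications.

```python
from jsonschema import Draft7Validator
from jsonschema.exceptions import best_match

# Toy schema (an assumption for illustration only).
schema = {
    "anyOf": [
        {   # branch 0: expects biosample-like inputs
            "type": "object",
            "properties": {
                "has_input": {
                    "type": "array",
                    "items": {"type": "string", "pattern": "^nmdc:(bsm|procsm)-"},
                },
            },
        },
        {   # branch 1: expects data-object inputs and a list of quantifications
            "type": "object",
            "properties": {
                "has_input": {
                    "type": "array",
                    "items": {"type": "string", "pattern": "^nmdc:dobj-"},
                },
                "has_peptide_quantifications": {"type": "array"},
            },
        },
    ]
}

# The list has already been turned into a dict, as described above.
instance = {
    "has_input": ["nmdc:dobj-11-agsd2f41"],
    "has_peptide_quantifications": {},
}

validator = Draft7Validator(schema)
top_level = list(validator.iter_errors(instance))

# jsonschema.validate raises only the error that best_match picks.
guess = best_match(validator.iter_errors(instance))
print("heuristic guess:", guess.message)

# The full picture is in the context of the anyOf error.
for sub_error in top_level[0].context:
    print(list(sub_error.absolute_schema_path), "->", sub_error.message)
```

Which sub-error the heuristic picks can vary, so the loop over context is the more reliable way to see why each branch was rejected.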

@cmungall has encouraged me to just run the converter in validation-free mode since I'm doing the conversion in a workflow, where the immediately preceding step is stand-alone validation.

@pkalita-lbl said that I could possibly create a minimal example that illustrates this case outside of the nmdc-schema. Then it might be easier for him to come up with a solution. I doubt that I will do that before the berkeley-schema-fy24 roll-out.

turbomam commented 2 months ago

There are some similarities to

https://github.com/linkml/linkml/issues/2146

aclum commented 2 months ago

Does this only happen when the list size is large? We have other instances where the structure is complex, like credit roles, and we don't run into this issue.

turbomam commented 2 months ago

Does this only happen when the list size is large

I don't think so. @pkalita-lbl noticed that MetaproteomicsAnalysis.has_peptide_quantifications is not inlined_as_list even though it is multivalued and its range is a class (PeptideQuantification) that doesn't have an identifier slot.

That's an illegal combination. I have a test for other cases like that but haven't created a PR yet. The only other case right now is Biosample.heavy_metals_meth.
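
A check for that combination could look roughly like the following. This is a sketch, not the test mentioned above: it works on a plain dict shaped like a LinkML schema, the find_uninlinable_slots helper is hypothetical, the toy class and slot definitions (including peptide_sequence) are assumptions, and it ignores inheritance, slot_usage and top-level slots for brevity.

```python
def find_uninlinable_slots(schema: dict) -> list[tuple[str, str]]:
    """Hypothetical helper: report (class, slot) pairs where the slot is
    multivalued, its range is a class without an identifier slot, and
    inlined_as_list is not set."""
    classes = schema.get("classes", {})

    def has_identifier(class_name: str) -> bool:
        attrs = classes[class_name].get("attributes", {})
        return any(attr.get("identifier") for attr in attrs.values())

    problems = []
    for class_name, class_def in classes.items():
        for slot_name, slot in class_def.get("attributes", {}).items():
            range_name = slot.get("range")
            if (
                slot.get("multivalued")
                and range_name in classes          # range is a class, not a type
                and not has_identifier(range_name)
                and not slot.get("inlined_as_list")
            ):
                problems.append((class_name, slot_name))
    return problems


# Toy schema fragment (assumed shape, for illustration only).
toy_schema = {
    "classes": {
        "MetaproteomicsAnalysis": {
            "attributes": {
                "id": {"identifier": True},
                "has_input": {"multivalued": True, "range": "DataObject"},
                "has_peptide_quantifications": {
                    "multivalued": True,
                    "range": "PeptideQuantification",
                },
            }
        },
        "PeptideQuantification": {"attributes": {"peptide_sequence": {}}},
        "DataObject": {"attributes": {"id": {"identifier": True}}},
    }
}

problems = find_uninlinable_slots(toy_schema)
print(problems)
```

Setting inlined_as_list to true on has_peptide_quantifications in the toy schema makes the helper return an empty list.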

aclum commented 2 months ago

is there a test we can run w/linkml as part of the build process to catch this illegal combo? I assume the test you describe above is an ad hoc check.

turbomam commented 2 months ago

@cmungall , @pkalita-lbl and I discussed this very briefly today. It could become a check in the linter, or we could just work towards more useful error messages. That might be tough in this case, because there are a lot of error messages and a heuristic is being used to guess the best one.

I just added a Python test and will discuss it in the metadata/schema meeting tomorrow.

aclum commented 2 months ago

An interim linter test is my preference, since more useful error messages are a longer-term effort, or at least seem that way from the misleading-error-message tickets that I've filed.

nmdc:wfmp-11-emfy6143.1 passes validation but fails conversion to RDF due to input pattern #892