microbiomedata / nmdc-schema

National Microbiome Data Collaborative (NMDC) unified data model
https://microbiomedata.github.io/nmdc-schema/
Creative Commons Zero v1.0 Universal
27 stars 8 forks source link

incorrect range for was_informed_by #1846

Closed aclum closed 4 months ago

aclum commented 7 months ago

WorkflowExecutionActivity subclasses populated was_informed_by with OmicsProcessing values. @picowatt pointed out that the specified range for was_informed_by is Activity and OmicsProcessing is a PlannedProcess, rooted in NamedThing, not under Activity. I'm not sure if this is from re-rooting OmicsProcessing. This is not causing any errors in production but the documentation is inaccurate and confusing and I can't recall where we are at with linkml enforcing range. was_informed_by doesn't not currently have a pattern constraint on it which is why we haven't been seeing any issues.

aclum commented 6 months ago

@mbthornton-lbl notes we have the same issue with was_generated_by.

@cmungall @turbomam where are we at with having linkml enforce a range.

cmungall commented 6 months ago

Are you talking about applying id pattern constraints for foreign keys (full referential or integrity would require more coupling with mongo). I believe it is supported.

aclum commented 6 months ago

I’m talking about enforcing a class constraint via range without specifying structured_pattern.syntax

On Tuesday, April 2, 2024, Chris Mungall @.***> wrote:

Are you talking about applying id pattern constraints for foreign keys (full referential or integrity would require more coupling with mongo). I believe it is supported.

— Reply to this email directly, view it on GitHub https://github.com/microbiomedata/nmdc-schema/issues/1846#issuecomment-2032687462, or unsubscribe https://github.com/notifications/unsubscribe-auth/AB6RD36QWPOHFJJ4UQNNZNTY3LVUJAVCNFSM6AAAAABEW276RCVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDAMZSGY4DONBWGI . You are receiving this because you authored the thread.Message ID: @.***>

sierra-moxon commented 6 months ago

newbie question: @aclum - can you point me to the code that is returning the "type" of object in the was_informed_by slot on WorkflowExecutionActivity? Said another way, how do you know that type(WorkflowExecutionActivity.get("was_informed_by") == "nmdc:OmicsProcessing" -- this might help me figure out why the validation code can't recognize the error here.

e.g. can you tell by looking at the id of the data an incorrect object is being asserted here?

sierra-moxon commented 6 months ago

I'm also assuming this is operating against the main branch of NMDC schema, not the Berkeley schema?

aclum commented 6 months ago

Yes, the main branch of the schema. Mark updated the berkeley schema to enforce type but that isn't in main yet so for now you could look at the type code for the identifier (nmdc:omprc), look for that typecode in the structured_pattern.syntax in nmdc.yaml (ie https://github.com/microbiomedata/nmdc-schema/blob/f88ec8462b90102ebeddbcdc8903959883e175d0/src/schema/nmdc.yaml#L1695) look at is_a for that class. OmicsProcessing is_a PlannedProcess, from core.yaml PlannedProcess is a NamedThing. NameThing is a top level class. https://github.com/microbiomedata/nmdc-schema/blob/f88ec8462b90102ebeddbcdc8903959883e175d0/src/schema/core.yaml#L79

Alternatively you could use the graphdb version of the schema, if you search by ID as the 'subject' most of those documents have nmdc:type and nmdc:designated_class.

sierra-moxon commented 6 months ago

Thanks, yes, I found the schema requirements - I was curious how @picowatt determined that the data was wrong. Was it because the OmicsProcessing identifier had a specific nomeclature different than a typical Activity identifier nomeclature?

aclum commented 6 months ago

Yes, nmdc:omprc, they typecode for OmicsProcessing is different than the activity identifiers which are typically nmdc:wf*

sierra-moxon commented 6 months ago

Updates from my debugging:

The linkml JSONSchema generator currently recognizes linkml:pattern not linkml:structured_pattern metamodel syntax. linkml:pattern from the nmdc:id slot is being applied to all classes that use that slot. nmdc:was_informed_by has no pattern, but does have: range: Activity. I think the ask here is that the linkml JSONSchema generator:

Though kind of complex, I can at least see a path for support here. It's worth thinking about the edge cases and @cmungall might've already done so. (e.g. do we always only traverse the range itself? what if a range parent or grandparent has slot_usage for an identifier slot and not the range class itself? )

As a workaround, if I change nmdc:Activity to use a linkml:pattern in linkml:slot_usage instead of a linkml:structured_pattern (and unfortunately, we need to copy the regex syntax instead of the nice structured_pattern syntax) the validation works as expected, even with the slot_usage declared in the parent nmdc:Activity class.

Activity:
  description: Something that occurs over a period of time and acts upon or with entities; 
    it may include consuming, processing, transforming, modifying, relocating, using, or generating entities.
  slots:
    - id
    - was_informed_by
  mappings:
    - prov:Activity
  slot_usage:
    id:
      required: true
      pattern: '^nmdc:act-[a-zA-Z0-9_][a-zA-Z0-9_\-\/\.,]*$'
    was_informed_by:
      required: false
      pattern: '^nmdc:act-[a-zA-Z0-9_][a-zA-Z0-9_\-\/\.,]*$'

So I guess there are two (or possibly three) action items here?

1) support structured patterns in JSONSchema generator - there is already an open ticket and we've done some work to support structured patterns already, but not quite there yet: https://github.com/linkml/linkml/issues/1557, https://github.com/linkml/linkml/blob/main/tests/test_compliance/test_pattern_compliance.py

2) decide if we want to support the inference of rules from slot_usage of parent (grandparent) identifier slots. I'll take this to a linkml dev call.

3) decide if we want to use the pattern workaround in NMDC in the meantime.

The experimental repo that helped me debug this: https://github.com/sierra-moxon/nmdc-patterns

%  git clone https://github.com/sierra-moxon/nmdc-patterns # or fork
%  poetry install
%  cd nmdc-patterns
%  make gen-project
%  poetry run linkml-validate -f yaml -s src/nmdc_patterns/schema/nmdc_patterns.yaml -C WorkflowExecutionActivityCollection examples/Activity-001.yaml
turbomam commented 5 months ago

I think the general issue here is: can we force the nmdc-schema, using core LinkML tools, to require that mentioned ids follow the pattern specified in the definition of the id for the mentioned id class?

And at what cost?

I am going to create a test repo to check some of my assumptions too. Hopefully we can make a decision on this by next week's Schema/Metadata meeting.

sierra-moxon commented 5 months ago

My next step is to analyze using gen-linkml to materialize structured patterns.

aclum commented 5 months ago

@shreddd Tagging you in this issue as it pertains to the database discussion. This would be easy in a relational database with a foreign key constraint. In practice for re-iding this has meant we've had to write a bunch of ad hoc mongo scripts to try and cover the gaps in using linkmls existing validation.

aclum commented 5 months ago

@sierra-moxon @turbomam is there an update on this? We've had to implement ad-hoc code here to make sure the values for was_generated_by follow napa style identifiers. Is the only short term solution to set a pattern with slot usage?

aclum commented 5 months ago

@turbomam or @sierra-moxon do you have time to fix the ranges for was_informed_by and was_generated by separate if figuring out how to validate this. Code that @PeopleMakeCulture is prototyping checks the type and some of the data objects are showing errors b/c was_generated_by values are not of type Activity.

sierra-moxon commented 5 months ago

spoke about this at the LinkML dev call today. we propose a two-pronged approach here: 1) fix the "structured_pattern" support in JSONSchema generator 2) add flag to OWL generator to ignore structured patterns when a range constraint consisting of another class is specified.

Then in the NMDC schema, we specify structured patterns and class-range constraints to fully reconcile this.