Closed aclum closed 4 months ago
@mbthornton-lbl notes we have the same issue with was_generated_by.
@cmungall @turbomam where are we at with having linkml enforce a range.
Are you talking about applying id pattern constraints for foreign keys (full referential or integrity would require more coupling with mongo). I believe it is supported.
I’m talking about enforcing a class constraint via range without specifying
structured_pattern.syntax
On Tuesday, April 2, 2024, Chris Mungall @.***> wrote:
Are you talking about applying id pattern constraints for foreign keys (full referential or integrity would require more coupling with mongo). I believe it is supported.
— Reply to this email directly, view it on GitHub https://github.com/microbiomedata/nmdc-schema/issues/1846#issuecomment-2032687462, or unsubscribe https://github.com/notifications/unsubscribe-auth/AB6RD36QWPOHFJJ4UQNNZNTY3LVUJAVCNFSM6AAAAABEW276RCVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDAMZSGY4DONBWGI . You are receiving this because you authored the thread.Message ID: @.***>
newbie question: @aclum - can you point me to the code that is returning the "type" of object in the was_informed_by
slot on WorkflowExecutionActivity? Said another way, how do you know that type(WorkflowExecutionActivity.get("was_informed_by") == "nmdc:OmicsProcessing"
-- this might help me figure out why the validation code can't recognize the error here.
e.g. can you tell by looking at the id of the data an incorrect object is being asserted here?
I'm also assuming this is operating against the main branch of NMDC schema, not the Berkeley schema?
Yes, the main branch of the schema. Mark updated the berkeley schema to enforce type but that isn't in main yet so for now you could look at the type code for the identifier (nmdc:omprc), look for that typecode in the structured_pattern.syntax in nmdc.yaml (ie https://github.com/microbiomedata/nmdc-schema/blob/f88ec8462b90102ebeddbcdc8903959883e175d0/src/schema/nmdc.yaml#L1695) look at is_a for that class. OmicsProcessing is_a PlannedProcess, from core.yaml PlannedProcess is a NamedThing. NameThing is a top level class. https://github.com/microbiomedata/nmdc-schema/blob/f88ec8462b90102ebeddbcdc8903959883e175d0/src/schema/core.yaml#L79
Alternatively you could use the graphdb version of the schema, if you search by ID as the 'subject' most of those documents have nmdc:type and nmdc:designated_class.
Thanks, yes, I found the schema requirements - I was curious how @picowatt determined that the data was wrong. Was it because the OmicsProcessing identifier had a specific nomeclature different than a typical Activity identifier nomeclature?
Yes, nmdc:omprc
, they typecode for OmicsProcessing is different than the activity identifiers which are typically nmdc:wf*
Updates from my debugging:
The linkml JSONSchema generator currently recognizes linkml:pattern
not linkml:structured_pattern
metamodel syntax. linkml:pattern
from the nmdc:id
slot is being applied to all classes that use that slot. nmdc:was_informed_by
has no pattern, but does have: range: Activity
. I think the ask here is that the linkml JSONSchema generator:
For a ClassB.slot_a with a range constraint of ClassC in the model, apply to ClassB.slot_a the rules/patterns/slot_usages of the linkml:identifier
slot on ClassC and validate accordingly. Is this right?
In our example, because nmdc:Activity
has a linkml:identifier
slot, nmdc:id
, and nmdc:was_informed_by
declares its range to be nmdc:Activity
, the JSONSchema generator would apply the slot_usage rules for nmdc:Activity
.nmdc:id
to the nmdc:was_informed_by
slot.
Though kind of complex, I can at least see a path for support here. It's worth thinking about the edge cases and @cmungall might've already done so. (e.g. do we always only traverse the range itself? what if a range parent or grandparent has slot_usage for an identifier slot and not the range class itself? )
As a workaround, if I change nmdc:Activity
to use a linkml:pattern
in linkml:slot_usage
instead of a linkml:structured_pattern
(and unfortunately, we need to copy the regex syntax instead of the nice structured_pattern syntax) the validation works as expected, even with the slot_usage declared in the parent nmdc:Activity
class.
Activity:
description: Something that occurs over a period of time and acts upon or with entities;
it may include consuming, processing, transforming, modifying, relocating, using, or generating entities.
slots:
- id
- was_informed_by
mappings:
- prov:Activity
slot_usage:
id:
required: true
pattern: '^nmdc:act-[a-zA-Z0-9_][a-zA-Z0-9_\-\/\.,]*$'
was_informed_by:
required: false
pattern: '^nmdc:act-[a-zA-Z0-9_][a-zA-Z0-9_\-\/\.,]*$'
So I guess there are two (or possibly three) action items here?
1) support structured patterns in JSONSchema generator - there is already an open ticket and we've done some work to support structured patterns already, but not quite there yet: https://github.com/linkml/linkml/issues/1557, https://github.com/linkml/linkml/blob/main/tests/test_compliance/test_pattern_compliance.py
2) decide if we want to support the inference of rules from slot_usage of parent (grandparent) identifier slots. I'll take this to a linkml dev call.
3) decide if we want to use the pattern
workaround in NMDC in the meantime.
The experimental repo that helped me debug this: https://github.com/sierra-moxon/nmdc-patterns
% git clone https://github.com/sierra-moxon/nmdc-patterns # or fork
% poetry install
% cd nmdc-patterns
% make gen-project
% poetry run linkml-validate -f yaml -s src/nmdc_patterns/schema/nmdc_patterns.yaml -C WorkflowExecutionActivityCollection examples/Activity-001.yaml
I think the general issue here is: can we force the nmdc-schema, using core LinkML tools, to require that mentioned ids follow the pattern specified in the definition of the id for the mentioned id class?
And at what cost?
I am going to create a test repo to check some of my assumptions too. Hopefully we can make a decision on this by next week's Schema/Metadata meeting.
My next step is to analyze using gen-linkml to materialize structured patterns.
@shreddd Tagging you in this issue as it pertains to the database discussion. This would be easy in a relational database with a foreign key constraint. In practice for re-iding this has meant we've had to write a bunch of ad hoc mongo scripts to try and cover the gaps in using linkmls existing validation.
@sierra-moxon @turbomam is there an update on this? We've had to implement ad-hoc code here to make sure the values for was_generated_by follow napa style identifiers. Is the only short term solution to set a pattern with slot usage?
@turbomam or @sierra-moxon do you have time to fix the ranges for was_informed_by and was_generated by separate if figuring out how to validate this. Code that @PeopleMakeCulture is prototyping checks the type and some of the data objects are showing errors b/c was_generated_by values are not of type Activity.
spoke about this at the LinkML dev call today. we propose a two-pronged approach here: 1) fix the "structured_pattern" support in JSONSchema generator 2) add flag to OWL generator to ignore structured patterns when a range constraint consisting of another class is specified.
Then in the NMDC schema, we specify structured patterns and class-range constraints to fully reconcile this.
WorkflowExecutionActivity subclasses populated was_informed_by with OmicsProcessing values. @picowatt pointed out that the specified range for was_informed_by is Activity and OmicsProcessing is a PlannedProcess, rooted in NamedThing, not under Activity. I'm not sure if this is from re-rooting OmicsProcessing. This is not causing any errors in production but the documentation is inaccurate and confusing and I can't recall where we are at with linkml enforcing range. was_informed_by doesn't not currently have a pattern constraint on it which is why we haven't been seeing any issues.