Closed by aclum 5 days ago
I'm running make make-rdf in nmdc-schema now, with the default more-tolerant schema. I'll try the legacy-intolerant schema after that.
make make-rdf completed. Now just to be extra obsessive, I'm running this:
linkml-validate \
--schema nmdc_schema/nmdc_schema_accepting_legacy_ids.yaml \
--target-class Database local/mongo_as_unvalidated_nmdc_database.yaml
No issues found
linkml-validate \
--schema nmdc_schema/nmdc_materialized_patterns.yaml \
--target-class Database local/mongo_as_unvalidated_nmdc_database.yaml > prod-mondo-vs-materialized-structured-patterns.txt
Get a 10% random sample of the lines in the validation log
awk 'BEGIN {srand()} !/^$/ { if (rand() <= .1) print $0}' prod-mondo-vs-materialized-structured-patterns.txt > prod-mondo-vs-materialized-structured-patterns_10_pct.txt
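For anyone who wants a repeatable sample, here is a rough Python equivalent of the awk one-liner above. It is only a sketch: the function name and the seed parameter are my own additions, not part of any NMDC tooling. awk's bare srand() seeds from the time of day, so each run of the one-liner gives a different sample; passing a fixed seed here gives the same sample every time.

```python
import random


def sample_lines(lines, fraction=0.1, seed=None):
    """Keep each non-empty line independently with probability `fraction`.

    Mirrors the awk filter: empty lines are skipped, and a line is kept
    when a uniform random draw is <= fraction.
    """
    rng = random.Random(seed)  # a fixed seed makes the sample reproducible
    kept = []
    for line in lines:
        if line.rstrip("\n") == "":
            continue  # same as awk's !/^$/
        if rng.random() <= fraction:
            kept.append(line)
    return kept


if __name__ == "__main__":
    # File name taken from the commands above.
    with open("prod-mondo-vs-materialized-structured-patterns.txt") as handle:
        for line in sample_lines(handle, fraction=0.1, seed=42):
            print(line, end="")
```

Like the awk version, this keeps roughly 10% of the lines rather than exactly 10%.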
Asked Claude:
summarize the validation failures in this attached text:
Here is a summary of the validation failures in the provided 10% random sample: The validation failures are all related to IDs not matching the expected regular expression patterns. Specifically:
Many data object IDs in the /data_object_set do not match the pattern '^(nmdc):dobj-([0-9][a-z]{0,6}[0-9])-([A-Za-z0-9]{1,})$'. The IDs contain things like 'emsl:output_NNNNNN' or 'nmdc:NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN' instead.
Many workflow execution IDs referenced in /data_object_set/was_generated_by do not match the pattern '^(nmdc):(wfmag|wfmb|wfmgan|wfmgas|wfmsa|wfmp|wfmt|wfmtan|wfmtas|wfnom|wfrbt|wfrqc|wf|omprc)-([0-9][a-z]{0,6}[0-9])-([A-Za-z0-9]{1,})(.[0-9]{1,})$'. The IDs contain things like 'nmdc:wfnom-13-NNNNNNN' instead.
Many IDs in the /nom_analysis_activity_set do not match the pattern '^(nmdc):wfnom-([0-9][a-z]{0,6}[0-9])-([A-Za-z0-9]{1,})(.[0-9]{1,})$'. The IDs contain things like 'nmdc:wfnom-11-NNNNNNN' instead.
A few IDs in the /read_based_taxonomy_analysis_activity_set do not match the pattern '^(nmdc):wfrbt-([0-9][a-z]{0,6}[0-9])-([A-Za-z0-9]{1,})(.[0-9]{1,})$'.
A few IDs in the /read_qc_analysis_activity_set do not match the pattern '^(nmdc):wfrqc-([0-9][a-z]{0,6}[0-9])-([A-Za-z0-9]{1,})(.[0-9]{1,})$'.
In summary, many of the IDs across the different data object sets do not conform to the expected NMDC ID patterns, containing IDs from other systems like EMSL or using inconsistent NMDC ID formats. Standardizing the ID formats would resolve these validation failures.
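To spot-check that summary without rerunning the whole validation, something like the following could be used. It is a minimal sketch: the regular expression is copied from the data object pattern quoted above, while the function name and the example IDs are made up for illustration and are not real records.

```python
import re

# Data object ID pattern as quoted in the validation summary above.
DOBJ_PATTERN = re.compile(
    r"^(nmdc):dobj-([0-9][a-z]{0,6}[0-9])-([A-Za-z0-9]{1,})$"
)


def failing_ids(ids, pattern=DOBJ_PATTERN):
    """Return the IDs that do not match the expected pattern."""
    return [candidate for candidate in ids if not pattern.match(candidate)]


if __name__ == "__main__":
    # Hypothetical IDs shaped like the ones described in the summary.
    examples = [
        "nmdc:dobj-11-abc123",                    # new-style ID
        "emsl:output_123456",                     # legacy EMSL-style ID
        "nmdc:0123456789abcdef0123456789abcdef",  # legacy hash-style ID
    ]
    for bad in failing_ids(examples):
        print(bad)
```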
@turbomam anything left to do here? The only remaining issues should be ID versions for WorkflowExecutionActivity subclasses. There are no more legacy ID patterns (i.e., non-nmdc prefixes), the legacy version of the materialized pattern JSON schema file is no longer part of the release, and @eecavanna has removed the nmdc-runtime dependencies on the accepting-legacy-IDs version of the schema.
I just reran the validation on a fresh dump earlier today. I think I agree with your statement in general. Or, I would say that I see ids that require a version suffix, but which lack that version suffix or have two or more version suffixes.
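A rough way to sort workflow execution IDs into "no version suffix", "exactly one", and "two or more" is sketched below. Assumptions: the prefix list is copied from the pattern quoted in the validation summary, the version suffix is a literal dot followed by digits (the quoted pattern shows an unescaped dot, which I am guessing lost its backslash somewhere), and the function name and example IDs are hypothetical.

```python
import re

# Workflow type prefixes copied from the pattern quoted earlier in the thread.
WF_TYPES = (
    "wfmag|wfmb|wfmgan|wfmgas|wfmsa|wfmp|wfmt|wfmtan|wfmtas"
    "|wfnom|wfrbt|wfrqc|wf|omprc"
)

# Deliberately looser than the schema pattern: it accepts zero or more
# ".<digits>" suffixes so that they can be counted afterwards.
# Assumption: the suffix separator is a literal dot.
LOOSE_WF_ID = re.compile(
    rf"^nmdc:({WF_TYPES})-([0-9][a-z]{{0,6}}[0-9])-([A-Za-z0-9]+)((?:\.[0-9]+)*)$"
)


def count_version_suffixes(wf_id):
    """Return the number of version suffixes, or None if the ID is malformed."""
    match = LOOSE_WF_ID.match(wf_id)
    if match is None:
        return None
    return match.group(4).count(".")


if __name__ == "__main__":
    # Hypothetical IDs, not real records.
    for wf_id in ("nmdc:wfnom-13-abc123", "nmdc:wfnom-13-abc123.1", "nmdc:wfnom-13-abc123.1.2"):
        print(wf_id, count_version_suffixes(wf_id))
```

Anything that returns a count other than 1 would fail the versioned pattern.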
A related issue is that some structured_patterns in the nmdc-schema appear to start with a ^ and others don't. I don't think the ^ is ever necessary, because it is included in the abstraction of the prefix. So that means we have some structured_patterns that materialize to something starting with ^^.
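To illustrate how the doubled caret arises, here is a toy version of pattern materialization. It is not LinkML's actual implementation, just plain string substitution. The setting names and values are assumptions modeled on the patterns quoted earlier in this thread, with the prefix setting already carrying its own ^.

```python
import re

# Assumed settings; the prefix abstraction already includes the ^ anchor.
SETTINGS = {
    "id_nmdc_prefix": "^(nmdc)",
    "id_shoulder": "([0-9][a-z]{0,6}[0-9])",
    "id_blade": "([A-Za-z0-9]{1,})",
}


def materialize(syntax, settings=SETTINGS):
    """Substitute each {name} placeholder with its value (toy version)."""
    for name, value in settings.items():
        syntax = syntax.replace("{" + name + "}", value)
    return syntax


if __name__ == "__main__":
    with_caret = materialize("^{id_nmdc_prefix}:dobj-{id_shoulder}-{id_blade}$")
    without_caret = materialize("{id_nmdc_prefix}:dobj-{id_shoulder}-{id_blade}$")
    print(with_caret)     # starts with ^^
    print(without_caret)  # starts with a single ^

    # Python's re module accepts the doubled anchor and still matches,
    # but other regex engines may not be as forgiving.
    print(bool(re.match(with_caret, "nmdc:dobj-11-abc123")))
```

Dropping the leading ^ from the structured_pattern syntax gives the same match behavior with a cleaner materialized pattern.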
Also, I still have to remove two sections of code that repair broken CURIes and validate both the YAML dump and the RDF conversion output.
Can the outstanding issues you mention be addressed this sprint?
Should this issue be broken up into smaller "outstanding issues" tickets? @aclum @turbomam
closing this in favor of more granular issues based on the outstanding tasks @turbomam identified.
After re-IDing is complete, we can remove the extra logic to support legacy IDs. This is anything that generates ...accepting_legacy_ids...
Depends on: https://github.com/microbiomedata/issues/issues/532
There are runtime dependencies, so I'm moving this to after the June release so that Jing has some time to also update the runtime code.
Target release 2024.7