microbiomedata / nmdc-schema

National Microbiome Data Collaborative (NMDC) unified data model

https://microbiomedata.github.io/nmdc-schema/

Creative Commons Zero v1.0 Universal

Remove extra logic to handle legacy identifiers #1869

Closed aclum closed 5 days ago

aclum commented 3 months ago

After re-IDing is complete, we can remove the extra logic that supports legacy IDs. This is anything that generates ...accepting_legacy_ids...

Depends on: https://github.com/microbiomedata/issues/issues/532

There are runtime dependencies, so moving this to after the June release so that Jing has some time to also update the runtime code.

Target release 2024.7

turbomam commented 3 weeks ago

I'm running make make-rdf in nmdc-schema now, with the default more-tolerant schema. I'll try the legacy-intolerant schema after that.

turbomam commented 3 weeks ago

make make-rdf completed. Now just to be extra obsessive, I'm running this:

linkml-validate \
    --schema nmdc_schema/nmdc_schema_accepting_legacy_ids.yaml \
    --target-class Database local/mongo_as_unvalidated_nmdc_database.yaml

No issues found

turbomam commented 3 weeks ago

linkml-validate \
    --schema nmdc_schema/nmdc_materialized_patterns.yaml \
    --target-class Database local/mongo_as_unvalidated_nmdc_database.yaml > prod-mondo-vs-materialized-structured-patterns.txt

turbomam commented 3 weeks ago

Get a 10% random sample of the lines in the validation log

awk 'BEGIN {srand()} !/^$/ { if (rand() <= .1) print $0}' prod-mondo-vs-materialized-structured-patterns.txt > prod-mondo-vs-materialized-structured-patterns_10_pct.txt
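
One caveat with the awk one-liner: srand() with no argument seeds from the time of day, so repeated runs generally draw different samples. If a repeatable sample is wanted, a minimal Python sketch along the same lines could take an explicit seed. The function name, the seed value and the stand-in log lines below are illustrative assumptions, not part of the thread.

```python
import random


def sample_lines(lines, fraction=0.1, seed=None):
    """Keep each non-empty line with probability `fraction`.

    Mirrors the awk filter: empty lines are skipped, and every other line
    is kept when a uniform random draw is <= fraction.
    """
    rng = random.Random(seed)  # a fixed seed makes the sample repeatable
    return [
        line
        for line in lines
        if line.rstrip("\n") != "" and rng.random() <= fraction
    ]


# Stand-in for the validation log; the real input would be the lines of
# prod-mondo-vs-materialized-structured-patterns.txt.
log = [f"hypothetical validation message {i}" for i in range(1000)] + ["", ""]
sample = sample_lines(log, fraction=0.1, seed=42)
```

Calling sample_lines again with the same seed returns the same subset, which makes it easier to compare summaries of two validation runs.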

turbomam commented 3 weeks ago

Asked Claude:

summarize the validation failures in this attached text:

turbomam commented 3 weeks ago

Here is a summary of the validation failures in the provided 10% random sample: The validation failures are all related to IDs not matching the expected regular expression patterns. Specifically:

Many data object IDs in the /data_object_set do not match the pattern '^(nmdc):dobj-([0-9][a-z]{0,6}[0-9])-([A-Za-z0-9]{1,})$'. The IDs contain things like 'emsl:output_NNNNNN' or 'nmdc:NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN' instead.

Many workflow execution IDs referenced in /data_object_set/was_generated_by do not match the pattern '^(nmdc):(wfmag|wfmb|wfmgan|wfmgas|wfmsa|wfmp|wfmt|wfmtan|wfmtas|wfnom|wfrbt|wfrqc|wf|omprc)-([0-9][a-z]{0,6}[0-9])-([A-Za-z0-9]{1,})(.[0-9]{1,})$'. The IDs contain things like 'nmdc:wfnom-13-NNNNNNN' instead.

Many IDs in the /nom_analysis_activity_set do not match the pattern '^(nmdc):wfnom-([0-9][a-z]{0,6}[0-9])-([A-Za-z0-9]{1,})(.[0-9]{1,})$'. The IDs contain things like 'nmdc:wfnom-11-NNNNNNN' instead.

A few IDs in the /read_based_taxonomy_analysis_activity_set do not match the pattern '^(nmdc):wfrbt-([0-9][a-z]{0,6}[0-9])-([A-Za-z0-9]{1,})(.[0-9]{1,})$'.

A few IDs in the /read_qc_analysis_activity_set do not match the pattern '^(nmdc):wfrqc-([0-9][a-z]{0,6}[0-9])-([A-Za-z0-9]{1,})(.[0-9]{1,})$'.

In summary, many of the IDs across the different data object sets do not conform to the expected NMDC ID patterns, containing IDs from other systems like EMSL or using inconsistent NMDC ID formats. Standardizing the ID formats would resolve these validation failures.
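
To make the first finding concrete, here is a minimal sketch that applies the data object pattern quoted in the summary to a few IDs. The pattern is copied verbatim from the summary; the example IDs are made up for illustration and only imitate the shapes described above.

```python
import re

# Data object ID pattern, copied from the summary above.
DOBJ_PATTERN = re.compile(
    r"^(nmdc):dobj-([0-9][a-z]{0,6}[0-9])-([A-Za-z0-9]{1,})$"
)


def conforms(identifier):
    """Return True when the identifier matches the data object pattern."""
    return DOBJ_PATTERN.match(identifier) is not None


# Hypothetical IDs, not taken from the production database.
examples = [
    "nmdc:dobj-11-abc123",  # minted-style ID: prefix, typecode, shoulder, blade
    "emsl:output_123456",   # legacy ID with a non-nmdc prefix
    "nmdc:" + "0" * 32,     # legacy ID: nmdc prefix followed by 32 characters
]
results = {identifier: conforms(identifier) for identifier in examples}
```

Only the first ID conforms. The other two fail for the reasons given in the summary: a non-nmdc prefix, or an nmdc prefix with no dobj typecode, shoulder and blade.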

turbomam commented 3 weeks ago

prod-mondo-vs-materialized-structured-patterns_10_pct.txt

aclum commented 2 weeks ago

@turbomam anything left to do here? The only remaining issue should be the ID version for WorkflowExecutionActivity subclasses. There are no more legacy ID patterns (i.e. non-nmdc prefixes), the legacy version of the materialized pattern JSON schema file is no longer part of the release, and @eecavanna has removed the nmdc-runtime dependencies on the accepting-legacy-ID version of the schema.

turbomam commented 1 week ago

I just reran the validation on a fresh dump earlier today. I think I agree with your statement in general. Or rather, I would say that I still see IDs that require a version suffix but either lack it or have two or more version suffixes.
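
A minimal sketch of how such IDs could be sorted into those cases, assuming a version suffix is a literal dot followed by digits at the end of the ID. That reading, the function names and the example IDs are assumptions for illustration, not something confirmed in this thread.

```python
import re

# Assumption: a version suffix is a literal "." followed by one or more
# digits, and all version suffixes sit at the very end of the ID.
TRAILING_VERSIONS = re.compile(r"((?:\.[0-9]+)*)$")


def count_version_suffixes(identifier):
    """Count the consecutive '.<digits>' groups at the end of the ID."""
    tail = TRAILING_VERSIONS.search(identifier).group(1)
    return tail.count(".")


def classify(identifier):
    """Label an ID as lacking a version, having one, or having several."""
    count = count_version_suffixes(identifier)
    if count == 0:
        return "missing version"
    if count == 1:
        return "ok"
    return "multiple versions"


# Hypothetical workflow execution IDs.
labels = {
    identifier: classify(identifier)
    for identifier in (
        "nmdc:wfnom-11-abc123",
        "nmdc:wfnom-11-abc123.1",
        "nmdc:wfnom-11-abc123.1.1",
    )
}
```

Run over the id values in a dump, a tally of these labels would show how many records fall into each of the two failure cases.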

A related issue is that some structured_patterns in the nmdc-schema appear to start with a ^ and others don't. I don't think the ^ is ever necessary, because it is already included in the abstraction of the prefix. That means we have some structured_patterns that materialize to something starting with ^^.
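
The doubled anchor can be illustrated with a small sketch of pattern materialization. The setting names and values below are assumptions chosen to resemble the patterns quoted earlier in this thread, and materialize is a hypothetical stand-in for whatever the schema tooling actually does.

```python
import re

# Hypothetical settings; note that the prefix setting already carries the ^.
SETTINGS = {
    "id_nmdc_prefix": "^(nmdc)",
    "id_shoulder": "([0-9][a-z]{0,6}[0-9])",
    "id_blade": "([A-Za-z0-9]{1,})",
}


def materialize(structured_pattern, settings=SETTINGS):
    """Replace each {name} placeholder with the matching setting value."""
    return re.sub(
        r"\{(\w+)\}",
        lambda match: settings[match.group(1)],
        structured_pattern,
    )


with_caret = materialize("^{id_nmdc_prefix}:dobj-{id_shoulder}-{id_blade}$")
without_caret = materialize("{id_nmdc_prefix}:dobj-{id_shoulder}-{id_blade}$")

# with_caret starts with "^^"; without_caret starts with a single "^".
both_match = all(
    re.match(pattern, "nmdc:dobj-11-abc123") is not None
    for pattern in (with_caret, without_caret)
)
```

In Python's re module the doubled anchor is redundant rather than harmful, since both forms match the same IDs. Other regex engines may treat it differently, so dropping the leading ^ from the structured_pattern keeps the materialized output unambiguous.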

Also, I still have to remove two sections of code that repair broken CURIEs and validate both the YAML dump and the RDF conversion output.

aclum commented 1 week ago

Can the outstanding issues you mention be addressed this sprint?

ssarrafan commented 1 week ago

Can the outstanding issues you mention be addressed this sprint?

Should this issue be broken up into smaller "outstanding issues" tickets? @aclum @turbomam

aclum commented 5 days ago

closing this in favor of more granular issues based on the outstanding tasks @turbomam identified.

turbomam commented 5 days ago

see

https://github.com/microbiomedata/nmdc-schema/pull/2094/files