microbiomedata / nmdc-schema

National Microbiome Data Collaborative (NMDC) unified data model
https://microbiomedata.github.io/nmdc-schema/
Creative Commons Zero v1.0 Universal

2024-06-18 `id` pattern validation summaries and SPARQL-based referential integrity checks on MongoDB contents with and without migration #2078

Open turbomam opened 2 weeks ago

turbomam commented 2 weeks ago

without migration

This is generally how I've done RDF dumps in the past.

Results:

linkml-validate \
    --schema nmdc_schema/nmdc_materialized_patterns.yaml \
    --target-class Database local/mongo_as_nmdc_database_rdf_safe.yaml > local/dump-vs-materialized.log
wc -l local/dump-vs-materialized.log

11914 local/dump-vs-materialized.log

awk 'BEGIN {srand()} !/^$/ { if (rand() <= .1) print $0}' local/dump-vs-materialized.log  > \
    local/dump-vs-materialized_10_pct.txt

dump-vs-materialized_10_pct.txt

Attached local/dump-vs-materialized_10_pct.txt to a chat at https://claude.ai/, using the Claude 3 Opus model, and asked:

summarize the validation failures in this attached text:

Response took > 1 minute

The validation failures in the attached text are related to IDs not matching the expected regular expression patterns. Here's a summary:

  1. Many data object IDs in the /data_object_set do not match the pattern '^(nmdc):dobj-([0-9][a-z]{0,6}[0-9])-([A-Za-z0-9]{1,})$'. The IDs contain hashes or other unexpected characters instead of the expected format.

  2. Many was_generated_by properties in the /data_object_set reference IDs that do not match the pattern '^(nmdc):(wfmag|wfmb|wfmgan|wfmgas|wfmsa|wfmp|wfmt|wfmtan|wfmtas|wfnom|wfrbt|wfrqc|wf|omprc)-([0-9][a-z]{0,6}[0-9])-([A-Za-z0-9]{1,})(\.[0-9]{1,})$'. The referenced IDs are missing the expected workflow prefix or have unexpected characters.

  3. Many nom_analysis_activity IDs in the /nom_analysis_activity_set do not match the pattern '^(nmdc):wfnom-([0-9][a-z]{0,6}[0-9])-([A-Za-z0-9]{1,})(\.[0-9]{1,})$'. The IDs are missing the expected version number at the end.

  4. Some read_based_taxonomy_analysis_activity IDs in the /read_based_taxonomy_analysis_activity_set do not match the pattern '^(nmdc):wfrbt-([0-9][a-z]{0,6}[0-9])-([A-Za-z0-9]{1,})(\.[0-9]{1,})$'. The IDs are missing the expected version number at the end.

  5. Some read_qc_analysis_activity IDs in the /read_qc_analysis_activity_set do not match the pattern '^(nmdc):wfrqc-([0-9][a-z]{0,6}[0-9])-([A-Za-z0-9]{1,})(\.[0-9]{1,})$'. The IDs are missing the expected version number at the end.

In summary, the validation failures are caused by IDs not adhering to the expected naming conventions and regular expression patterns defined for each type of object. The IDs contain unexpected characters or are missing required components like prefixes or version numbers.
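A deterministic tally can complement the LLM summary of a 10% sample by counting every failure in the full log per violated pattern. The following is a minimal sketch, not the method used in this issue. It assumes each failure line contains the phrase `does not match '<pattern>'`, which is how jsonschema-style messages usually read; the function name `tally_by_pattern` and the sample lines are hypothetical and are not copied from the real log.

```python
import re
from collections import Counter

# ASSUMPTION: failure lines contain "does not match '<pattern>'".
# Adjust this expression if the real log is shaped differently.
PATTERN_IN_MESSAGE = re.compile(r"does not match '(\^.*\$)'")


def tally_by_pattern(lines):
    """Count validation failures per violated regular expression."""
    counts = Counter()
    for line in lines:
        if not line.strip():
            continue  # skip blank lines, as the awk sampler does
        found = PATTERN_IN_MESSAGE.search(line)
        counts[found.group(1) if found else "(unparsed)"] += 1
    return counts


# Patterns quoted in the summary above.
DOBJ = r"^(nmdc):dobj-([0-9][a-z]{0,6}[0-9])-([A-Za-z0-9]{1,})$"
WFNOM = r"^(nmdc):wfnom-([0-9][a-z]{0,6}[0-9])-([A-Za-z0-9]{1,})(\.[0-9]{1,})$"

# Hypothetical log lines, for illustration only.
sample_lines = [
    f"[ERROR] 'nmdc:aaa' does not match '{DOBJ}' in /data_object_set/0/id",
    f"[ERROR] 'nmdc:bbb' does not match '{DOBJ}' in /data_object_set/1/id",
    f"[ERROR] 'nmdc:wfnom-11-ccc' does not match '{WFNOM}' in /nom_analysis_activity_set/0/id",
    "",
    "a line in some other shape",
]

counts = tally_by_pattern(sample_lines)
for pattern, n in counts.most_common():
    print(n, pattern)
```

To run this against the real output, pass an open file handle for local/dump-vs-materialized.log instead of `sample_lines`. A large "(unparsed)" count would indicate that the assumed message shape is wrong.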

turbomam commented 2 weeks ago

with migration

Results:

linkml-validate \
    --schema nmdc_schema/nmdc_materialized_patterns.yaml \
    --target-class Database local/mongo_as_nmdc_database_rdf_safe.yaml > local/migrated-dump-vs-materialized.log
wc -l local/migrated-dump-vs-materialized.log

10093 local/migrated-dump-vs-materialized.log

The Claude analysis overlaps a lot but isn't identical:

The attached text contains numerous validation errors in a dataset. The errors can be summarized as follows:

  1. Many data object IDs in the /data_object_set do not match the required regular expression pattern '^(nmdc):dobj-([0-9][a-z]{0,6}[0-9])-([A-Za-z0-9]{1,})$'. The non-matching IDs start with prefixes like 'emsl:', 'nmdc:22d29dd82f88784bc8bfdb6ce813581a', 'nmdc:40addaf8b9d84a5d076dbd654eb840a1', etc.

  2. Many workflow execution IDs referenced in the /workflow_execution_set//has_input/ fields do not match the required regular expression patterns '^^(nmdc):(wfmag|wfmb|wfmgan|wfmgas|wfmsa|wfmp|wfmt|wfmtan|wfmtas|wfnom|wfrbt|wfrqc)-([0-9][a-z]{0,6}[0-9])-([A-Za-z0-9]{1,})(\.[0-9]{1,})$|^^(nmdc):(omprc|dgms|dgns)-([0-9][a-z]{0,6}[0-9])-([A-Za-z0-9]{1,})$'. The non-matching IDs have prefix 'nmdc:wfnom-' followed by numbers like '11', '13' etc.

  3. Some data object IDs referenced in the /workflow_execution_set//has_input/ fields do not match the required regular expression pattern '^(nmdc):(bsm|procsm)-([0-9][a-z]{0,6}[0-9])-([A-Za-z0-9]{1,})$'. The non-matching IDs have prefix 'nmdc:dobj-' followed by numbers like '11', '13' etc.

In summary, the validation errors are due to data object and workflow IDs not conforming to the expected naming patterns/regular expressions defined in the validation rules.
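The failure modes named in both summaries (hash-style data object IDs, foreign prefixes, and workflow IDs without a version suffix) can be reproduced directly against the quoted patterns. This is a small sketch using Python's `re` module. Only the hash-style ID is quoted from the summary above; every other ID is a hypothetical example made up for illustration.

```python
import re

# Patterns as quoted in the validation summaries.
DOBJ = re.compile(r"^(nmdc):dobj-([0-9][a-z]{0,6}[0-9])-([A-Za-z0-9]{1,})$")
WFNOM = re.compile(
    r"^(nmdc):wfnom-([0-9][a-z]{0,6}[0-9])-([A-Za-z0-9]{1,})(\.[0-9]{1,})$"
)

checks = [
    # (candidate id, pattern, note)
    ("nmdc:dobj-11-abc123", DOBJ, "hypothetical, well-formed"),
    ("nmdc:22d29dd82f88784bc8bfdb6ce813581a", DOBJ, "hash-style ID from the summary"),
    ("emsl:12345", DOBJ, "hypothetical, foreign prefix"),
    ("nmdc:wfnom-11-abc123.1", WFNOM, "hypothetical, with version suffix"),
    ("nmdc:wfnom-11-abc123", WFNOM, "hypothetical, version suffix missing"),
]

results = {candidate: bool(pattern.match(candidate)) for candidate, pattern, _ in checks}

for candidate, pattern, note in checks:
    status = "ok  " if results[candidate] else "FAIL"
    print(status, candidate, f"({note})")
```

The workflow patterns end in a mandatory `(\.[0-9]{1,})` group, so any workflow ID without a trailing `.N` version fails even when its prefix and shoulder are well-formed.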

turbomam commented 2 weeks ago

GraphDB server currently at http://35.173.42.85

turbomam commented 2 weeks ago

Loaded local/mongo_as_nmdc_database_cuire_repaired_stamped.ttl in my local nmdc-schema into the <https://api.microbiomedata.org> named graph in two new GraphDB repositories: nmdc-2024-08016 and nmdc-migrated-2024-08016

Loaded project/owl/nmdc.owl.ttl into the <https://w3id.org/nmdc/nmdc> named graph in the same repositories

turbomam commented 2 weeks ago

Roll results up to studies. This may require different paths in the two GraphDB repositories.

turbomam commented 2 weeks ago

# what about the paths to EnvO terms etc. through env_broad_scale (MIXS:0000012) etc.?
# those terms of ControlledTermValues etc. do have type assertions in the RDF data now!
PREFIX nmdc: <https://w3id.org/nmdc/>
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX dcterms: <http://purl.org/dc/terms/>
select 
distinct ?stype ?s ?p ?o ?informer ?study_parthood ?study_association
where {
    graph <https://api.microbiomedata.org> {
        ?s ?p ?o .
        filter(isiri(?o))
        minus {
            ?o ?op ?oo
        }
        minus {
            ?s nmdc:applied_roles ?o # don't all look like IRIs
        }
        minus {
            ?s nmdc:designated_class ?o
        }
        minus {
            ?s nmdc:doi_value ?o
        }
        minus {
            ?s nmdc:metabolite_identified ?o # but  check values separately
        }
        minus {
            ?s nmdc:metabolite_quantified ?o # but  check values separately
        }
        minus {
            ?s nmdc:model ?o
        }
        minus {
            ?s nmdc:vendor ?o
        }
        minus {
            ?s rdf:type ?o
        }
    }
    minus {
        ?allowed rdfs:subPropertyOf* nmdc:alternative_identifiers .
        ?s ?allowed ?o
    }
    optional {
        ?s nmdc:was_informed_by ?informer .
        optional {
            ?informer dcterms:isPartOf ?study_parthood
        }
        optional {
            ?informer nmdc:associated_studies ?study_association
        }
    }
    optional {
        ?s rdf:type ?stype
    }
}
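The core of the query above is the first `minus` block: keep statements whose object is an IRI that never appears as the subject of any statement. The following pure-Python sketch illustrates that idea on an in-memory list of triples. It is a simplification under stated assumptions: it ignores named graphs and the `rdfs:subPropertyOf*` closure over `nmdc:alternative_identifiers`, and the `IRI` class, the function name, and all identifiers in the sample data are hypothetical.

```python
class IRI(str):
    """Marks a value as an IRI, to tell it apart from a plain literal."""


def dangling_references(triples, ignored_predicates=frozenset()):
    """Return triples whose IRI object never occurs as a subject."""
    subjects = {s for s, _, _ in triples}
    return [
        (s, p, o)
        for s, p, o in triples
        if isinstance(o, IRI)
        and p not in ignored_predicates
        and o not in subjects
    ]


# Hypothetical data: one workflow with a resolvable and an unresolvable input.
wf = IRI("nmdc:wfmgan-11-aaa.1")
triples = [
    (wf, "nmdc:has_input", IRI("nmdc:dobj-11-bbb")),
    (wf, "nmdc:has_input", IRI("nmdc:dobj-11-missing")),
    (wf, "rdf:type", IRI("nmdc:SomeWorkflowClass")),
    (IRI("nmdc:dobj-11-bbb"), "nmdc:name", "reads.fastq"),
]

# rdf:type is excluded here for the same reason the query excludes it:
# class IRIs are described in the schema graph, not in the data graph.
violations = dangling_references(triples, ignored_predicates={"rdf:type"})
for s, p, o in violations:
    print(s, p, o)
```

The other `minus` blocks in the query play the same role as `ignored_predicates` here: they exclude predicates whose objects are not expected to resolve to records in the data graph.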

turbomam commented 2 weeks ago

In both the migrated and un-migrated GraphDB repositories, there are 33 referential integrity violations. In each one, a workflow record (from one of two studies) asserts a has_input value that is not defined in the data.

migrated-refinteg-violations-2024-06-18.csv
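Rolling the violations up to studies could be sketched as below. This assumes the exported CSV uses the query's projected variable names as column headers (in particular `study_parthood` and `study_association`), which has not been checked against the attachment. The function name and the sample rows are hypothetical.

```python
import csv
import io
from collections import Counter


def violations_per_study(csv_text):
    """Count violation rows per study, preferring study_parthood."""
    counts = Counter()
    for row in csv.DictReader(io.StringIO(csv_text)):
        study = (
            row.get("study_parthood")
            or row.get("study_association")
            or "(no study)"
        )
        counts[study] += 1
    return counts


# Hypothetical rows, for illustration only (not taken from the real CSV).
sample_csv = """s,p,o,informer,study_parthood,study_association
nmdc:wf-1,nmdc:has_input,nmdc:x1,nmdc:omprc-1,,nmdc:sty-11-aaa
nmdc:wf-2,nmdc:has_input,nmdc:x2,nmdc:bsm-1,nmdc:sty-11-bbb,
nmdc:wf-3,nmdc:has_input,nmdc:x3,nmdc:omprc-2,,nmdc:sty-11-aaa
"""

per_study = violations_per_study(sample_csv)
for study, n in per_study.most_common():
    print(n, study)
```

Because the query uses two optional paths from the informer to a study, a row may carry a study in either column. The sketch takes whichever is present, so the same function could apply to exports from both repositories.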