Open turbomam opened 5 months ago
v11.0.0-rc.16
tagged release of berkeley-schema-fy24
make squeaky-clean all test
make make-rdf
make make-rdf
scrolled off of my terminal, so I didn't save it.
nmdc_schema/nmdc_schema_accepting_legacy_ids.yaml
linkml-validate \
--schema nmdc_schema/nmdc_materialized_patterns.yaml \
--target-class Database local/mongo_as_nmdc_database_rdf_safe.yaml > local/migrated-dump-vs-materialized.log
wc -l local/migrated-dump-vs-materialized.log
10093 local/migrated-dump-vs-materialized.log
The Claude analysis overlaps a lot but isn't identical:
The attached text contains numerous validation errors in a dataset. The errors can be summarized as follows:
Many data object IDs in the /data_object_set do not match the required regular expression pattern '^(nmdc):dobj-([0-9][a-z]{0,6}[0-9])-([A-Za-z0-9]{1,})$'. The non-matching IDs start with prefixes like 'emsl:', 'nmdc:22d29dd82f88784bc8bfdb6ce813581a', 'nmdc:40addaf8b9d84a5d076dbd654eb840a1', etc.
Many workflow execution IDs referenced in the /workflow_execution_set//has_input/ fields do not match the required regular expression patterns '^^(nmdc):(wfmag|wfmb|wfmgan|wfmgas|wfmsa|wfmp|wfmt|wfmtan|wfmtas|wfnom|wfrbt|wfrqc)-([0-9][a-z]{0,6}[0-9])-([A-Za-z0-9]{1,})(\.[0-9]{1,})$|^^(nmdc):(omprc|dgms|dgns)-([0-9][a-z]{0,6}[0-9])-([A-Za-z0-9]{1,})$'. The non-matching IDs have prefix 'nmdc:wfnom-' followed by numbers like '11', '13' etc.
Some data object IDs referenced in the /workflow_execution_set//has_input/ fields do not match the required regular expression pattern '^(nmdc):(bsm|procsm)-([0-9][a-z]{0,6}[0-9])-([A-Za-z0-9]{1,})$'. The non-matching IDs have prefix 'nmdc:dobj-' followed by numbers like '11', '13' etc.
In summary, the validation errors are due to data object and workflow IDs not conforming to the expected naming patterns/regular expressions defined in the validation rules.
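A quick way to spot-check individual IDs against the patterns quoted in the summary above is a few lines of Python. This is only a sketch: the two regular expressions are copied from the error messages, but the constant names, the `conforms` helper, and the `nmdc:dobj-11-abc123` and `emsl:12345` sample IDs are made up for illustration.

```python
import re

# Patterns copied verbatim from the validation messages quoted above.
# The constant names are illustrative, not taken from nmdc-schema.
DATA_OBJECT_ID = re.compile(
    r"^(nmdc):dobj-([0-9][a-z]{0,6}[0-9])-([A-Za-z0-9]{1,})$"
)
SAMPLE_ID = re.compile(
    r"^(nmdc):(bsm|procsm)-([0-9][a-z]{0,6}[0-9])-([A-Za-z0-9]{1,})$"
)


def conforms(identifier: str, pattern: re.Pattern) -> bool:
    """Return True if the identifier matches the given ID pattern."""
    return pattern.match(identifier) is not None


candidates = [
    "nmdc:dobj-11-abc123",                    # hypothetical, new-style shape
    "nmdc:22d29dd82f88784bc8bfdb6ce813581a",  # legacy ID from the summary
    "emsl:12345",                             # hypothetical emsl:-prefixed ID
]

for identifier in candidates:
    print(
        identifier,
        "data object:", conforms(identifier, DATA_OBJECT_ID),
        "sample:", conforms(identifier, SAMPLE_ID),
    )
```

The new-style data object ID passes the first pattern but fails the sample pattern, which is the same shape of mismatch the summary reports for `nmdc:dobj-` values in `has_input`.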
GraphDB server currently at http://35.173.42.85
Loaded local/mongo_as_nmdc_database_cuire_repaired_stamped.ttl
in my local nmdc-schema into the <https://api.microbiomedata.org>
named graph in two new GraphDB repositories: nmdc-2024-08016
and nmdc-migrated-2024-08016
Loaded project/owl/nmdc.owl.ttl
into the <https://w3id.org/nmdc/nmdc>
named graph in the same repositories
roll results up to studies... may require different paths in the two GraphDB repositories
# what about the paths to EnvO terms etc through env_broad_scale (MIXS:0000012) etc?
# those terms of ControlledTermValues etc do have type assertions in the RDF data now!
PREFIX nmdc: <https://w3id.org/nmdc/>
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX dcterms: <http://purl.org/dc/terms/>
select
distinct ?stype ?s ?p ?o ?informer ?study_parthood ?study_association
where {
    graph <https://api.microbiomedata.org> {
        ?s ?p ?o .
        filter(isiri(?o))
        minus {
            ?o ?op ?oo
        }
        minus {
            ?s nmdc:applied_roles ?o # don't all look like IRIs
        }
        minus {
            ?s nmdc:designated_class ?o
        }
        minus {
            ?s nmdc:doi_value ?o
        }
        minus {
            ?s nmdc:metabolite_identified ?o # but check values separately
        }
        minus {
            ?s nmdc:metabolite_quantified ?o # but check values separately
        }
        minus {
            ?s nmdc:model ?o
        }
        minus {
            ?s nmdc:vendor ?o
        }
        minus {
            ?s rdf:type ?o
        }
    }
    minus {
        ?allowed rdfs:subPropertyOf* nmdc:alternative_identifiers .
        ?s ?allowed ?o
    }
    optional {
        ?s nmdc:was_informed_by ?informer .
        optional {
            ?informer dcterms:isPartOf ?study_parthood
        }
        optional {
            ?informer nmdc:associated_studies ?study_association
        }
    }
    optional {
        ?s rdf:type ?stype
    }
}
In both the migrated and un-migrated GraphDB repositories, there are 33 referential integrity violations, in which some workflow (from one of two studies) asserts an undefined thing in has_input.
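For anyone who wants the gist of the check without a triple store, the core of the query (keep triples whose object is an IRI, then drop those whose object also appears as a subject) can be sketched in plain Python. Everything here is illustrative: the `IRI` marker class, the `dangling_references` function, and the sample triples are assumptions, and the sketch leaves out the `alternative_identifiers` sub-property exclusion and the optional study roll-up.

```python
class IRI(str):
    """Marker type: an object that is an IRI rather than a literal."""


def dangling_references(triples, ignored_predicates=frozenset()):
    """Return triples whose IRI object is never used as a subject.

    Mirrors `filter(isiri(?o))` plus `minus { ?o ?op ?oo }`, with
    `ignored_predicates` standing in for the per-predicate minus blocks.
    """
    subjects = {s for s, _, _ in triples}
    return [
        (s, p, o)
        for s, p, o in triples
        if p not in ignored_predicates
        and isinstance(o, IRI)
        and o not in subjects
    ]


# Hypothetical sample data: one workflow with one defined and one
# undefined input.
triples = [
    (IRI("nmdc:wfnom-11-aaa.1"), "rdf:type", IRI("nmdc:NomAnalysis")),
    (IRI("nmdc:wfnom-11-aaa.1"), "nmdc:has_input", IRI("nmdc:dobj-11-defined")),
    (IRI("nmdc:wfnom-11-aaa.1"), "nmdc:has_input", IRI("nmdc:dobj-11-missing")),
    (IRI("nmdc:dobj-11-defined"), "nmdc:name", "some file"),
]

violations = dangling_references(triples, ignored_predicates={"rdf:type"})
for s, p, o in violations:
    print(f"{s} {p} {o}  # object is never defined as a subject")
```

Only the `has_input` triple pointing at the undefined data object is reported. The `rdf:type` triple is skipped by the ignore list, and the literal name is skipped because it is not an `IRI`.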
without migration
This is generally how I've done RDF dumps in the past.
v10.5.4
tagged release of nmdc-schema
poetry update
make squeaky-clean all test
make make-rdf
, with one modification relative to v10.5.4
--migrator-name migrator_from_9_3_to_10_0 \
from the migration-recursion
configuration in the local/mongo_as_nmdc_database_rdf_safe.yaml
target nmdc_schema/nmdc_schema_accepting_legacy_ids.yaml
Results:
nmdc_schema/nmdc_schema_accepting_legacy_ids.yaml
dump-vs-materialized_10_pct.txt
Attached
local/dump-vs-materialized_10_pct.txt
into the https://claude.ai/ chat window, using the Claude 3 Opus model, and asked:
Response took > 1 minute