microbiomedata / nmdc-runtime

Runtime system for NMDC data management and orchestration
https://microbiomedata.github.io/nmdc-runtime/

Run Ref Integrity Notebook on post-re-iding Mongo #555

Closed PeopleMakeCulture closed 3 months ago

PeopleMakeCulture commented 3 months ago

See #570 for deprecating the schema's acceptance of legacy IDs.

aclum commented 3 months ago

Make sure nmdc-schema v10.5.5 is used. That will resolve some of the type errors for DataObject was_generated_by values that were discussed at the infrastructure meeting on Thursday.

PeopleMakeCulture commented 3 months ago

What's the best way to check which schema version I'm actually importing?
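One way to check, as a minimal sketch: the standard library's importlib.metadata reports the version of an installed distribution. The helper name installed_version is hypothetical, and the sketch assumes the package is installed under the distribution name "nmdc-schema" (the name pinned in the requirements file).

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional


def installed_version(dist_name: str) -> Optional[str]:
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return version(dist_name)
    except PackageNotFoundError:
        return None


# Assumption: the distribution is named "nmdc-schema" in this environment.
print(installed_version("nmdc-schema"))
```

Running this in the same kernel or virtualenv as the notebook matters, since a different environment may have a different version installed. From a shell, `pip show nmdc-schema` reports the same information.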

PeopleMakeCulture commented 3 months ago

I re-ran the Bulk Validation and Ref Integrity notebook after updating the pin to nmdc-schema==10.5.5 in nmdc-runtime/requirements/main.in. Here are the new results:

len(errors["not_found"]), len(errors["invalid_type"])
# results prior to re-id-ing: (4857, 23503)
# results prior to v10.5.5: (33, 20488)
# results with v10.5.5: (33, 6900)
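For readers unfamiliar with the two buckets, here is a minimal sketch of how a referential integrity check could populate "not_found" and "invalid_type". It is not the notebook's actual code: the function check_refs, the toy documents, the toy IDs, and the type names are all hypothetical and only illustrate the idea.

```python
from collections import defaultdict


def check_refs(docs_by_id, ref_fields):
    """Bucket broken references.

    docs_by_id: {id: document dict with "id" and "type" keys} (toy stand-in for Mongo).
    ref_fields: {field name: set of type names the referenced doc may have}.
    """
    errors = defaultdict(list)
    for doc in docs_by_id.values():
        for field, allowed_types in ref_fields.items():
            value = doc.get(field)
            if value is None:
                continue
            # A referencing field may hold one ID or a list of IDs.
            values = value if isinstance(value, list) else [value]
            for v in values:
                target = docs_by_id.get(v)
                if target is None:
                    errors["not_found"].append(
                        f"doc {doc['id']}: field {field} referenced doc {v} not found"
                    )
                elif target["type"] not in allowed_types:
                    errors["invalid_type"].append(
                        f"doc {doc['id']}: field {field} referenced doc {v} "
                        f"not of type {sorted(allowed_types)}"
                    )
    return errors


# Toy data; IDs and type names are made up for illustration.
toy_docs = {
    "toy:data-1": {"id": "toy:data-1", "type": "DataObject", "was_generated_by": "toy:run-1"},
    "toy:data-2": {"id": "toy:data-2", "type": "DataObject", "was_generated_by": "toy:missing"},
    "toy:data-3": {"id": "toy:data-3", "type": "DataObject", "was_generated_by": "toy:data-1"},
    "toy:run-1": {"id": "toy:run-1", "type": "Run"},
}
toy_ref_fields = {"was_generated_by": {"Run"}}

errors = check_refs(toy_docs, toy_ref_fields)
print(len(errors["not_found"]), len(errors["invalid_type"]))
```

In this toy set, one reference points at an ID that does not exist and one points at a document of the wrong type, so each bucket ends up with a single entry.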

Results

The number of schema validation errors (i.e., "invalid_type") dropped from 20,488 to 6,900 with v10.5.5, a reduction of about 66%. The "not_found" count was unchanged at 33.

Samples

Here are some samples of the kinds of errors that remain. Error messages are formatted as f"{name} doc {doc['id']}: field {field} referenced doc {v} not of type {slot_range}":