microbiomedata / nmdc-schema

National Microbiome Data Collaborative (NMDC) unified data model
https://microbiomedata.github.io/nmdc-schema/
Creative Commons Zero v1.0 Universal

How will repair of NMDC DataHarmonizer submissions be triggered? #313

Closed: turbomam closed this issue 11 months ago

turbomam commented 12 months ago

@pkalita-lbl can you please help me update or close this issue?

Maybe this should be rephrased or split into two issues. Possibly at least one of those could be immediately closed.

pkalita-lbl commented 12 months ago

> document how NMDC SubmissionPortal contents are converted into nmdc-schema objects and inserted into MongoDB

The code that does the translation is here: https://github.com/microbiomedata/nmdc-runtime/blob/46d6543339d2436524475a624644652d901a6517/nmdc_runtime/site/translation/submission_portal_translator.py. If you want to understand the particulars of what it does, the main "entry point" to that class is the get_database method.

Once the nmdc:Database object is prepared it is submitted to MongoDB via the /metadata/json:submit API endpoint.
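As a rough illustration of that submit step, here is a minimal sketch that builds (but does not send) a POST request to /metadata/json:submit. Only the endpoint path comes from the discussion above. The host, the bearer-token header, the `build_submit_request` helper, and the example payload are all assumptions made for illustration, so check the runtime's API documentation before relying on any of them.

```python
import json
import urllib.request

# Assumption: placeholder host, not the real NMDC Runtime API base URL.
BASE_URL = "https://runtime.example.org"


def build_submit_request(database: dict, token: str) -> urllib.request.Request:
    """Build, without sending, a POST request for /metadata/json:submit."""
    body = json.dumps(database).encode("utf-8")
    return urllib.request.Request(
        url=f"{BASE_URL}/metadata/json:submit",
        data=body,
        method="POST",
        headers={
            "Content-Type": "application/json",
            # Assumption: bearer-token auth; the real scheme may differ.
            "Authorization": f"Bearer {token}",
        },
    )


# Hypothetical, heavily simplified stand-in for an nmdc:Database object.
example_database = {
    "study_set": [{"id": "nmdc:sty-00-000001", "name": "Example study"}],
}

request = build_submit_request(example_database, token="dummy-token")
print(request.get_method(), request.full_url)
```

Passing the request to `urllib.request.urlopen` would actually send it; the sketch stops short of that so it can be run without credentials.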

Fetching from the submission portal, translating, and submitting to MongoDB are orchestrated by a Dagster job. (There's actually a second Dagster job that does the first two steps and then only a validation step, which is useful for testing.)
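The shape of those two jobs can be sketched in plain Python, without Dagster, to show how they share the first two steps and differ only in the last one. Every function name and data shape below is hypothetical, and the stubs stand in for the real portal client, translator, and API call; this is not the nmdc-runtime code.

```python
from typing import Callable

# In-memory stand-in for MongoDB, so the sketch runs without any services.
submitted: list = []


def fetch_submission(submission_id: str) -> dict:
    """Stub for fetching one submission from the submission portal."""
    return {"id": submission_id, "studyName": "Example study"}


def translate(submission: dict) -> dict:
    """Stub for the translator step that produces a Database-like dict."""
    study = {"id": f"nmdc:sty-{submission['id']}", "name": submission["studyName"]}
    return {"study_set": [study]}


def validate(database: dict) -> dict:
    """Toy validation: every record in every collection must have an id."""
    for collection, records in database.items():
        for record in records:
            if "id" not in record:
                raise ValueError(f"record in {collection} is missing an id")
    return database


def submit(database: dict) -> dict:
    """Stub for the submit step: validate, then store."""
    submitted.append(validate(database))
    return database


def run_job(submission_id: str, final_step: Callable[[dict], dict]) -> dict:
    """Fetch, translate, then hand the result to the job's final step."""
    return final_step(translate(fetch_submission(submission_id)))


dry_run_result = run_job("00-000001", validate)  # testing job: nothing stored
stored_before_ingest = len(submitted)
ingest_result = run_job("00-000001", submit)     # ingest job: stored
```

The dry-run variant exercises the same fetch and translate code path as the ingest variant, which is why it is useful for testing translator changes without touching the database.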

I can add something to that effect to the runtime documentation. But I'd caution against documenting at the level of "this field from the submission gets capitalized and reversed and put in this field of the Study object" because that will get stale and out of date so fast.

> identify data patterns that might pass through that process but fail linkml-validate against src/schema/nmdc.yaml

I don't know if what you're saying is actually possible. The data is validated against the schema at multiple points in the process before going into MongoDB. Regarding your example, as far as I can tell, the env_broad_scale slot has a description that recommends using certain ENVO terms, but there's nothing in the schema that technically enforces that. I guess this should have been caught by a manual review of the submission up front, but it wasn't in this case.
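The gap between "recommended in the description" and "enforced by the schema" can be shown with a toy check. This is only a sketch: the CURIE pattern, the flat string value, the allow-list, and both term IDs are made-up placeholders, not what nmdc.yaml or linkml-validate actually does.

```python
import re

# Assumption: the only structural rule is "looks like a prefix:digits CURIE".
CURIE_PATTERN = re.compile(r"^[A-Za-z]+:\d+$")

# Placeholder for whichever terms the slot description recommends.
RECOMMENDED_TERMS = {"ENVO:00000446"}


def passes_schema(biosample: dict) -> bool:
    """Structural check only: the value must be a well-formed CURIE string."""
    value = biosample.get("env_broad_scale")
    return isinstance(value, str) and CURIE_PATTERN.match(value) is not None


def follows_recommendation(biosample: dict) -> bool:
    """Stricter check that a schema description alone does not enforce."""
    return biosample.get("env_broad_scale") in RECOMMENDED_TERMS


# Hypothetical record whose term is well formed but not on the recommended list.
off_list = {"id": "nmdc:bsm-00-000001", "env_broad_scale": "ENVO:99999999"}

print(passes_schema(off_list), follows_recommendation(off_list))  # True False
```

A record like `off_list` is valid as far as the structural check is concerned, so the only ways to catch it are a manual review or turning the recommendation into an actual schema constraint.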