microbiomedata / nmdc-schema

National Microbiome Data Collaborative (NMDC) unified data model
https://microbiomedata.github.io/nmdc-schema/
Creative Commons Zero v1.0 Universal

Restore dump_single_modality version of pure-export #1985

Closed turbomam closed 1 month ago

turbomam commented 1 month ago

This adds two discrete modes to pure-export: one that exclusively dumps from an API and one that exclusively dumps from a MongoDB. The previous state was an awkward hybrid.

Most, but not all, of the command line options remain the same.

See the two local/mongo_as_unvalidated_nmdc_database.yaml targets in project.Makefile. (I used the same name for each and just commented out the second.)

turbomam commented 1 month ago

thanks @eecavanna. I applied those changes. I think there's a bug in my code, and the output is coming out malformed. As a result, migration-recursion is crashing.

I'm going to work on fixing my output, but I would also like to hear any feedback from you and @brynnz22 on what assumptions migration-recursion makes regarding the structure of its inputs. I plan on taking an LLM approach to figuring that out eventually and will let you know if I make progress.

turbomam commented 1 month ago

My code (in API mode) was writing a list of documents, not a dictionary of lists of documents.
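To illustrate the difference between the two shapes, here is a minimal sketch, not the actual pure-export code. It assumes each fetched document arrives paired with the name of the collection it came from; the helper `group_by_collection` and the example ids are hypothetical.

```python
from collections import defaultdict


def group_by_collection(tagged_docs):
    """Turn (collection_name, document) pairs into {collection_name: [documents]}.

    Hypothetical helper: the real exporter may track collection names differently.
    """
    grouped = defaultdict(list)
    for collection_name, doc in tagged_docs:
        grouped[collection_name].append(doc)
    return dict(grouped)


# Made-up documents, tagged with the collection they were fetched from.
tagged = [
    ("biosample_set", {"id": "nmdc:bsm-00-000001"}),
    ("study_set", {"id": "nmdc:sty-00-000001"}),
    ("biosample_set", {"id": "nmdc:bsm-00-000002"}),
]

# Malformed shape: a flat list of documents with no collection keys.
flat_list = [doc for _, doc in tagged]

# Expected shape: a dictionary of lists of documents, keyed by collection.
database = group_by_collection(tagged)
```

Downstream steps that iterate over collection names would fail on `flat_list` because it has no keys to iterate over.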

turbomam commented 1 month ago

The optional Jena riot validation step after linkml-convert and before anyuri-strings-to-iris is complaining about the my_emsl prefix, even though I specifically pasted it at the top of local/mongo_as_nmdc_database.ttl.

After anyuri-strings-to-iris, riot doesn't complain any more.

Note that there are three different EMSL prefix expansion methods. All of these are defined in the schema:


  1. emsl -> "http://example.org/emsl_in_mongodb/" doesn't require any intervention
  2. my_emsl -> "https://release.my.emsl.pnnl.gov/released_data/" is implemented by pasting assets/my_emsl_prefix.ttl onto the beginning of local/mongo_as_nmdc_database.ttl
  3. emsl_uuid_like -> "http://example.org/emsl_uuid_like/" is implemented with the --emsl-uuid-replacement option on anyuri-strings-to-iris. It is applied to any triple in which the predicate is nmdc:emsl_biosample_identifiers aka <https://w3id.org/nmdc/emsl_biosample_identifiers> and the object is an xsd:anyURI tagged string beginning with 'UUID:'
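The matching rule in the third method can be sketched roughly as follows. This is not the anyuri-strings-to-iris implementation: triples are modeled as plain tuples, a typed literal is a hypothetical `(lexical, datatype)` pair, and dropping the `UUID:` prefix before appending to the base IRI is an assumption about how the replacement is formed.

```python
XSD_ANYURI = "http://www.w3.org/2001/XMLSchema#anyURI"
EMSL_BIOSAMPLE_IDS = "https://w3id.org/nmdc/emsl_biosample_identifiers"
EMSL_UUID_LIKE = "http://example.org/emsl_uuid_like/"


def replace_emsl_uuid(triple):
    """Rewrite a 'UUID:'-prefixed xsd:anyURI object into an IRI string.

    Objects are either an IRI string or a (lexical, datatype) literal pair.
    Triples that do not match the rule are returned unchanged.
    """
    subject, predicate, obj = triple
    if predicate == EMSL_BIOSAMPLE_IDS and isinstance(obj, tuple):
        lexical, datatype = obj
        if datatype == XSD_ANYURI and lexical.startswith("UUID:"):
            # Assumption: the text after 'UUID:' becomes the local part of the IRI.
            return (subject, predicate, EMSL_UUID_LIKE + lexical[len("UUID:"):])
    return triple


# Made-up triples for illustration only.
matching = ("nmdc:bsm-00-000001", EMSL_BIOSAMPLE_IDS, ("UUID:abc-123", XSD_ANYURI))
other = ("nmdc:bsm-00-000001", EMSL_BIOSAMPLE_IDS, ("emsl:999", XSD_ANYURI))

rewritten = replace_emsl_uuid(matching)
untouched = replace_emsl_uuid(other)
```

Only the first triple is rewritten; the second keeps its literal object because the string does not begin with 'UUID:'.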