microbiomedata / nmdc-runtime

Runtime system for NMDC data management and orchestration
https://microbiomedata.github.io/nmdc-runtime/

Materialize a version of `all_docs` collection in dev mongodb #543

Closed. PeopleMakeCulture closed this 2 months ago

PeopleMakeCulture commented 4 months ago

@sujaypatil96 Could you describe the use case you had in mind?

sujaypatil96 commented 4 months ago

The alldocs collection that you are creating as part of your referential integrity checking and validation notebook/PR (metadata-translation/notebooks/repl_validation_referential_integrity-1715162638.ipynb) could be very useful for a use case in one of the existing squads, the NCBI Export squad.

There is a PR on runtime that implements all the requirements laid out for that squad: https://github.com/microbiomedata/nmdc-runtime/pull/518

One blocker to continuing development on that PR is that we need a way to retrieve the URLs of DataObject records (in data_object_set) given a Biosample record or id. Here are the two cases that need to be handled:

So we need a method (an @op, an API endpoint, etc.) that does this: I need to be able to plug in a Biosample record or id and retrieve the DataObjects associated with it.

It's not realistic to do this search in real time by iterating over all the different collections. Instead, it would be nice to have one materialized collection (like alldocs) on top of which we can implement a method that checks inputs and outputs and returns the desired DataObjects.
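
To make the idea concrete, here is a minimal in-memory sketch of that lookup. It is not the runtime's implementation: the function names are hypothetical, and the document shape (`id`, `type`, `has_input`, `has_output`, `url`) is an assumption about what an alldocs-style collection might hold. It walks from a Biosample id through any documents that take it as input, follows their outputs, and collects the URLs of the DataObjects it reaches.

```python
from collections import defaultdict, deque


def build_input_index(all_docs):
    """Map each id to the documents that list it in `has_input` (assumed field)."""
    index = defaultdict(list)
    for doc in all_docs:
        for input_id in doc.get("has_input", []):
            index[input_id].append(doc)
    return index


def data_object_urls_for_biosample(biosample_id, all_docs):
    """Hypothetical helper: breadth-first walk from a Biosample id to DataObject URLs."""
    docs_by_id = {doc["id"]: doc for doc in all_docs}
    consumers = build_input_index(all_docs)

    seen = {biosample_id}
    queue = deque([biosample_id])
    urls = []
    while queue:
        current = queue.popleft()
        # Every document that consumes `current` may produce further outputs.
        for process in consumers.get(current, []):
            for output_id in process.get("has_output", []):
                if output_id in seen:
                    continue
                seen.add(output_id)
                queue.append(output_id)
                output = docs_by_id.get(output_id)
                # Only DataObjects that actually carry a URL are reported.
                if output and output.get("type") == "DataObject" and "url" in output:
                    urls.append(output["url"])
    return urls


# Made-up documents for illustration only; ids and URLs are placeholders.
ALL_DOCS = [
    {"id": "bsm-1", "type": "Biosample"},
    {"id": "proc-1", "type": "Process", "has_input": ["bsm-1"], "has_output": ["dobj-1"]},
    {"id": "dobj-1", "type": "DataObject", "url": "https://example.org/raw.fastq"},
    {"id": "proc-2", "type": "Process", "has_input": ["dobj-1"], "has_output": ["dobj-2"]},
    {"id": "dobj-2", "type": "DataObject", "url": "https://example.org/assembly.fasta"},
    {"id": "bsm-2", "type": "Biosample"},
]

if __name__ == "__main__":
    print(data_object_urls_for_biosample("bsm-1", ALL_DOCS))
```

With a real materialized collection, the same walk could presumably run against indexed `has_input` / `has_output` fields in MongoDB instead of Python lists; the point of the sketch is only that a single collection makes the traversal one uniform loop.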

sujaypatil96 commented 3 months ago

The above usage is a specific use case of the Database roll-up described in #551