kbase / dts

A data transfer service
https://kbase.github.io/dts/
MIT License

Add an NMDC database integration #83

Open jeff-cohere opened 1 week ago

jeff-cohere commented 1 week ago

This issue tracks the development of an NMDC Database implementation within DTS. This work is being done in the nmdc branch.

NMDC Schema Notes

The DataObject is NMDC's file-level abstraction, and seems to have a 1:1 relationship with the Frictionless DataResources we're using in DTS. However, this type doesn't have very much metadata (it barely has a file path!).
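To make the 1:1 mapping concrete, here's a minimal sketch in Python of turning one DataObject record into a Frictionless-style resource descriptor. The DataObject field names used here (`id`, `name`, `description`, `url`, `file_size_bytes`, `md5_checksum`) are my reading of the NMDC schema and should be checked against it, and `data_object_to_resource` is a hypothetical helper, not something that exists in DTS.

```python
from typing import Any


def data_object_to_resource(data_object: dict[str, Any]) -> dict[str, Any]:
    """Map one NMDC DataObject record to a Frictionless-style resource descriptor.

    Field names on both sides are assumptions made for illustration.
    """
    url = data_object.get("url", "")
    # Derive a format hint from the file extension, if there is one.
    filename = url.rsplit("/", 1)[-1]
    file_format = filename.rsplit(".", 1)[-1].lower() if "." in filename else ""

    resource: dict[str, Any] = {
        "id": data_object["id"],
        "name": data_object.get("name", data_object["id"]),
        "path": url,
        "format": file_format,
        "bytes": data_object.get("file_size_bytes", 0),
        "hash": data_object.get("md5_checksum", ""),
        "description": data_object.get("description", ""),
    }
    # Drop empty values so missing metadata stays visibly missing.
    return {key: value for key, value in resource.items() if value not in ("", 0, None)}
```

Anything not carried by the DataObject itself (credit, DOIs, and so on) is absent from the result, which is the gap the rest of this issue is about.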

DataObjects are evidently associated with metadata in NMDC via relationships with other classes in the schema. In particular, one can query NMDC for the DataObjects in a particular Study, which itself is related to other entities like PersonValue, CreditAssociation, and Doi.

NMDC's endpoints may need to change so that we can avoid several roundtrips when gathering all the metadata for a set of DataObjects, but there's probably a reasonably efficient way to do things as they are.

Study-Driven Search

If we adopt Studies as a basis for finding files, we are led to the studies endpoint. Here's a possible flow, just to let us count the roundtrips, with each one tallied as an increment in parentheses:

This is a lot of roundtrips, and an illustration of the costs of normalized schemas for clients without privileged database access. However, I think we could work with NMDC folks to get a special endpoint put in place for DTS access that implements a single MongoDB query that constructs a completely populated set of data-objects-with-metadata that belong to a set of studies. If this is the way we want to go about it, anyway...
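To pin down what such an endpoint might hand back, here's a minimal sketch of the denormalized shape, done as an in-memory join in Python rather than as a MongoDB query. Everything here is hypothetical: the `populate_data_objects` helper, the `credits` and `dois` keys, and the way DataObjects are grouped by study ID are assumptions for illustration, not NMDC's actual schema or API.

```python
from typing import Any


def populate_data_objects(
    studies: list[dict[str, Any]],
    data_objects_by_study: dict[str, list[dict[str, Any]]],
) -> list[dict[str, Any]]:
    """Attach study-level metadata to every DataObject belonging to each study.

    All field names are hypothetical placeholders.
    """
    populated = []
    for study in studies:
        study_metadata = {
            "study_id": study["id"],
            "credits": study.get("credits", []),
            "dois": study.get("dois", []),
        }
        for data_object in data_objects_by_study.get(study["id"], []):
            # Each output record carries both file-level and study-level fields.
            populated.append({**data_object, "study": study_metadata})
    return populated
```

If the server did this join for us, DTS would get one record per file, already carrying its study's metadata, from a single request per set of studies.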