Add an NMDC database integration

This issue tracks the development of an NMDC Database implementation within DTS. This work is being done in the nmdc branch.

NMDC Schema Notes

The DataObject is NMDC's file-level abstraction, and seems to have a 1:1 relationship with the Frictionless DataResources we're using in DTS. However, this type doesn't have very much metadata (it barely has a file path!).
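To make the 1:1 mapping concrete, here is a minimal sketch of turning one DataObject record into a Frictionless DataResource descriptor. The NMDC field names (`id`, `url`, `description`, `file_size_bytes`, `md5_checksum`) are assumptions about what a DataObject record carries and should be checked against the NMDC schema. The function name is hypothetical and is not part of DTS.

```python
import re
from typing import Any


def data_object_to_resource(data_object: dict[str, Any]) -> dict[str, Any]:
    """Map one NMDC DataObject record onto a Frictionless DataResource descriptor.

    All NMDC field names used here are assumptions, not confirmed schema slots.
    """
    # Frictionless resource names are limited to lowercase alphanumerics
    # plus ".", "-" and "_", so any other character in the ID becomes "_".
    resource: dict[str, Any] = {
        "name": re.sub(r"[^a-z0-9._-]", "_", data_object["id"].lower()),
    }

    # Copy over whatever file-level metadata happens to be present.
    # (assumed NMDC field) -> (Frictionless property)
    field_map = {
        "url": "path",
        "description": "description",
        "file_size_bytes": "bytes",
        "md5_checksum": "hash",
    }
    for nmdc_field, resource_field in field_map.items():
        if nmdc_field in data_object:
            resource[resource_field] = data_object[nmdc_field]
    return resource
```

Anything beyond these file-level fields (credit, DOIs, and so on) would still have to come from the related classes discussed below.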

DataObjects are evidently associated with metadata in NMDC via relationships with other classes in the schema. In particular, one can query NMDC for the DataObjects in a particular Study, which itself is related to other entities like PersonValue, CreditAssociation, and Doi.

NMDC's endpoints may need to change to avoid several roundtrips when gathering all the metadata for a set of DataObjects, but there's probably an optimal way to work with them as they are.

Study-Driven Search

If we adopt Studies as a basis for finding files, we are led to the studies endpoint. Here's a possible flow, just to let us count the roundtrips, with each step's contribution noted in parentheses:

A search for studies is performed using the above endpoint, with filters applied to select for desired study attributes. (1 roundtrip)
For each study returned by the search:
- relevant credit metadata is gathered from the study, including
  - a PI, if relevant (+1 roundtrip)
  - zero, one, or more credit associations (+1 roundtrip)
  - zero, one, or more DOIs (+1 roundtrip)
- IDs of data objects are fetched using the data_objects/study/{study_id} endpoint (+1 roundtrip)
- for each data object ID:
  - data object metadata is fetched using the data_objects/{data_object_id} endpoint (+1 roundtrip)
  - the data object metadata and its credit metadata are gathered into a Frictionless DataResource
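The tally implied by the flow above can be written down as a small function. This is only a worst-case sketch: it assumes every study needs all three credit lookups (PI, credit associations, DOIs) and that each lookup costs exactly one roundtrip. The function name is hypothetical.

```python
def count_roundtrips(objects_per_study: list[int]) -> int:
    """Worst-case roundtrip count for the study-driven flow.

    objects_per_study holds the number of data objects in each study
    returned by the initial search.
    """
    search = 1  # the initial study search
    # Per study: PI + credit associations + DOIs + data object ID listing.
    per_study = 4
    total = search
    for n_objects in objects_per_study:
        # One additional metadata fetch per data object in the study.
        total += per_study + n_objects
    return total


if __name__ == "__main__":
    # Two studies with 10 and 5 data objects: 1 + (4 + 10) + (4 + 5)
    print(count_roundtrips([10, 5]))  # prints 24
```

The count grows linearly with the total number of data objects, so the per-object fetch dominates for any study of realistic size.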

This is a lot of roundtrips, and an illustration of the costs of normalized schemas for clients without privileged database access. However, I think we could work with NMDC folks to get a special endpoint put in place for DTS access that implements a single MongoDB query that constructs a completely populated set of data-objects-with-metadata that belong to a set of studies. If this is the way we want to go about it, anyway...
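For discussion purposes, a single-query version might look roughly like the aggregation pipeline below. Everything schema-specific here is an assumption made for illustration: the collection name `data_objects`, the `study_id` linking field, and the projected study fields are hypothetical, and the real NMDC database may well link studies to data objects less directly. Only the `$match`, `$lookup`, and `$project` stages themselves are standard MongoDB.

```python
from typing import Any


def build_pipeline(study_ids: list[str]) -> list[dict[str, Any]]:
    """Build an aggregation pipeline joining studies to their data objects.

    Collection and field names are hypothetical placeholders.
    """
    return [
        # Select the studies of interest.
        {"$match": {"id": {"$in": study_ids}}},
        # Attach every data object that points back at each study
        # (assumes a direct study_id reference, which may not exist).
        {
            "$lookup": {
                "from": "data_objects",
                "localField": "id",
                "foreignField": "study_id",
                "as": "data_objects",
            }
        },
        # Keep only the credit metadata and the joined data objects.
        {
            "$project": {
                "_id": 0,
                "id": 1,
                "principal_investigator": 1,
                "has_credit_associations": 1,
                "associated_dois": 1,
                "data_objects": 1,
            }
        },
    ]
```

A pipeline like this would run server-side against the study collection, so DTS would pay one roundtrip for a whole set of studies instead of one per data object.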
