Closed eecavanna closed 2 months ago
Alternatively, mock up an API endpoint with the following inputs and outputs:

Inputs:
- `Study.id` (string)
- `WorkflowExecutionActivity.type` (string)

Output:
- `DataObject`s that are referenced by `WorkflowExecutionActivity.has_output`

The API endpoint would find all the `WorkflowExecutionActivity`s that have the specified type and are associated with the specified Study; then find the `DataObject`s referenced by those `WorkflowExecutionActivity`s.
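A minimal sketch of the lookup such an endpoint would perform, over in-memory stand-ins for the Mongo collections. The `part_of_study` field here is a hypothetical direct link; discovering the real association path from a Study to an activity is exactly the open question in this thread.

```python
def find_data_objects(study_id, activity_type, activities, data_objects):
    """Return the DataObject dicts referenced by `has_output` of every
    WorkflowExecutionActivity of `activity_type` tied to `study_id`.

    `activities` and `data_objects` are lists of dicts standing in for
    the Mongo collections; `part_of_study` is a hypothetical shortcut,
    not a real NMDC schema slot.
    """
    referenced = set()
    for act in activities:
        if act.get("type") == activity_type and act.get("part_of_study") == study_id:
            referenced.update(act.get("has_output", []))
    return [d for d in data_objects if d["id"] in referenced]
```

In the real Runtime this would be a Mongo query (or aggregation) per collection rather than a Python scan, but the shape of the join is the same.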
Reference:
Spun from #355
📓 FYI - I created this issue as a reminder to myself to mock up an endpoint that does what is described, so I could demo it to a teammate as a representative "endpoint that can abstract away a tedious query (or queries)". It is not the case that I, personally, find myself doing this particular query often.
Is this going to be worked on this month?
Certainly can be. What would help immensely is knowing all of the paths from study to data object. I suppose this could be done via a graph algorithm on a graph representation of the schema.
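One way to enumerate those paths, sketched here with a breadth-first search over a toy adjacency map standing in for the schema (the class and slot names in the test are illustrative, not the real NMDC schema):

```python
from collections import deque

def find_paths(schema_graph, start, goal):
    """Enumerate all simple paths from `start` to `goal` in a schema graph
    given as {class_name: {slot_name: range_class_name}} -- a toy stand-in
    for a graph derived from the LinkML schema."""
    paths = []
    queue = deque([(start, [start])])
    while queue:
        node, path = queue.popleft()
        for slot, target in schema_graph.get(node, {}).items():
            if target == goal:
                paths.append(path + [f"--{slot}-->", target])
            elif target not in path:  # avoid cycles
                queue.append((target, path + [f"--{slot}-->", target]))
    return paths
```

Because the paths come from the schema rather than a hand-maintained list, new classes added by the Berkeley refactor would be picked up automatically.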
@turbomam any ideas on how we could best do this?
I support using a graph representation of the schema; the Berkeley refactor will add a number of new paths, so this seems more robust than documenting the existing paths and having to add new ones in just a few weeks' time.
@dwinston @aclum let me know if this should be added to an NMDC or Infra sync meeting for discussion.
@ssarrafan please add to the infra sync meeting this week.
Let's try and find a time when @dwinston can join us - 12-1p Pacific on Friday 2/9 could work.
Proposal from @dwinston on Slack:
I think it's best to use a dedicated RDF system (Apache Jena Fuseki), deployed as a separate container, much in the same way the data portal uses a dedicated SQL system (Postgres) derived from Mongo, for its query needs.
This way, we don't have to develop and maintain a bespoke graph representation model and query language built on top of the mongo aggregation framework, which I think would slow agility. I have focused thus far on Apache Jena, specifically their TDB storage solution and Fuseki server, based in part on Wikidata's recent comparative evaluation of RDF backends and on some personal communication.
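To make the appeal of the RDF approach concrete: once the documents are projected to triples, a multi-hop traversal collapses into a single SPARQL property path. The sketch below just builds such a query string; the `nmdc:` prefix URI and slot names are illustrative assumptions, not the actual NMDC ontology terms.

```python
def sparql_for_path(study_id, slots):
    """Build a SPARQL query that follows a property path (a sequence of
    slot names) from a study node to data objects. The prefix and slot
    URIs are illustrative, not real NMDC terms."""
    path = "/".join(f"nmdc:{slot}" for slot in slots)
    return (
        "PREFIX nmdc: <https://w3id.org/nmdc/>\n"
        "SELECT ?dataObject WHERE {\n"
        f"  nmdc:{study_id} {path} ?dataObject .\n"
        "}"
    )
```

The equivalent Mongo aggregation would need one `$lookup` stage per hop, which is the bespoke-query-language burden the proposal is trying to avoid.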
I agree about using a dedicated RDF system, although it's not easy to do every kind of imaginable graph operation over RDF data.
I think the Jena ecosystem is a good choice for something embedded.
I hope that this particular issue and the solutions that have been shared so far can be part of an overall discussion about
I have deployed some resources that do many of those things well, but I've only done it in ways that are convenient for me and the few people I discuss this with most often.
I appreciate the fact that @PeopleMakeCulture and @dwinston are bringing their high level of professionalism to this
@shreddd @dwinston @PeopleMakeCulture @aclum This was on track to being complete by this month. Can someone give me a status update and close this ticket if done? Or add next steps and anticipated timeline to complete if not done? I could still use the list of accomplishments either way for the DOE quarterly report please.
Status:
@aclum please correct anything I got wrong here
Yes, that is correct.
I moved this ticket back into the "On base" state because I think it was temporarily closed by accident. That closure caused github-project-automation to change its state from "On base" to "Scored", and reopening the ticket didn't automatically change it back to "On base". ⚾
Moving to next sprint because I don't expect it to be done tomorrow (I didn't work on it this sprint).
@sujaypatil96 I can work on this with you on the 6/7 hackathon if you want
Copying comments from @sujaypatil96 for additional context:
As part of the efforts in the NCBI Export squad, one of the requirements that has come up is the need to be able to retrieve DataObjects (ids and URLs) given a Biosample id. Ideally, this would be a specific case of NMDC Database roll-up, but since we don't have the "machinery" for that just yet, we will need to implement something custom for this use case in the meantime.
The code for the NCBI Export squad is being developed in PR https://github.com/microbiomedata/nmdc-runtime/pull/518
The two cases that we need to cover are:
- The given Biosample id may be a direct input (via the has_input key) to an OmicsProcessing record, whose output (via the has_output key) will be one or two DataObject ids, and we need to retrieve the DataObject records for those ids, or
- The given Biosample id may be an input to a lab processing record (Pooling, Extraction, LibraryPreparation), whose output (via the has_output key) will be a ProcessedSample, and that ProcessedSample will be an input (via the has_input key) to an OmicsProcessing record
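The two cases above can be sketched as a traversal over an `alldocs`-style list of documents. The `type` values and key names below mirror the classes mentioned in this thread, but treat the exact field shapes as assumptions.

```python
def data_objects_for_biosample(biosample_id, docs):
    """Find DataObject docs for a Biosample via the two cases described
    above. `docs` is a list of dicts standing in for the `alldocs`
    collection, each with `id`, `type`, and optional `has_input` /
    `has_output` lists of ids."""
    def outputs_of(pred):
        out = []
        for d in docs:
            if pred(d):
                out.extend(d.get("has_output", []))
        return out

    # Case 1: Biosample is a direct input to an OmicsProcessing record.
    direct = outputs_of(lambda d: d.get("type") == "OmicsProcessing"
                        and biosample_id in d.get("has_input", []))

    # Case 2: Biosample -> lab processing -> ProcessedSample -> OmicsProcessing.
    lab_types = {"Pooling", "Extraction", "LibraryPreparation"}
    processed = outputs_of(lambda d: d.get("type") in lab_types
                           and biosample_id in d.get("has_input", []))
    indirect = outputs_of(lambda d: d.get("type") == "OmicsProcessing"
                          and any(p in d.get("has_input", []) for p in processed))

    wanted = set(direct + indirect)
    return [d for d in docs if d.get("type") == "DataObject" and d["id"] in wanted]
```

A real implementation would push these lookups into Mongo queries rather than scanning in Python, but the hop structure is the same.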
Implementation details:
- We can develop this either as an API endpoint or just as an @op and use it in code. Which would be better?
- We can use the get_mongo_db() method or the mongo resource. Which would be better?
- I'm also thinking that the method I implement will iterate over all the records in the alldocs collection
This ticket was originally about mocking up a specific endpoint to demonstrate how people could delegate the graph traversal to the Runtime instead of having to write complex Mongo queries. I originally assigned it to myself.
The scope of the task seems to me to have grown into: implementing a real (not mock-up) endpoint that can traverse the graph and will be able to accommodate changes to the schema/graph over time.
If this ticket represents the task of implementing a "roll up" endpoint (where "roll up" is a word the referential integrity/roll up squad members have been using lately), I'd propose updating its title and maybe reassigning it to @sujaypatil96 and putting it on the squad board for the referential integrity/roll up squad (I don't think the squad has a squad board yet).
Agreed @eecavanna. I'll reassign this ticket to myself since I'm working on "rollup".
@sujaypatil96 assuming you're still actively working on this issue? I'll move to the new sprint. Let me know if you're not planning to work on it.
@ssarrafan yup, i'm pushing up a draft PR for this issue just now.
@sujaypatil96 it looks like you still need to review #608 so I'll move this to the new sprint. Let me know if you won't be working on it.
@sujaypatil96 @eecavanna @shreddd This endpoint on dev doesn't allow for any of the new arguments that were in the original request. If find_data_objects_for_study_data_objects_study__study_id__get was updated on the backend to use alldocs but no other arguments are allowed, this needs to be reopened.
This Issue (task) came out of today's metadata squad meeting (Wednesday, November 22, 2023).
Based on the first bullet point in https://github.com/microbiomedata/nmdc-runtime/issues/355...
...I'm envisioning an API endpoint that accepts a `Study.id` value and a `DataObject.data_object_type` value, and returns (i.e. responds with) a JSON array of all the `DataObject`s that are associated with that study and have that specific `data_object_type` value.

`DataObject` documentation:

Here's an example of a "long" query team members have run; not necessarily related to this Issue other than to serve as an example of a long query, which is something endpoints like this one could save people from having to write themselves: