Closed eecavanna closed 2 months ago
Alternatively, mock up an API endpoint with the following inputs and outputs:

Inputs:
- `Study.id` (string)
- `WorkflowExecutionActivity.type` (string)

Output:
- `DataObject`s that are referenced by `WorkflowExecutionActivity.has_output`

The API endpoint would find all the `WorkflowExecutionActivity`s that have the specified type and are associated with the specified Study; then find the `DataObject`s referenced by those `WorkflowExecutionActivity`s.
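A minimal sketch of the lookup such an endpoint would perform, over in-memory stand-ins for the Mongo collections. The `part_of_study` field here is a hypothetical direct link; discovering the real association path from a Study to an activity is exactly the open question in this thread.

```python
def find_data_objects(study_id, activity_type, activities, data_objects):
    """Return the DataObject dicts referenced by `has_output` of every
    WorkflowExecutionActivity of `activity_type` tied to `study_id`.

    `activities` and `data_objects` are lists of dicts standing in for
    the Mongo collections; `part_of_study` is a hypothetical shortcut,
    not a real NMDC schema slot.
    """
    referenced = set()
    for act in activities:
        if act.get("type") == activity_type and act.get("part_of_study") == study_id:
            referenced.update(act.get("has_output", []))
    return [d for d in data_objects if d["id"] in referenced]
```

In the real Runtime this would be a Mongo query (or aggregation) per collection rather than a Python scan, but the shape of the join is the same.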
Reference:
Spun from #355
📓 FYI - I created this issue as a reminder to myself to mock up an endpoint that does what is described, so I could demo it to a teammate as a representative "endpoint that can abstract away a tedious query (or queries)". It is not the case that I, personally, find myself doing this particular query often.
Is this going to be worked on this month?
Certainly can be. What would help immensely is knowing all of the paths from study to data object. I suppose this could be done via a graph algorithm on a graph representation of the schema.
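One way to enumerate those paths, sketched here with a breadth-first search over a toy adjacency map standing in for the schema (the class and slot names in the test are illustrative, not the real NMDC schema):

```python
from collections import deque

def find_paths(schema_graph, start, goal):
    """Enumerate all simple paths from `start` to `goal` in a schema graph
    given as {class_name: {slot_name: range_class_name}} -- a toy stand-in
    for a graph derived from the LinkML schema."""
    paths = []
    queue = deque([(start, [start])])
    while queue:
        node, path = queue.popleft()
        for slot, target in schema_graph.get(node, {}).items():
            if target == goal:
                paths.append(path + [f"--{slot}-->", target])
            elif target not in path:  # avoid cycles
                queue.append((target, path + [f"--{slot}-->", target]))
    return paths
```

Because the paths come from the schema rather than a hand-maintained list, new classes added by the Berkeley refactor would be picked up automatically.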
@turbomam any ideas on how we could best do this?
I support using a graph representation of the schema; the Berkeley refactor will add a number of new paths, so this seems more robust than documenting the existing paths and having to add new ones in just a few weeks' time.
@dwinston @aclum let me know if this should be added to an NMDC or Infra sync meeting for discussion.
@ssarrafan please add to the infra sync meeting this week.
Let's try and find a time when @dwinston can join us - 12-1p Pacific on Friday 2/9 could work.
Proposal from @dwinston on Slack:
I think it's best to use a dedicated RDF system (Apache Jena Fuseki), deployed as a separate container, much in the same way the data portal uses a dedicated SQL system (Postgres) derived from Mongo, for its query needs.
This way, we don't have to develop and maintain a bespoke graph representation model and query language built on top of the mongo aggregation framework, which I think would slow agility. I have focused thus far on Apache Jena, specifically their TDB storage solution and Fuseki server, based in part on Wikidata's recent comparative evaluation of RDF backends and on some personal communication.
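To make the appeal of the RDF approach concrete: once the documents are projected to triples, a multi-hop traversal collapses into a single SPARQL property path. The sketch below just builds such a query string; the `nmdc:` prefix URI and slot names are illustrative assumptions, not the actual NMDC ontology terms.

```python
def sparql_for_path(study_id, slots):
    """Build a SPARQL query that follows a property path (a sequence of
    slot names) from a study node to data objects. The prefix and slot
    URIs are illustrative, not real NMDC terms."""
    path = "/".join(f"nmdc:{slot}" for slot in slots)
    return (
        "PREFIX nmdc: <https://w3id.org/nmdc/>\n"
        "SELECT ?dataObject WHERE {\n"
        f"  nmdc:{study_id} {path} ?dataObject .\n"
        "}"
    )
```

The equivalent Mongo aggregation would need one `$lookup` stage per hop, which is the bespoke-query-language burden the proposal is trying to avoid.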
I agree about using a dedicated RDF system, although it's not easy to do every kind of imaginable graph operation over RDF data.
I think the Jena ecosystem is a good choice for something embedded.
I hope that this particular issue and the solutions that have been shared so far can be part of an overall discussion about
I have deployed some resources that do many of those things well, but I've only done it in ways that are convenient for me and the few people I discuss this with most often.
I appreciate the fact that @PeopleMakeCulture and @dwinston are bringing their high level of professionalism to this
@shreddd @dwinston @PeopleMakeCulture @aclum This was on track to being complete by this month. Can someone give me a status update and close this ticket if done? Or add next steps and anticipated timeline to complete if not done? I could still use the list of accomplishments either way for the DOE quarterly report please.
Status:
@aclum please correct anything I got wrong here
Yes, that is correct.
I moved this ticket back into the "On base" state because I think it was temporarily closed by accident. That closure caused github-project-automation to change its state from "On base" to "Scored", and reopening the ticket didn't automatically change it back to "On base". ⚾
Moving to next sprint because I don't expect it to be done tomorrow (I didn't work on it this sprint).
@sujaypatil96 I can work on this with you on the 6/7 hackathon if you want
Copying comments from @sujaypatil96 for additional context:
As part of the efforts in the NCBI Export squad, one of the requirements that has come up is the need to be able to retrieve DataObjects (ids and URLs) given a Biosample id. Ideally, this would be a specific case of NMDC Database roll-up, but since we don't have the "machinery" for that just yet, we will need to implement something custom for this use case in the meantime.
The code for the NCBI Export squad is being developed in PR https://github.com/microbiomedata/nmdc-runtime/pull/518
The two cases that we need to cover are:
- The given Biosample id may be a direct input (via the has_input key) to an OmicsProcessing record, whose output (via the has_output key) will be one or two DataObject ids, and we need to retrieve the DataObject records for those ids, or
- The given Biosample id may be an input to a lab processing record (Pooling, Extraction, LibraryPreparation), whose output (via the has_output key) will be a ProcessedSample, and that ProcessedSample will be an input (via the has_input key) to an OmicsProcessing record
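The two cases above can be sketched as a traversal over an `alldocs`-style list of documents. The `type` values and key names below mirror the classes mentioned in this thread, but treat the exact field shapes as assumptions.

```python
def data_objects_for_biosample(biosample_id, docs):
    """Find DataObject docs for a Biosample via the two cases described
    above. `docs` is a list of dicts standing in for the `alldocs`
    collection, each with `id`, `type`, and optional `has_input` /
    `has_output` lists of ids."""
    def outputs_of(pred):
        out = []
        for d in docs:
            if pred(d):
                out.extend(d.get("has_output", []))
        return out

    # Case 1: Biosample is a direct input to an OmicsProcessing record.
    direct = outputs_of(lambda d: d.get("type") == "OmicsProcessing"
                        and biosample_id in d.get("has_input", []))

    # Case 2: Biosample -> lab processing -> ProcessedSample -> OmicsProcessing.
    lab_types = {"Pooling", "Extraction", "LibraryPreparation"}
    processed = outputs_of(lambda d: d.get("type") in lab_types
                           and biosample_id in d.get("has_input", []))
    indirect = outputs_of(lambda d: d.get("type") == "OmicsProcessing"
                          and any(p in d.get("has_input", []) for p in processed))

    wanted = set(direct + indirect)
    return [d for d in docs if d.get("type") == "DataObject" and d["id"] in wanted]
```

A real implementation would push these lookups into Mongo queries rather than scanning in Python, but the hop structure is the same.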
Implementation details:
- We can develop this either as an API endpoint or just as an @op and use it in code. Which would be better?
- We can use the get_mongo_db() method or the mongo resource. Which would be better?
- I'm also thinking that the method I implement will iterate over all the records in the alldocs collection
This ticket was originally about mocking up a specific endpoint to demonstrate how people could delegate the graph traversal to the Runtime instead of having to write complex Mongo queries. I originally assigned it to myself.
The scope of the task seems to me to have grown into: implementing a real (not mock-up) endpoint that can traverse the graph and will be able to accommodate changes to the schema/graph over time.
If this ticket represents the task of implementing a "roll up" endpoint (where "roll up" is a word the referential integrity/roll up squad members have been using lately), I'd propose updating its title and maybe reassigning it to @sujaypatil96 and putting it on the squad board for the referential integrity/roll up squad (I don't think the squad has a squad board yet).
Agreed @eecavanna. I'll reassign this ticket to myself since I'm working on "rollup".
@sujaypatil96 assuming you're still actively working on this issue? I'll move to the new sprint. Let me know if you're not planning to work on it.
@ssarrafan yup, i'm pushing up a draft PR for this issue just now.
@sujaypatil96 it looks like you still need to review #608 so I'll move this to the new sprint. Let me know if you won't be working on it.
@sujaypatil96 @eecavanna @shreddd This endpoint on dev doesn't allow for any of the new arguments that were in the original request. If find_data_objects_for_study_data_objects_study__study_id__get was updated on the backend to use alldocs but no other arguments are allowed, this needs to be reopened.
This Issue (task) came out of today's metadata squad meeting (Wednesday, November 22, 2023).
Based on the first bullet point in https://github.com/microbiomedata/nmdc-runtime/issues/355...
...I'm envisioning an API endpoint that accepts a `Study.id` value and a `DataObject.data_object_type` value, and returns (i.e. responds with) a JSON array of all the `DataObject`s that are associated with that study and have that specific `data_object_type` value.

`DataObject` documentation:

Here's an example of a "long" query team members have run; not necessarily related to this Issue other than to serve as an example of a long query, which is something endpoints like this one could save people from having to write themselves: