microbiomedata / nmdc-runtime

Runtime system for NMDC data management and orchestration
https://microbiomedata.github.io/nmdc-runtime/

add aggregation commands to queries:run endpoint #447

Closed PeopleMakeCulture closed 9 months ago

PeopleMakeCulture commented 9 months ago

Split from: https://github.com/microbiomedata/issues/issues/496

From @aclum:

Allowing aggregation commands, either through queries:run or a new endpoint, would be very helpful towards this milestone and as an interim solution until we have graph-based endpoints that do the traversing for users.

As things stand now you have to 1) download collections, 2) figure out the next field you want to query, then 3) run another API query.

This came up while discussing how to get some of the annotation files and proteomics raw data that come from a matched biosample; the combination of the two is required as the input to the proteomics pipeline.

A typical starting identifier would be a biosample_set id, an omics_processing_set id where omics_type.has_raw_value=Proteomics, or a metaproteomics_analysis_activity_set id. Michal wrote a bit of Python code to do this, but it would be nice to be able to do it in a single API query.
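To illustrate the kind of single query being asked for, here is a minimal sketch of an aggregation command that starts from one biosample id and joins outward with `$lookup`. The linking fields (`has_input`, `was_informed_by`), the output field names, the helper function names, and the placeholder id are all assumptions made for illustration; they should be checked against the NMDC schema before use.

```python
import json


def build_proteomics_pipeline(biosample_id):
    """Build a pipeline that walks from one biosample to its proteomics
    runs and their analysis activities.

    ASSUMPTION: `has_input` links omics_processing_set documents to
    biosample ids, and `was_informed_by` links analysis activities to
    omics processing ids. Both field names are illustrative.
    """
    return [
        # Start from a single biosample.
        {"$match": {"id": biosample_id}},
        # Join the omics processing records that used this biosample as input.
        {"$lookup": {
            "from": "omics_processing_set",
            "localField": "id",
            "foreignField": "has_input",
            "as": "omics_processing",
        }},
        # One output document per joined omics processing record.
        {"$unwind": "$omics_processing"},
        # Keep only proteomics runs.
        {"$match": {"omics_processing.omics_type.has_raw_value": "Proteomics"}},
        # Join the metaproteomics analysis activities for each run.
        {"$lookup": {
            "from": "metaproteomics_analysis_activity_set",
            "localField": "omics_processing.id",
            "foreignField": "was_informed_by",
            "as": "analysis_activities",
        }},
    ]


def build_aggregate_command(biosample_id, batch_size=25):
    """Wrap the pipeline in a MongoDB `aggregate` command document."""
    return {
        "aggregate": "biosample_set",
        "pipeline": build_proteomics_pipeline(biosample_id),
        "cursor": {"batchSize": batch_size},
    }


if __name__ == "__main__":
    # "nmdc:bsm-example" is a placeholder id, not a real record.
    print(json.dumps(build_aggregate_command("nmdc:bsm-example"), indent=2))
```

A command document of this shape replaces the download, inspect, and re-query loop described above with a single round trip, provided the endpoint accepts `aggregate` commands.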

From @dwinston:

There is some silliness re: mongo find command cursors not being valid for the mongo getMore command, but for some reason aggregate command cursors work fine. See below. So, I think we can allow aggregation commands with proper paging via POST /queries:run.

# `mdb` is a pymongo Database handle.
rv = mdb.command({"aggregate": "biosample_set", "pipeline": [{"$match": {}}], "cursor": {"batchSize": 10}})
rv = mdb.command({"getMore": rv["cursor"]["id"], "collection": "biosample_set", "batchSize": 10})
# etc., until cursor id is `0`.
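
The aggregate/getMore pattern above can be wrapped in a small generator. This is only a sketch: `iter_aggregate` is a hypothetical helper name, `db` is assumed to be a pymongo `Database` (or any object with a compatible `command` method), and the loop relies on the server-side cursor staying alive between calls, which is not guaranteed outside a single client session.

```python
def iter_aggregate(db, collection, pipeline, batch_size=10):
    """Yield every document produced by an aggregation, batch by batch."""
    reply = db.command({
        "aggregate": collection,
        "pipeline": pipeline,
        "cursor": {"batchSize": batch_size},
    })
    cursor = reply["cursor"]
    # The first reply carries its documents under "firstBatch".
    yield from cursor["firstBatch"]
    # A cursor id of 0 means the server has no further batches.
    while cursor["id"] != 0:
        reply = db.command({
            "getMore": cursor["id"],
            "collection": collection,
            "batchSize": batch_size,
        })
        cursor = reply["cursor"]
        # Follow-up replies carry their documents under "nextBatch".
        yield from cursor["nextBatch"]
```

For example, `for doc in iter_aggregate(mdb, "biosample_set", [{"$match": {}}]): ...` would stream the whole collection ten documents at a time.
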
PeopleMakeCulture commented 9 months ago

@aclum We have started developing this endpoint and can currently return the results of one aggregate command. That means it can return up to 16MB of results. Would it be helpful for you to have access to this interim-stage queries:run endpoint for now, while we build out the paging functionality?

PeopleMakeCulture commented 9 months ago

Link to relevant PR: https://github.com/microbiomedata/nmdc-runtime/compare/422-add-aggregation-command

dwinston commented 9 months ago

So, it turns out that cursors for aggregate commands don't persist either. My best guess at why it worked for me via pymongo in a Python shell session is that pymongo still starts an implicit session for commands, even though I thought that was discontinued.

The approach I think we'll take for this now is:

1) Append an $out stage to the user-supplied aggregation pipeline, to send results to a temporary mongo collection.
2) Call nmdc_runtime.api.endpoints.util.find_resources to use our custom cursor functionality that is currently in service for the find endpoints, so that one can retrieve all aggregation results if they exceed 16MB (the MongoDB BSON document size limit).
3) Ensure the temporary collection is cleaned up (e.g. via a dagster schedule).
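
Steps 1 and 3 of this plan could look roughly like the sketch below. Everything named here is hypothetical: the `tmp_agg_` prefix, the idea of encoding the creation time in the collection name, and both helper functions are assumptions for illustration, not the nmdc-runtime implementation. Step 2 (paging through `find_resources`) is not reproduced.

```python
import time
import uuid

TEMP_PREFIX = "tmp_agg_"  # hypothetical naming convention for temp collections


def append_out_stage(pipeline, now=None):
    """Return (new_pipeline, temp_collection_name) without mutating the input.

    Rejects pipelines that already write somewhere, so the user cannot
    redirect output to an arbitrary collection.
    """
    if any("$out" in stage or "$merge" in stage for stage in pipeline):
        raise ValueError("pipeline must not contain its own $out or $merge stage")
    created = int(now if now is not None else time.time())
    # Creation time goes into the name so cleanup needs no extra bookkeeping.
    name = f"{TEMP_PREFIX}{created}_{uuid.uuid4().hex}"
    return [*pipeline, {"$out": name}], name


def expired_temp_collections(collection_names, max_age_seconds, now=None):
    """Pick out temp collections older than `max_age_seconds`."""
    current = now if now is not None else time.time()
    expired = []
    for name in collection_names:
        if not name.startswith(TEMP_PREFIX):
            continue  # never touch regular collections
        created_text = name[len(TEMP_PREFIX):].split("_", 1)[0]
        if created_text.isdigit() and current - int(created_text) > max_age_seconds:
            expired.append(name)
    return expired
```

With a pymongo `Database` named `db`, a scheduled cleanup job could then run `for name in expired_temp_collections(db.list_collection_names(), 3600): db.drop_collection(name)`. Keeping the selection logic separate from the drop call makes it easy to test without a live database.
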

aclum commented 9 months ago

Yes, that would be useful.

PeopleMakeCulture commented 9 months ago

New ticket to extend aggregate queries:run with paging: https://github.com/microbiomedata/nmdc-runtime/issues/460