Milestone - Add support for all queries available in the data portal available via the public API (4.8)

ssarrafan commented 9 months ago

Programmatic access to NMDC data for broad community use

Web APIs enable programmatic access to data and computation in a scalable, automated fashion. This allows users to orchestrate data access into scriptable tools, and is key to enabling complex, repeatable interactions with the data. They also enable integrations with other web services.

We will enable a public web API for the NMDC to allow broad access to our data for programmatic and automated access (Data Portal, Milestone 4.7 & 4.10). The NMDC public API will provide a path to interact with the NMDC central metadata store. The API will be focused on data access including search, query, download, and bulk data delivery. It will provide an access point for querying and referencing data through persistent identifiers, which will enable integrations with workflow pipelines and other services outside the NMDC such as KBase.

This will allow for more complex interactions such as custom multi-step queries and reproducible data analyses in the form of interactive scripts or notebooks. For example, a researcher interested in exploring nitrogen cycling processes across biomes and available multi-omics data could leverage this programmatic approach requiring multiple lookup operations to retrieve the complete information about the studies, samples, and processed data available within the NMDC. We anticipate that these types of complex, cross-study searches can be refined and further developed through our user research activities to ensure that we are tackling and prioritizing the right features in our API (Data Portal, Milestone 4.9).

(couldn't find 4.8 in proposal)

Pages 35-36

aclum commented 8 months ago

We have a number of useful collection endpoints that provide queries matching those available in the data portal, but these primarily use a single collection (e.g. the biosample endpoint, the study endpoint, collection_name).

We need to add support for combining collections (e.g. study X where the processing institution is Y) and make bulk download with a filter applied easier (e.g. all filtered read fastq data from study X).
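For illustration only, the "study X where the processing institution is Y" case could be expressed as a MongoDB aggregation pipeline that joins two collections. This is a rough sketch, not the runtime's implementation: the helper name is made up, and the field names `part_of`, `processing_institution`, and `id` are assumptions about the schema that should be checked before use.

```python
def study_by_institution_pipeline(study_id, institution):
    """Build a pipeline meant to run against omics_processing_set.

    Hypothetical helper. The field names part_of, processing_institution
    and id are assumed, not confirmed by this thread.
    """
    return [
        # Keep only processing records for this study and institution.
        {"$match": {"part_of": study_id, "processing_institution": institution}},
        # Join in the matching study documents from a second collection.
        {"$lookup": {
            "from": "study_set",
            "localField": "part_of",
            "foreignField": "id",
            "as": "study",
        }},
    ]


# Placeholder values, not real identifiers.
pipeline = study_by_institution_pipeline("study-X", "institution-Y")
```

The point of the sketch is that the filter and the join live in one query, instead of two separate single-collection requests.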

aclum commented 5 months ago

@dwinston is there a way to pass an aggregation query to queries:run or another endpoint?

aclum commented 5 months ago

see https://github.com/microbiomedata/nmdc-runtime/issues/401

dwinston commented 5 months ago

> @dwinston is there a way to pass an aggregation query to queries:run or another endpoint?

@aclum not currently, but allowing aggregation commands to queries:run wouldn't be too hard of a lift.

aclum commented 5 months ago

@dwinston Allowing aggregation commands, either through queries:run or a new endpoint, would be very helpful towards this milestone and as an interim solution until we can have graph-based endpoints to do the traversing for users. As things stand now, you have to download collections, figure out the next field you want to query, then run another API query. This came up when discussing how to get the annotation files and proteomics raw data that come from a matched biosample, the combination of which is required as the input to the proteomics pipeline. A typical starting identifier would be a biosample_set id, an omics_processing_set id where omics_type.has_raw_value=Proteomics, or a metaproteomics_analysis_activity_set id. Michal wrote a bit of Python code to do this, but it would be nice to be able to do this in a single API query. cc @shreddd for awareness.
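As a sketch of what a single-query version might look like, the snippet below builds a MongoDB `aggregate` command document that starts from a biosample id and pulls in the raw data objects of its proteomics processing records. Everything beyond `omics_processing_set` and `omics_type.has_raw_value` is an assumption: the helper name is hypothetical, the fields `has_input`, `has_output`, and `id` and the `data_object_set` collection are guesses at the schema, and whether queries:run would accept this body is exactly what is being asked for here.

```python
import json


def proteomics_inputs_command(biosample_id, batch_size=25):
    """Build an aggregate command document (hypothetical helper).

    Assumed schema: omics_processing_set.has_input holds biosample ids,
    omics_processing_set.has_output holds data object ids.
    """
    return {
        "aggregate": "omics_processing_set",
        "pipeline": [
            # Proteomics processing records that used this biosample.
            {"$match": {
                "has_input": biosample_id,
                "omics_type.has_raw_value": "Proteomics",
            }},
            # Attach the raw data objects those records produced.
            {"$lookup": {
                "from": "data_object_set",
                "localField": "has_output",
                "foreignField": "id",
                "as": "raw_data",
            }},
        ],
        "cursor": {"batchSize": batch_size},
    }


# Placeholder id; the JSON string is what a client would send as a request body.
body = json.dumps(proteomics_inputs_command("biosample-id-placeholder"))
```

This replaces the download, inspect, re-query loop with one server-side traversal, which is the interim behaviour described above.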

dwinston commented 5 months ago

There is some silliness re: mongo find command cursors not being valid for the mongo getMore command, but for some reason aggregate command cursors work fine. See below. So, I think we can allow aggregation commands with proper paging via POST /queries:run.

@PeopleMakeCulture want to take a stab at a draft PR for this?

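# `mdb` is assumed to be an already-connected pymongo Database handle.
rv = mdb.command("aggregate", "biosample_set", pipeline=[{"$match": {}}], cursor={"batchSize": 10})
# Fetch the next batch using the cursor id returned by the aggregate command.
rv = mdb.command({"getMore": rv['cursor']['id'], "collection": "biosample_set", "batchSize": 10})
# etc., until cursor id is `0`.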
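
The "etc." step can be spelled out as a loop. The following is a minimal sketch, assuming `mdb` is a pymongo `Database` (or anything with a compatible `command` method); the function name `iter_aggregate` is made up for illustration and is not part of nmdc-runtime.

```python
def iter_aggregate(mdb, collection, pipeline, batch_size=10):
    """Yield every document from an aggregate command, paging with getMore."""
    rv = mdb.command(
        "aggregate", collection, pipeline=pipeline,
        cursor={"batchSize": batch_size},
    )
    cursor = rv["cursor"]
    # The aggregate reply carries the first page under "firstBatch".
    yield from cursor["firstBatch"]
    # A cursor id of 0 means the server has no more batches to return.
    while cursor["id"] != 0:
        rv = mdb.command({
            "getMore": cursor["id"],
            "collection": collection,
            "batchSize": batch_size,
        })
        cursor = rv["cursor"]
        # getMore replies carry each later page under "nextBatch".
        yield from cursor["nextBatch"]
```

An endpoint would more likely return one page plus the cursor id to the client rather than drain the cursor server-side; the loop only shows the paging contract.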

PeopleMakeCulture commented 5 months ago

Splitting this request to microbiomedata/nmdc-runtime#447 Add Aggregation Commands to Queries:Run Endpoint

ssarrafan commented 4 months ago

@aclum any update on this milestone? Can this be completed by the end of March 2024?

PeopleMakeCulture commented 4 months ago

@ssarrafan Issue 401 in nmdc-runtime most closely tracks progress against this milestone.

@dwinston does that timeline sound reasonable to you?

aclum commented 4 months ago

I will make a new ticket for this, but what I'd like to see here to mark this complete is 1) a public endpoint that allows aggregation (currently this can only be done through queries:run, which requires an NMDC login) and 2) https://github.com/microbiomedata/nmdc-runtime/issues/401, which is already in progress.

cc @shreddd

dwinston commented 4 months ago

> Can this be completed by the end of March 2024?

Yes.

ssarrafan commented 3 months ago

@aclum is this milestone done? Can I get an update/accomplishments for the DOE quarterly please? It's due this quarter. Thanks!

aclum commented 3 months ago

We marked this milestone done because of the added support for aggregation queries, https://github.com/microbiomedata/nmdc-runtime/issues/482

microbiomedata / issues

Milestone - Add support for all queries available in the data portal available via the public API (4.8) #496