AlanSimmons commented 1 year ago

Request

Provide a set of parameterized endpoints that simplify queries of HuBMAP/SenNet data. These endpoints would provide a layer of abstraction for "convenience searches" that would allow users to query data without having to construct ElasticSearch DSL queries.

Background

As the README for search-api states, search-api is a "thin wrapper of ElasticSearch." The current search (and _search_byindex) endpoints allow the execution of queries against HuBMAP ElasticSearch indexes.

Queries beyond simple searches on things like assay names require specifying search parameters using ElasticSearch DSL--e.g.,

{
  "query": {
    "bool": {
      "must": [
        {
          "match_phrase": {
            "donor.group_name": "Vanderbilt TMC"
          }
        }
      ],
      "filter": [
        {
          "match": {
            "origin_sample.entity_type": "Sample"
          }
        }
      ]
    }
  }
}

We think that many consumers of the search-api might find the requirement to describe searches with DSL onerous. We also think that consumers will expect "more RESTful" endpoints that accept lists of search parameters.

Solution

Develop endpoints in the format https://search.api.hubmapconsortium.org/<search entity>?<list of index attributes>

Search entities: There could be endpoints for a subset of the available search indexes, such as:

dataset
sample
donor
collection
file
antibody

The parameters would correspond to attributes on the search indexes.

For example, an endpoint that returned all CODEX datasets for heart samples might look like:

https://search.api.hubmapconsortium.org/dataset?organ=HT&data_type=codex
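
As a rough illustration of the translation such an endpoint would perform, here is a minimal Python sketch that turns the URL above into a DSL payload. Everything in it is an assumption rather than existing search-api code: the function name url_to_dsl is hypothetical, it assumes the parameter names (organ, data_type) match index attribute names exactly, and it assumes documents carry an entity_type attribute that can be filtered on.

```python
from urllib.parse import parse_qs, urlsplit


def url_to_dsl(url: str) -> dict:
    """Translate a convenience-search URL into an ElasticSearch DSL payload (sketch)."""
    parts = urlsplit(url)
    entity = parts.path.strip("/")   # e.g. "dataset"
    params = parse_qs(parts.query)   # e.g. {"organ": ["HT"], "data_type": ["codex"]}

    # One match_phrase clause per parameter value.
    # Assumption: each parameter name is also the index attribute name.
    must = [
        {"match_phrase": {attribute: value}}
        for attribute, values in params.items()
        for value in values
    ]
    return {
        "query": {
            "bool": {
                "must": must,
                # Assumption: the index has an `entity_type` attribute.
                "filter": [{"match": {"entity_type": entity.capitalize()}}],
            }
        }
    }


dsl = url_to_dsl(
    "https://search.api.hubmapconsortium.org/dataset?organ=HT&data_type=codex"
)
```

The resulting payload has the same bool/must/filter shape as the DSL example in the Background section, so it could be handed to the existing search endpoint unchanged.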

It should be possible to filter the response to a subset of attributes instead of full documents--e.g., https://search.api.hubmapconsortium.org/dataset?organ=HT&data_type=codex&returned_attributes=group_name%2Cdonor%2Chubmap_id%2Cuuid%2Cimmediate_ancestors

The default return (if returned_attributes is not specified) would include the entire document.
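
One way this could work is ElasticSearch source filtering: a search body may include a _source list naming the fields to return. The sketch below is only an assumption about how returned_attributes might be mapped onto it; the function name apply_returned_attributes is hypothetical, and it expects the value after URL decoding (so %2C has already become a comma).

```python
from typing import Optional


def apply_returned_attributes(dsl: dict, returned_attributes: Optional[str]) -> dict:
    """Return a copy of the DSL payload limited to the requested attributes (sketch)."""
    body = dict(dsl)  # shallow copy so the caller's payload is not modified
    if returned_attributes:
        # Split the comma-separated list and drop empty entries.
        body["_source"] = [
            name.strip() for name in returned_attributes.split(",") if name.strip()
        ]
    # With no returned_attributes, `_source` is left unset and the
    # entire document is returned, matching the default described above.
    return body


query = {"query": {"match_all": {}}}
filtered = apply_returned_attributes(
    query, "group_name,donor,hubmap_id,uuid,immediate_ancestors"
)
unfiltered = apply_returned_attributes(query, None)
```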

It may also be convenient to provide "collections" of return attributes that are requested together. For example, if the attributes in the prior example were often requested, the attributes could be part of a collection named collection1 and the call could be https://search.api.hubmapconsortium.org/dataset?organ=HT&data_type=codex&returned_attribute_collection=collection1
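
A named collection could be as simple as a lookup table. The sketch below is hypothetical throughout: the ATTRIBUTE_COLLECTIONS registry and resolve_returned_attributes function are invented for illustration, and letting an explicit returned_attributes list take precedence over a collection is one possible design choice, not a decided behavior.

```python
from typing import Optional

# Hypothetical registry; collection1 holds the attributes from the prior example.
ATTRIBUTE_COLLECTIONS = {
    "collection1": ["group_name", "donor", "hubmap_id", "uuid", "immediate_ancestors"],
}


def resolve_returned_attributes(params: dict) -> Optional[list]:
    """Work out which attributes to return, or None for the entire document (sketch)."""
    explicit = params.get("returned_attributes")
    if explicit:
        # Assumption: an explicit list wins over a named collection.
        return [name.strip() for name in explicit.split(",") if name.strip()]
    collection = params.get("returned_attribute_collection")
    if collection is None:
        return None
    if collection not in ATTRIBUTE_COLLECTIONS:
        raise ValueError(f"unknown attribute collection: {collection}")
    return list(ATTRIBUTE_COLLECTIONS[collection])


attributes = resolve_returned_attributes(
    {"organ": "HT", "data_type": "codex", "returned_attribute_collection": "collection1"}
)
```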

Known, high-level tasks

Analysis: characterize indexes in terms of attributes to build list of parameters to use
Build endpoints that translate index/attribute combinations into DSL payloads
Publish user guide (SmartAPI might be enough)

Notes

This enhancement applies to both HuBMAP and SenNet, which share (or at least copy) a code base. It should be possible to configure the search-api so that it works in the appropriate application context.
This feature will need to accommodate both public and private searches. This could be as simple as using the existing authorization mechanism.

AlanSimmons commented 1 year ago

Analysis and Endpoint Use Cases

search-api DSL to parameter abstraction.xlsx

In Google Drive: https://docs.google.com/spreadsheets/d/1EpQVREOr33-5mh3LJ8-ZFmmkJquDni9zShQB0r3vqr8/edit?usp=sharing

Attributes

The attached document contains information about the attributes of the consortium-level HuBMAP ElasticSearch indexes:

hm_dev_consortium_entities
hm_antibodies
hm_dev_consortium_files

The ElasticSearch index attributes correspond to values in the return from the search-api, flattened by dot notation. The response JSON nests up to 4 levels. For example, the attribute ancestors.metadata.cold_ischemia_time_unit corresponds to a path in the response similar to:

{
...
    {"ancestors":[
           {"metadata": [
               "cold_ischemia_time_unit": ...
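
To make the dot-notation flattening concrete, here is a small Python sketch that derives attribute names from a nested response. It is an illustration only: the function name attribute_paths is hypothetical, the sample document and its values are made up, and treating list elements as sharing their parent's path (no index in the name) is an assumption about how the index attributes are named.

```python
def attribute_paths(node, prefix=""):
    """Collect dot-notation paths to every leaf value in a nested document (sketch)."""
    if isinstance(node, dict):
        paths = set()
        for key, value in node.items():
            child = f"{prefix}.{key}" if prefix else key
            paths |= attribute_paths(value, child)
        return paths
    if isinstance(node, list):
        # Assumption: list items do not add a path segment.
        paths = set()
        for item in node:
            paths |= attribute_paths(item, prefix)
        return paths
    # Leaf value: the accumulated prefix is the attribute name.
    return {prefix} if prefix else set()


# Made-up document; only the key names follow the example above.
document = {
    "uuid": "example-uuid",
    "ancestors": [
        {"metadata": {"cold_ischemia_time_unit": "example-unit"}},
    ],
}
paths = sorted(attribute_paths(document))
```

For this sample, paths contains ancestors.metadata.cold_ischemia_time_unit and uuid.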

If a document is associated with a member of an Entity Provenance Hierarchy (e.g., donor>sample>dataset), the index will include information that helps to locate the document in the hierarchy.

The Entity Provenance ontology organizes information in ways that include:

Donor -> Sample -> Dataset
Collection -> Dataset

Entity Provenance elements are related to one another through ancestor and descendant relationships.

Elements can contain other elements of the same entity type hierarchically, to represent division or derivation--e.g.,

A Sample of type organ can be the ancestor of a Sample of type organ_piece.
A primary Dataset can be the ancestor of a derived Dataset entity.

The introduction tab of the spreadsheet describes the contents of the rest of the document.

Endpoint Use Cases

The spreadsheet identifies 8 use cases that could be satisfied with new endpoints in the search-api.

shirey commented 1 year ago

@AlanSimmons We need to define, if possible, a way to do paging as there are limits to how much data we can return at one time (both from Elasticsearch and RESTful responses)
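
One possible approach, sketched below in Python, is to map page-style parameters onto the from and size fields of the ElasticSearch search body. The function name add_paging and the page/page_size parameter names are hypothetical. The 10,000 cap reflects ElasticSearch's default index.max_result_window; whether the HuBMAP indexes keep that default is an assumption.

```python
MAX_RESULT_WINDOW = 10_000  # ElasticSearch default for index.max_result_window


def add_paging(dsl: dict, page: int = 1, page_size: int = 100) -> dict:
    """Return a copy of the DSL payload with from/size set for the given page (sketch)."""
    if page < 1 or page_size < 1:
        raise ValueError("page and page_size must be positive")
    start = (page - 1) * page_size
    if start + page_size > MAX_RESULT_WINDOW:
        # from + size may not exceed the result window; deeper paging
        # would need a different mechanism such as search_after.
        raise ValueError("requested page is beyond the result window")
    body = dict(dsl)  # shallow copy so the caller's payload is not modified
    body["from"] = start
    body["size"] = page_size
    return body


paged = add_paging({"query": {"match_all": {}}}, page=3, page_size=50)
```

Here paged asks for 50 documents starting at offset 100.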

hubmapconsortium / search-api

EPIC - search-api-endpoint enhancements #590