hubmapconsortium / search-api

HuBMAP search service and associated pieces to create an index
https://search.api.hubmapconsortium.org
MIT License

EPIC - search-api-endpoint enhancements #590

Open AlanSimmons opened 1 year ago

AlanSimmons commented 1 year ago

Request

Provide a set of parameterized endpoints that simplify queries of HuBMAP/SenNet data. These endpoints would provide a layer of abstraction for "convenience searches," allowing users to query data without having to construct ElasticSearch DSL queries.

Background

As the README for search-api states, search-api is a "thin wrapper of ElasticSearch." The current search (and _search_byindex) endpoints allow the execution of queries against HuBMAP ElasticSearch indexes.

Queries beyond simple searches on things like assay names require specifying search parameters using ElasticSearch DSL--e.g.,

{
  "query": {
    "bool": {
      "must": [
        {
          "match_phrase": {
            "donor.group_name": "Vanderbilt TMC"
          }
        }
      ],
      "filter": [
        {
          "match": {
            "origin_sample.entity_type": "Sample"
          }
        }
      ]
    }
  }
} 

We think that many consumers of the search-api might find the requirement to describe searches with DSL onerous. We also think that consumers will expect "more RESTful" endpoints that accept lists of search parameters.

Solution

Develop endpoints in the format https://search.api.hubmapconsortium.org/{search entity}?{list of index attributes}

  1. Search entities: There could be endpoints for a subset of the available search indexes, such as:
  2. The parameters would correspond to attributes on the search indexes.

For example, an endpoint that returned all CODEX datasets for heart samples might look like:

https://search.api.hubmapconsortium.org/dataset?organ=HT&data_type=codex

  3. It should be possible to filter the response to a subset of attributes instead of full documents--e.g., https://search.api.hubmapconsortium.org/dataset?organ=HT&data_type=codex&returned_attributes=group_name%2Cdonor%2Chubmap_id%2Cuuid%2Cimmediate_ancestors

The default return (if returned_attributes is not specified) would include the entire document.

  4. It may also be convenient to provide "collections" of return attributes that are requested together. For example, if the attributes in the prior example were often requested, the attributes could be part of a collection named collection1 and the call could be https://search.api.hubmapconsortium.org/dataset?organ=HT&data_type=codex&returned_attribute_collection=collection1
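
The translation described above could be sketched roughly as follows. This is only an illustration, not the search-api implementation: the mapping of parameters to index attributes (`PARAM_TO_FIELD`), the top-level `entity_type` filter, the contents of `ATTRIBUTE_COLLECTIONS`, and the function name `url_to_dsl` are all assumptions made for the example. The one Elasticsearch feature it relies on is `_source` filtering in the request body, which limits the fields returned for each hit.

```python
from urllib.parse import parse_qs, urlsplit

# Hypothetical mapping from endpoint parameters to index attributes.
PARAM_TO_FIELD = {
    "organ": "origin_sample.organ",
    "data_type": "data_types",
}

# Hypothetical named collections of return attributes.
ATTRIBUTE_COLLECTIONS = {
    "collection1": ["group_name", "donor", "hubmap_id", "uuid", "immediate_ancestors"],
}

# Parameters that shape the response rather than filter documents.
RESPONSE_PARAMS = {"returned_attributes", "returned_attribute_collection"}


def url_to_dsl(url: str) -> dict:
    """Translate a convenience-search URL into an Elasticsearch DSL payload."""
    parts = urlsplit(url)
    entity = parts.path.strip("/")  # e.g. "dataset"
    params = parse_qs(parts.query)  # also decodes %2C to ","

    # Restrict results to the requested entity type (assumed field name),
    # then add one filter clause per search parameter.
    filters = [{"match": {"entity_type": entity.capitalize()}}]
    for name, values in params.items():
        if name in RESPONSE_PARAMS:
            continue
        if name not in PARAM_TO_FIELD:
            raise ValueError(f"unsupported search parameter: {name}")
        for value in values:
            filters.append({"match": {PARAM_TO_FIELD[name]: value}})

    dsl = {"query": {"bool": {"filter": filters}}}

    # Optionally limit each hit to a subset of attributes.
    if "returned_attributes" in params:
        dsl["_source"] = params["returned_attributes"][0].split(",")
    elif "returned_attribute_collection" in params:
        name = params["returned_attribute_collection"][0]
        dsl["_source"] = list(ATTRIBUTE_COLLECTIONS[name])
    return dsl
```

With this sketch, a call without `returned_attributes` or `returned_attribute_collection` produces no `_source` entry, so Elasticsearch returns the entire document, which matches the default described above.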

Known, high-level tasks

  1. Analysis: characterize indexes in terms of attributes to build list of parameters to use
  2. Build endpoints that translate index/attribute combinations into DSL payloads
  3. Publish user guide (SmartAPI might be enough)

Notes

  1. This enhancement applies to both HuBMAP and SenNet, which share (or at least copy) a code base. It should be possible to configure the search-api so that it works in the appropriate application context.
  2. This feature will need to accommodate both public and private searches. This could be as simple as using the existing authorization mechanism.

AlanSimmons commented 1 year ago

Analysis and Endpoint Use Cases

search-api DSL to parameter abstraction.xlsx

In Google Drive: https://docs.google.com/spreadsheets/d/1EpQVREOr33-5mh3LJ8-ZFmmkJquDni9zShQB0r3vqr8/edit?usp=sharing

Attributes

The attached document contains information on attributes for consortium-level HuBMAP ElasticSearch indexes:

The ElasticSearch index attributes correspond to values in the return from the search-api, flattened by dot notation. The response JSON nests up to 4 levels. For example, the attribute ancestors.metadata.cold_ischemia_time_unit corresponds to a path in the response similar to:

{
    ...
    "ancestors": [
        {
            "metadata": {
                "cold_ischemia_time_unit": ...

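The flattening described above could be sketched as below, for example as a starting point for listing candidate endpoint parameters. This is an assumption-laden illustration: `flatten_attributes` is a hypothetical helper, and the sample document and its values are made up. Arrays contribute no path segment of their own, which is why an entry under `ancestors` still yields `ancestors.metadata.cold_ischemia_time_unit`.

```python
def flatten_attributes(node, prefix=""):
    """Return the set of dot-notation attribute paths found in a document."""
    paths = set()
    if isinstance(node, dict):
        for key, value in node.items():
            path = f"{prefix}.{key}" if prefix else key
            paths |= flatten_attributes(value, path)
    elif isinstance(node, list):
        # Arrays do not add a path segment; every element shares the prefix.
        for item in node:
            paths |= flatten_attributes(item, prefix)
    else:
        # Scalar leaf: the accumulated prefix is the attribute name.
        paths.add(prefix)
    return paths


# Made-up document, shaped like the example path above.
doc = {
    "uuid": "example-uuid",
    "ancestors": [
        {"metadata": {"cold_ischemia_time_unit": "minutes"}},
    ],
}
print(sorted(flatten_attributes(doc)))
# ['ancestors.metadata.cold_ischemia_time_unit', 'uuid']
```
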
If a document is associated with a member of an Entity Provenance Hierarchy (e.g., donor>sample>dataset), the index will include information that helps to locate the document in the hierarchy.

The Entity Provenance ontology organizes information in ways that include:

  1. Donor -> Sample -> Dataset
  2. Collection -> Dataset

The Entity Provenance elements are related to one another through ancestor and descendant relationships.

Elements can contain other elements of the same entity type hierarchically, to represent division or derivation--e.g.,

  1. A Sample of type organ can be the ancestor of a Sample of type organ_piece.
  2. A primary Dataset can be the ancestor of a derived Dataset entity.
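
The ancestor relationships above can be pictured with a small model. The `Entity` class, its `subtype` field, and the identifiers below are hypothetical and exist only for the illustration. Only the name `immediate_ancestors` is taken from the attribute list earlier in this issue.

```python
from dataclasses import dataclass, field


@dataclass
class Entity:
    """Hypothetical, simplified provenance node."""
    entity_id: str
    entity_type: str   # "Donor", "Sample", or "Dataset"
    subtype: str = ""  # e.g. "organ", "organ_piece", "primary", "derived"
    immediate_ancestors: list = field(default_factory=list)


def all_ancestors(entity):
    """Collect every ancestor, nearest first, by walking immediate_ancestors."""
    found = []
    for parent in entity.immediate_ancestors:
        if parent not in found:
            found.append(parent)
        for ancestor in all_ancestors(parent):
            if ancestor not in found:
                found.append(ancestor)
    return found


donor = Entity("donor-1", "Donor")
organ = Entity("sample-1", "Sample", "organ", [donor])
piece = Entity("sample-2", "Sample", "organ_piece", [organ])
primary = Entity("dataset-1", "Dataset", "primary", [piece])
derived = Entity("dataset-2", "Dataset", "derived", [primary])

print([e.entity_id for e in all_ancestors(derived)])
# ['dataset-1', 'sample-2', 'sample-1', 'donor-1']
```

Here two Samples and two Datasets appear in the same chain, which is the same-type nesting described in the two list items above.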

The introduction tab of the spreadsheet describes the contents of the rest of the document.

Endpoint Use Cases

The spreadsheet identifies 8 use cases that could be satisfied with new endpoints in the search-api.

shirey commented 1 year ago

@AlanSimmons We need to define, if possible, a way to do paging, as there are limits to how much data we can return at one time (both from Elasticsearch and RESTful responses).
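
One possible shape for paging, sketched under assumptions: the `page` and `page_size` parameters and the `add_paging` helper are hypothetical, not part of search-api. The sketch uses the Elasticsearch `from` and `size` request fields and respects the default `index.max_result_window` of 10,000, beyond which Elasticsearch rejects `from` + `size` requests.

```python
MAX_RESULT_WINDOW = 10_000  # Elasticsearch default for index.max_result_window


def add_paging(dsl: dict, page: int = 1, page_size: int = 100) -> dict:
    """Return a copy of the DSL payload with from/size set for the given page."""
    if page < 1 or page_size < 1:
        raise ValueError("page and page_size must be positive")
    start = (page - 1) * page_size
    if start + page_size > MAX_RESULT_WINDOW:
        raise ValueError("requested page exceeds the result window")
    # Build a new top-level dict so the caller's payload is not modified.
    return {**dsl, "from": start, "size": page_size}


query = {"query": {"match_all": {}}}
print(add_paging(query, page=3, page_size=50))
# {'query': {'match_all': {}}, 'from': 100, 'size': 50}
```

Results deeper than the window would need a different mechanism, such as Elasticsearch's `search_after`, which this sketch does not cover.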