API endpoint for verifying entity list for Upload target list

prashantuniyal02 commented 11 months ago

Creating an API endpoint for verifying entity list for enabling upload of a target/disease list

For an uploaded list of targets, we need to match each uploaded entry against the following sets of IDs:

Ensembl
UniProt
HGNC

If an uploaded entry matches multiple results, we will display all of the matched results.

For an uploaded list of diseases, we need to match each uploaded entry against the following sets of IDs:

EFO
(others to be confirmed)

We also need to confirm how both the backend and the frontend should deal with entries that do not yield a match.

jdhayhurst commented 11 months ago

Input:

List of IDs (FE will parse file?)
Two input types:
- "target"
  - Ensembl|UniProt|HGNC IDs
- "disease"
  - EFO
For each ID:
- Assign Type or ignore
- Resolve Target or Disease - Nullable

Output:

List of Targets or Diseases
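
The "Assign Type or ignore" step could be sketched as follows. This is a minimal Python sketch under stated assumptions, not the implementation: the function name `classify_id` and the regular expressions are hypothetical (the UniProt pattern follows UniProt's published accession format), and gene symbols or free-text names are deliberately left unclassified, since they cannot be recognised by pattern alone and would need a lookup.

```python
import re
from typing import Optional, Tuple

# Assumed identifier patterns; adjust to whatever the backend actually accepts.
_PATTERNS = [
    ("target", "ensembl", re.compile(r"ENSG\d{11}")),
    ("target", "hgnc", re.compile(r"HGNC:\d+")),
    ("target", "uniprot", re.compile(
        r"[OPQ][0-9][A-Z0-9]{3}[0-9]"
        r"|[A-NR-Z][0-9](?:[A-Z][A-Z0-9]{2}[0-9]){1,2}")),
    ("disease", "efo", re.compile(r"EFO_\d{7}")),
]


def classify_id(entry: str) -> Optional[Tuple[str, str]]:
    """Return (input type, ID scheme) for a recognised ID, or None to ignore it."""
    candidate = entry.strip()
    for input_type, scheme, pattern in _PATTERNS:
        # fullmatch: the whole entry must be an ID, not merely contain one.
        if pattern.fullmatch(candidate):
            return input_type, scheme
    return None
```

Entries that return `None` here are not necessarily invalid; they are simply the ones that would have to be resolved (or rejected) by the backend.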

jdhayhurst commented 11 months ago

@prashantuniyal02 here are the fields in the API for a single target (ENSG00000001626). As I understand it, we'd like to be able to extract the full Target object based on any of Ensembl ID: target.id, HGNC ID: target.approvedSymbol, or UniProt ID. UniProt is a bit more complex, in that there are many sources and a one:many relationship between target and proteinId. Are we only checking against Swiss-Prot, or against any source?

{
  "data": {
    "target": {
      "id": "ENSG00000001626",
      "approvedSymbol": "CFTR",
      "proteinIds": [
        {
          "id": "P13569",
          "source": "uniprot_swissprot"
        },
        {
          "id": "A0A024R730",
          "source": "uniprot_trembl"
        },
        {
          "id": "A0A3B3IT97",
          "source": "uniprot_trembl"
        },
        {
          "id": "A0A3B3ITE0",
          "source": "uniprot_trembl"
        },
        {
          "id": "A0A3B3ITW0",
          "source": "uniprot_trembl"
        },
        {
          "id": "A0A3B3ITW5",
          "source": "uniprot_trembl"
        },
        {
          "id": "A0A669KBE8",
          "source": "uniprot_trembl"
        },
        {
          "id": "A0A8I5KVL1",
          "source": "uniprot_trembl"
        },
        {
          "id": "A0A8I5KVV2",
          "source": "uniprot_trembl"
        },
        {
          "id": "A0A8I5KXQ9",
          "source": "uniprot_trembl"
        },
        {
          "id": "A0A8V8TNG7",
          "source": "uniprot_trembl"
        },
        {
          "id": "A0A8V8TNH2",
          "source": "uniprot_trembl"
        },
        {
          "id": "A0A8V8TNN0",
          "source": "uniprot_trembl"
        },
        {
          "id": "A0A8V8TNN7",
          "source": "uniprot_trembl"
        },
        {
          "id": "A0A8V8TPV6",
          "source": "uniprot_trembl"
        },
        {
          "id": "A0A8V8TQ89",
          "source": "uniprot_trembl"
        },
        {
          "id": "A0A8V8TQ94",
          "source": "uniprot_trembl"
        },
        {
          "id": "C9J6L5",
          "source": "uniprot_trembl"
        },
        {
          "id": "E7EPB6",
          "source": "uniprot_trembl"
        },
        {
          "id": "H0Y8A9",
          "source": "uniprot_trembl"
        },
        {
          "id": "M0QYZ3",
          "source": "uniprot_trembl"
        },
        {
          "id": "Q20BG8",
          "source": "uniprot_obsolete"
        },
        {
          "id": "Q20BH2",
          "source": "uniprot_obsolete"
        },
        {
          "id": "Q2I0A1",
          "source": "uniprot_obsolete"
        },
        {
          "id": "Q2I102",
          "source": "uniprot_obsolete"
        }
      ]
    }
  }
}

prashantuniyal02 commented 11 months ago

I think using "uniprot_swissprot" makes the most sense
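
For illustration, restricting the match to Swiss-Prot amounts to filtering `proteinIds` on the `source` field. A minimal Python sketch, assuming the response shape shown above; the helper name `swissprot_ids` is hypothetical:

```python
from typing import Any, Dict, List


def swissprot_ids(target: Dict[str, Any]) -> List[str]:
    """Return the proteinIds entries whose source is "uniprot_swissprot"."""
    return [
        protein["id"]
        for protein in target.get("proteinIds", [])
        if protein.get("source") == "uniprot_swissprot"
    ]


# Abbreviated copy of the CFTR response shown above.
cftr = {
    "id": "ENSG00000001626",
    "approvedSymbol": "CFTR",
    "proteinIds": [
        {"id": "P13569", "source": "uniprot_swissprot"},
        {"id": "A0A024R730", "source": "uniprot_trembl"},
        {"id": "Q20BG8", "source": "uniprot_obsolete"},
    ],
}
print(swissprot_ids(cftr))  # ['P13569']
```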

jdhayhurst commented 11 months ago

New plan, based on discussions with @prashantuniyal02 and @d0choa, is to use the "Bs" filter on the associatedDiseases and associatedTargets endpoints. This relies on Open Targets Target/Disease IDs, i.e. Ensembl/EFO IDs. To resolve these from a user-uploaded list of targets/diseases, we will expose a batch search API, which will return a search object similar to that of the existing Search. To begin with, the batch search will do exact matches only, on the keywords and id fields of the targetSearch and diseaseSearch tables.

jdhayhurst commented 11 months ago

Having explored the search_target index, exact ("term" in ES terminology) queries to the keywords.raw field will enable us to resolve any of the keyword terms to a target ID. We can use a multi-term query to make all the queries in one round-trip. This should be fast because there are no analysis or scoring steps.
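
As a rough illustration of the request body such a lookup might send, here is a minimal Python sketch. It assumes "multi-term query" means the Elasticsearch `terms` query, wrapped in `constant_score` so that no relevance scoring takes place; the function name `build_terms_query` is hypothetical and the actual backend code may build the query differently.

```python
import json
from typing import Any, Dict, List


def build_terms_query(terms: List[str], field: str = "keywords.raw") -> Dict[str, Any]:
    """Build an Elasticsearch request body that exact-matches any of `terms`."""
    return {
        "query": {
            # constant_score + filter: documents either match or not, no scoring.
            "constant_score": {"filter": {"terms": {field: terms}}}
        }
    }


body = build_terms_query(["DNMT3A", "ENSG00000225491"])
print(json.dumps(body, indent=2))
```

All terms travel in a single request, which is what makes the one-round-trip lookup possible.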

If we want the search behaviour to be "match", i.e. non-exact, this can also be achieved, but we'd expect the response to be slightly slower and it may introduce ambiguities. We could write the code so that the query type (exact, non-exact, or both) is configurable. I'd suggest this because, when resolving target IDs, we need to be exact, but for most other searches in the platform exact matching is unlikely to be desirable.

For the response, I think it should be a list of SearchResults, i.e. a list of what you would receive when you make a single search. Additionally, I would like to add the query to the SearchResults object, so that it's clear to the client which results go with which query.

jdhayhurst commented 11 months ago

I've been digging into the search endpoint and I think making a generic batch search is not necessary for this use case.

First, the existing search endpoint already facilitates batch searching! It utilises the "simple query string" search, which allows operators in the query string. So, assuming I understand the meaning of "batch search", you can already do this with the "OR" operator, e.g. "ACHE|INS|ANG" on the target entity. Which is pretty cool!

Secondly, the current search approach and response are built on the principle that you are making full-text queries. The results are "hits" with "scores" etc., and the search operates in a specific way across the fields in the search indices. Here, we want to do something simpler: an "exact" term query on the keyword field of either the search_disease or search_target index. We specifically don't want any of the ambiguity that full-text search may introduce.

For what we want to do, this existing generic method, or something close to it, should work. We can then return a response that is a mapping for each queried term. From the chat @carcruz and I had, the API could look something like this (mappings and results would be arrays):

query resolveTargets {
  keywordTermsQuery(
    terms: ["DNMT3A","LOC100130268", "Double-stranded RNA-specific editase 1", "ENSG00000225491", "not a target"], 
    entity: "target") {
    mappings {
      query
      isMapped
      results {
        id
      }
    }
  } 
}

On the other hand, we could expand the existing targets Query endpoint by adding another argument for terms e.g.:

query resolveTargets {
  targets(terms: ["DNMT3A","LOC100130268", "Double-stranded RNA-specific editase 1", "ENSG00000225491", "not a target"]) {
    id
  }
}

The issue with this option is you don't know what mapped to what, but I'm not sure yet how straightforward it will be to provide those mappings.

Do you have any thoughts or preferences on these @d0choa or @carcruz?

d0choa commented 11 months ago

@jdhayhurst if I understand this correctly, the question is how relevant it is to know what mapped to what, or whether a term had a mapping at all? @carcruz thoughts?

jdhayhurst commented 11 months ago

Yes, basically: would you be happy with a response that's a list of search results (like the current search), or do you need the individual mappings between each term in the query list and its own search results?

jdhayhurst commented 11 months ago

Using the existing search endpoint I was able to add an exact keyword matching option, isKeywordSearch, and keep the existing search response structure. If you run the following query without isKeywordSearch, it will give you 6111 hits, but with the exact keyword matching it gives exactly 3 (one to one for the 3 terms queried). Let me know if this is what you're after.

query SearchQuery {
  search(
    queryString: "ACHE|INS|ANG"
    entityNames: ["target"]
    page: {index: 0, size: 5}
    isKeywordSearch: true
  ) {
    total
    hits {
      id
      object {
        ... on Target {
          id
          approvedSymbol
        }
      }
    }
  }
}

The response for the above is:

{
  "data": {
    "search": {
      "total": 3,
      "hits": [
        {
          "id": "ENSG00000087085",
          "object": {
            "id": "ENSG00000087085",
            "approvedSymbol": "ACHE"
          }
        },
        {
          "id": "ENSG00000214274",
          "object": {
            "id": "ENSG00000214274",
            "approvedSymbol": "ANG"
          }
        },
        {
          "id": "ENSG00000254647",
          "object": {
            "id": "ENSG00000254647",
            "approvedSymbol": "INS"
          }
        }
      ]
    }
  }
}

jdhayhurst commented 11 months ago

After discussion with @d0choa and @carcruz, we agreed to move this behaviour to a separate endpoint, perhaps mapIds or something similar. We can then customise the endpoint to suit the needs of the ID mapping task without modifying the search endpoint in an undesirable way. Mappings could be achieved by exposing the "highlight" that comes back from ES in the positive case. For negative mappings (terms without hits), some post-processing is required which, if too complex, could be added in the next iteration.

jdhayhurst commented 11 months ago

Here's the custom endpoint for mapping IDs. Please can you let me know if this works for you @carcruz? The "total" is the number of hits, but not everything will necessarily map. The unmapped terms still appear in the response, but don't have any hits. I think this is useful to know.

Request example for target ID mapping (some terms map, some don't):

query MappingQuery {
  mapIds(
    queryTerms: ["ACHE","INS","ANG","not going to map", "Double-stranded RNA-specific editase 1"]
    entityNames: ["target"]
  ) {
    total
    mappings {
      term
      hits {
        id
      }
    }
  }
}

Response

{
  "data": {
    "mapIds": {
      "total": 4,
      "mappings": [
        {
          "term": "ACHE",
          "hits": [
            {
              "id": "ENSG00000087085"
            }
          ]
        },
        {
          "term": "INS",
          "hits": [
            {
              "id": "ENSG00000254647"
            }
          ]
        },
        {
          "term": "ANG",
          "hits": [
            {
              "id": "ENSG00000214274"
            }
          ]
        },
        {
          "term": "not going to map",
          "hits": []
        },
        {
          "term": "Double-stranded RNA-specific editase 1",
          "hits": [
            {
              "id": "ENSG00000197381"
            }
          ]
        }
      ]
    }
  }
}
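
On the client side, a response of this shape can be separated into mapped and unmapped terms. A minimal Python sketch, assuming the response structure shown above; the helper name `split_mappings` is hypothetical and not part of the API:

```python
from typing import Any, Dict, List, Tuple


def split_mappings(response: Dict[str, Any]) -> Tuple[Dict[str, List[str]], List[str]]:
    """Return ({term: [hit ids]}, [terms without hits]) from a mapIds response."""
    mapped: Dict[str, List[str]] = {}
    unmapped: List[str] = []
    for mapping in response["data"]["mapIds"]["mappings"]:
        ids = [hit["id"] for hit in mapping["hits"]]
        if ids:
            mapped[mapping["term"]] = ids
        else:
            # An empty hits list is how the endpoint reports an unmapped term.
            unmapped.append(mapping["term"])
    return mapped, unmapped
```

Applied to the response above, the unmapped list would contain only "not going to map".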

jdhayhurst commented 11 months ago

Just to note that the limit for the number of terms that can be queried at once is 65,536 (this is the Elastic default), but can be changed if we need.
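
If an uploaded list could ever exceed that limit, one option would be for the caller to split it into batches. A minimal Python sketch; `MAX_TERMS` and `chunk_terms` are hypothetical names, and the value is simply the limit quoted above:

```python
from typing import Iterator, List

MAX_TERMS = 65_536  # limit quoted above; assumed to be configurable server-side


def chunk_terms(terms: List[str], size: int = MAX_TERMS) -> Iterator[List[str]]:
    """Yield consecutive slices of `terms`, each at most `size` items long."""
    if size < 1:
        raise ValueError("size must be a positive integer")
    for start in range(0, len(terms), size):
        yield terms[start:start + size]


print(list(chunk_terms(["ACHE", "INS", "ANG"], size=2)))
# [['ACHE', 'INS'], ['ANG']]
```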

d0choa commented 11 months ago

Functionally looks good. Questions:

Data

@DSuveges is asking how it will look when there is an ambiguous mapping. He suggests looking at DLC1

API-FE cc @carcruz:
- I feel we should probably take full advantage of GraphQL and resolve the hits using Target, Disease or Drug. This is the way the "SearchResult" is implemented. This should unblock more magic in the FE
- The previous point would also resolve the problem of having a mixture of entities in the result. From the current response, there is no way to know which entity a given hit belongs to.
- Do we want/need any kind of BE pagination, @carcruz?

jdhayhurst commented 11 months ago

@d0choa, I should have mentioned that the endpoint borrows the same entity and pagination logic as search. So you can specify entities and pages in the same way. It also inherits the same aggregation and search result objects from search, so for instance if you searched for a term on "target" and "disease" entities, you could return the entity fields like this (there are probably other ways to do it):

query MappingQuery {
  mapIds(
    queryTerms: ["ACHE"]
    entityNames: ["target", "disease"]
  ) {
    total
    mappings {
      term
      hits {
        entity
        id
      }
    }
  }
}

{
  "data": {
    "mapIds": {
      "total": 2,
      "mappings": [
        {
          "term": "ACHE",
          "hits": [
            {
              "entity": "target",
              "id": "ENSG00000087085"
            },
            {
              "entity": "disease",
              "id": "EFO_0003843"
            }
          ]
        }
      ]
    }
  }
}
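
With the `entity` field requested, a client can separate mixed results per entity. A minimal Python sketch, assuming the response shape shown above; the helper name `ids_by_entity` is hypothetical:

```python
from collections import defaultdict
from typing import Any, Dict, List


def ids_by_entity(response: Dict[str, Any]) -> Dict[str, List[str]]:
    """Group every hit id in a mapIds response under its entity name."""
    grouped: Dict[str, List[str]] = defaultdict(list)
    for mapping in response["data"]["mapIds"]["mappings"]:
        for hit in mapping["hits"]:
            grouped[hit["entity"]].append(hit["id"])
    return dict(grouped)
```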

Pagination to return the second page with a size of 1 would look like:

query MappingQuery {
  mapIds(
    queryTerms: ["ACHE"]
    entityNames: ["target", "disease"]
    page: {index: 1, size: 1}
  ) {
    total
    mappings {
      term
      hits {
        entity
        id
      }
    }
  }
}

{
  "data": {
    "mapIds": {
      "total": 2,
      "mappings": [
        {
          "term": "ACHE",
          "hits": [
            {
              "entity": "disease",
              "id": "EFO_0003843"
            }
          ]
        }
      ]
    }
  }
}

jdhayhurst commented 11 months ago

@DSuveges DLC1 looks like this:

query MappingQuery {
  mapIds(
    queryTerms: ["DLC1"]
    entityNames: ["target"]
  ) {
    total
    mappings {
      term
      hits {
        id
      }
    }
  }
}

{
  "data": {
    "mapIds": {
      "total": 3,
      "mappings": [
        {
          "term": "DLC1",
          "hits": [
            {
              "id": "ENSG00000088986"
            },
            {
              "id": "ENSG00000164741"
            },
            {
              "id": "ENSG00000008226"
            }
          ]
        }
      ]
    }
  }
}
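
A client that wants to flag ambiguous mappings like this one could look for terms with more than one hit. A minimal Python sketch, assuming the response shape shown above; the helper name `ambiguous_terms` is hypothetical:

```python
from typing import Any, Dict, List


def ambiguous_terms(response: Dict[str, Any]) -> Dict[str, List[str]]:
    """Return {term: [hit ids]} for every term that mapped to more than one id."""
    return {
        mapping["term"]: [hit["id"] for hit in mapping["hits"]]
        for mapping in response["data"]["mapIds"]["mappings"]
        if len(mapping["hits"]) > 1
    }
```

For the DLC1 response above, this would return the single term DLC1 with its three Ensembl IDs, which the frontend could then present to the user for disambiguation.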

API endpoint for verifying entity list for Upload target list #3114