monarch-initiative / biolink-api

API for linked biological knowledge
https://api.monarchinitiative.org/api/
BSD 3-Clause "New" or "Revised" License

Taxon Label to ID Mapper Endpoint #392

Closed falquaddoomi closed 2 years ago

falquaddoomi commented 2 years ago

This PR adds a new endpoint, /ontol/identifier/, that accepts a list of labels and produces matching IDs. Specifically, it queries on the label side of the <ID> rdfs:label <label> relation, much like the existing /ontol/labeler/ except that it looks up by label rather than by ID. All matching results are returned for a given label, unlike /ontol/labeler/, which returns just the first result. The results are returned in the following format:

{
  "<label>": [<ID:str>, ... ], ...
}

A list of labels can be supplied as the label parameter, via either GET (as a querystring param, e.g. ?label=<first>&label=<second>...) or POST (as a querystring param or in a JSON-encoded body like {"label": [<first>, <second>, ...]}).
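As a rough illustration of the two request styles described above, here is a minimal Python sketch that only builds the GET URL and the JSON body, without sending anything. The helper names are made up for this example, and the base URL is the public API address from the repo header; whether this endpoint is actually deployed there is an assumption.

```python
import json
from urllib.parse import urlencode

# Base URL taken from the repo header; availability of the new endpoint
# on this host is an assumption, not something confirmed in this PR.
BASE_URL = "https://api.monarchinitiative.org/api"


def identifier_get_url(labels):
    """Build a GET URL that repeats the `label` param once per label."""
    query = urlencode([("label", label) for label in labels])
    return f"{BASE_URL}/ontol/identifier/?{query}"


def identifier_post_body(labels):
    """Build the JSON-encoded POST body: {"label": [...]}."""
    return json.dumps({"label": list(labels)})


if __name__ == "__main__":
    labels = ["Homo sapiens", "Mus musculus"]
    print(identifier_get_url(labels))
    print(identifier_post_body(labels))
```

Either string could then be handed to whatever HTTP client is in use; the response would be expected to follow the `{"<label>": [<ID>, ...]}` shape shown above.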

This PR requires ontobio to be up to commit https://github.com/monarch-initiative/ontobio/commit/91222c8b442196d6eeeafeb6073946494e8a3a10. (I'll issue individual PRs to the main ontobio repo once the short-term UI needs are settled.)

Closes issue #391.

vincerubinetti commented 2 years ago

@falquaddoomi I just realized there's probably nowhere this change is deployed where I can test it. What I was going to do is take this list of example taxon labels...

"Sus scrofa",
"Drosophila melanogaster",
"Homo sapiens",
"Mus musculus",
"Bos taurus",
"Saccharomyces cerevisiae S288C",
"Xenopus tropicalis",
"Danio rerio",
"Gallus gallus",
"Anolis carolinensis",
"Canis lupus familiaris",
"Felis catus",
"Macaca mulatta",
"Monodelphis domestica",
"Ornithorhynchus anatinus",
"Pan troglodytes",
"Rattus norvegicus",
"Takifugu rubripes",
"Equus caball",

... and make sure they can map to ids, and then back to labels again, without any "loss" (id mapping to multiple labels or vice versa, or failing to find a match). Would you be able to test this locally?

falquaddoomi commented 2 years ago

Hey @vincerubinetti, sure, I can test that locally. I'll also make a test case out of it.

Also, for future situations like this, I'm currently writing a script to deploy a temporary "preview" VM with the upcoming biolink-api version running on it. I'll see if I can integrate it into the PR process so it can be used for testing. (The VM will be marked preemptible, so it'll be low-cost and will be terminated after at most 24 hours.)

vincerubinetti commented 2 years ago

Just tested it for the first 5 examples. It worked for all of them except for "Drosophila melanogaste", where it returned nothing. But its ID, "NCBITaxon:7227", does return the label in the labeler endpoint. The hard-coded mapping in UI 2.0 and 3.0 also happens to contain it.

I'm guessing this is some deeper (data quality?) issue, rather than something in this PR? If so, not sure what to do here.

seandavi commented 2 years ago

Not sure if it is as simple as this, but "Drosophila melanogaste" has an "R" at the end: "Drosophila melanogaster"

falquaddoomi commented 2 years ago

So, I coded up the test you proposed and there were a few issues:

  1. "Equus caball" has no ID matches -- perhaps that's supposed to be "Equus caballus"? Using that produces a clean map for that entry.
  2. There are apparently multiple labels in the database that map to an ID 😬 . For example, the new /ontol/identifier/ endpoint maps "Felis catus" to NCBITaxon:9685, but /ontol/labeler/ (the old ID-to-label endpoint) maps that ID to "cat".

Specifically, there are three examples in the fixed label list that don't map to the labeler:

Fortunately, querying /ontol/identifier/ for 'cat' does produce NCBITaxon:9685, but it's not the first element in the list. I should probably amend /ontol/labeler/ to return all the possible labels and not just the first one, since the ordering is apparently arbitrary.
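For reference, the round-trip check discussed in this thread can be sketched roughly as follows. This is not the actual test case from the PR: `check_round_trip` is a hypothetical helper, and the two dicts are small stand-ins for the /ontol/identifier/ and /ontol/labeler/ endpoints, filled in only with the mappings mentioned in the comments above.

```python
def check_round_trip(labels, identify, label_of):
    """Return {label: problem} for labels that fail label -> ID -> label.

    `identify(label)` returns a list of IDs (like /ontol/identifier/);
    `label_of(id)` returns a single label (like /ontol/labeler/).
    """
    problems = {}
    for label in labels:
        ids = identify(label)
        if not ids:
            problems[label] = "no matching ID"
        elif len(ids) > 1:
            problems[label] = f"ambiguous: {len(ids)} IDs"
        else:
            back = label_of(ids[0])
            if back != label:
                problems[label] = f"maps back to {back!r}"
    return problems


# Stand-in data (illustrative only, not real endpoint output).
LABEL_TO_IDS = {
    "Drosophila melanogaster": ["NCBITaxon:7227"],
    "Felis catus": ["NCBITaxon:9685"],
}
ID_TO_LABEL = {
    "NCBITaxon:7227": "Drosophila melanogaster",
    "NCBITaxon:9685": "cat",
}

if __name__ == "__main__":
    report = check_round_trip(
        ["Drosophila melanogaster", "Felis catus", "Equus caball"],
        identify=lambda label: LABEL_TO_IDS.get(label, []),
        label_of=ID_TO_LABEL.get,
    )
    for label, problem in report.items():
        print(f"{label}: {problem}")
```

Swapping the two lambdas for real HTTP calls would turn this into an integration check; keeping them as plain callables makes the logic testable without a deployed instance.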

vincerubinetti commented 2 years ago

@seandavi That was the problem, I didn't copy it properly.

Apologies yes, "caballus" was a typo. FWIW this list is the taxon facets returned from searching for "SSH":

https://api.monarchinitiative.org/api/search/entity/SSH?boost_q=category:disease^5&boost_q=category:phenotype^5&boost_q=category:gene^0&boost_q=category:genotype^-10&boost_q=category:variant^-35&min_match=67%&prefix=-OMIA&rows=10&start=0

I'm not sure what to do about the other problems though :/ @putmantime ?

falquaddoomi commented 2 years ago

Yeah, I don't know...well, the good news is that all of these results are produced from the same set of (ID, label) pairs, so if you get a label or an ID back from one endpoint you're guaranteed to get a result when querying the other endpoint. It might not be the result you expect (especially because /ontol/labeler/ just returns the first element, so they're definitionally not reversible functions), but you're guaranteed to get something.

vincerubinetti commented 2 years ago

Perhaps the solution is simply to return a "taxon id" facet instead of a "taxon label" facet. That seems less ambiguous. With the ids I can then just use the labeler endpoint to get nice human-readable labels to display to the user. For this fix, we'd want to make sure we do this in ALL of the cases where biolink is returning taxon label facets. So far the only places I've seen this are the search endpoint and all of the association endpoints, but I bet there are more.

We need input from @putmantime or someone from the tislab before we can continue.

vincerubinetti commented 2 years ago

A small addendum to this:

I'm in the process of incorporating this into the frontend, and I noticed that the search endpoints that take a taxon filter only seem to work with NCBITaxon ids, and not with OMIM, etc.

As such, you may want to make the NCBITaxon ids always show up first in the list of matches? I've made the frontend prefer them, but maybe it'd be good to put that in the backend too for anyone using the endpoint directly.

Also, perhaps it's time to delete the vestigial _taxon_map facet? At the moment I'm just explicitly deleting it from the facets.
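The NCBITaxon-first ordering suggested above could be sketched like this, applied to a response in the `{"<label>": [<ID>, ...]}` format from the PR description. The function names are hypothetical and the `EXAMPLE:` IDs are placeholders, not real identifiers; this is one possible approach, not what the backend or frontend actually does.

```python
def prefer_ncbitaxon(ids):
    """Move NCBITaxon IDs to the front, keeping relative order otherwise.

    sorted() is stable, and False sorts before True, so IDs with the
    NCBITaxon prefix come first without reshuffling the rest.
    """
    return sorted(ids, key=lambda curie: not curie.startswith("NCBITaxon:"))


def reorder_response(response):
    """Apply the NCBITaxon-first ordering to every label's ID list."""
    return {label: prefer_ncbitaxon(ids) for label, ids in response.items()}


if __name__ == "__main__":
    # Placeholder response; the EXAMPLE: IDs are invented for illustration.
    response = {"cat": ["EXAMPLE:0001", "NCBITaxon:9685", "EXAMPLE:0002"]}
    print(reorder_response(response))
```

Because the sort is stable, a caller that only reads the first element gets an NCBITaxon ID whenever one exists, and the ordering among the remaining IDs is left as the backend returned it.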