monarch-initiative / biolink-api

API for linked biological knowledge
https://api.monarchinitiative.org/api/
BSD 3-Clause "New" or "Revised" License

Include taxon id with taxon label in facet count of entity search endpoint #386

Open vincerubinetti opened 2 years ago

vincerubinetti commented 2 years ago

I'm developing the 3.0 version of the monarch ui/website, and I've run into a limitation. @putmantime

Here is an example response from the /search/entity/{term} endpoint, searching "ssh":

{
  "numFound": 177,
  "docs": [
    {
      "id": "FlyBase:FBgn0029157",
      "id_std": "FlyBase:FBgn0029157",
      "id_eng": "FlyBase:FBgn0029157",
      "id_kw": "FlyBase:FBgn0029157",
      "prefix": "FlyBase",
      "label": ["ssh"],
      "label_std": ["ssh"],
      "label_eng": ["ssh"],
      "label_kw": ["ssh"],
      "edges": 319,
      "taxon": "NCBITaxon:7227",
      "taxon_std": "NCBITaxon:7227",
      "taxon_eng": "NCBITaxon:7227",
      "taxon_kw": "NCBITaxon:7227",
      "taxon_label": "Drosophila melanogaster",
      "taxon_label_std": "Drosophila melanogaster",
      "taxon_label_eng": "Drosophila melanogaster",
      "taxon_label_kw": "Drosophila melanogaster",
      "taxon_label_synonym": ["fruit fly", "Sophophora melanogaster"],
      "taxon_label_synonym_std": ["fruit fly", "Sophophora melanogaster"],
      "taxon_label_synonym_eng": ["fruit fly", "Sophophora melanogaster"],
      "taxon_label_synonym_kw": ["fruit fly", "Sophophora melanogaster"],
      "has_phenotype": false,
      "category": ["gene", "sequence feature"],
      "category_std": ["gene", "sequence feature"],
      "category_eng": ["gene", "sequence feature"],
      "category_kw": ["gene", "sequence feature"],
      "synonym": [
        "slingshot",
        "Dmel\\CG6238",
        "SSH",
        "Ssh",
        "MKP-like",
        "Slingshot",
        "CG6238-PA",
        "Mkph",
        "CG6238-PB",
        "CG6238",
        "MKP",
        "CG6238-PC",
        "CG6238-PD",
        "ssh-PB",
        "ssh-PA",
        "ssh-PD",
        "ssh-PC",
        "l(3)01207",
        "MAP-kinase-phosphatase"
      ],
      "synonym_std": [
        "slingshot",
        "Dmel\\CG6238",
        "SSH",
        "Ssh",
        "MKP-like",
        "Slingshot",
        "CG6238-PA",
        "Mkph",
        "CG6238-PB",
        "CG6238",
        "MKP",
        "CG6238-PC",
        "CG6238-PD",
        "ssh-PB",
        "ssh-PA",
        "ssh-PD",
        "ssh-PC",
        "l(3)01207",
        "MAP-kinase-phosphatase"
      ],
      "synonym_eng": [
        "slingshot",
        "Dmel\\CG6238",
        "SSH",
        "Ssh",
        "MKP-like",
        "Slingshot",
        "CG6238-PA",
        "Mkph",
        "CG6238-PB",
        "CG6238",
        "MKP",
        "CG6238-PC",
        "CG6238-PD",
        "ssh-PB",
        "ssh-PA",
        "ssh-PD",
        "ssh-PC",
        "l(3)01207",
        "MAP-kinase-phosphatase"
      ],
      "synonym_kw": [
        "slingshot",
        "Dmel\\CG6238",
        "SSH",
        "Ssh",
        "MKP-like",
        "Slingshot",
        "CG6238-PA",
        "Mkph",
        "CG6238-PB",
        "CG6238",
        "MKP",
        "CG6238-PC",
        "CG6238-PD",
        "ssh-PB",
        "ssh-PA",
        "ssh-PD",
        "ssh-PC",
        "l(3)01207",
        "MAP-kinase-phosphatase"
      ],
      "equivalent_curie": [
        "FB:FBgn0029157",
        "NCBIGene:42986",
        "NCBI-Gene:42986",
        "NCBI.Gene:42986",
        "Entrez:42986",
        "Entrez.Gene:42986",
        "EntrezGene:42986",
        "Entrez-Gene:42986",
        "Gene:42986",
        "ENSEMBL:FBgn0029157"
      ],
      "equivalent_curie_std": [
        "FB:FBgn0029157",
        "NCBIGene:42986",
        "NCBI-Gene:42986",
        "NCBI.Gene:42986",
        "Entrez:42986",
        "Entrez.Gene:42986",
        "EntrezGene:42986",
        "Entrez-Gene:42986",
        "Gene:42986",
        "ENSEMBL:FBgn0029157"
      ],
      "equivalent_curie_eng": [
        "FB:FBgn0029157",
        "NCBIGene:42986",
        "NCBI-Gene:42986",
        "NCBI.Gene:42986",
        "Entrez:42986",
        "Entrez.Gene:42986",
        "EntrezGene:42986",
        "Entrez-Gene:42986",
        "Gene:42986",
        "ENSEMBL:FBgn0029157"
      ],
      "equivalent_curie_kw": [
        "FB:FBgn0029157",
        "NCBIGene:42986",
        "NCBI-Gene:42986",
        "NCBI.Gene:42986",
        "Entrez:42986",
        "Entrez.Gene:42986",
        "EntrezGene:42986",
        "Entrez-Gene:42986",
        "Gene:42986",
        "ENSEMBL:FBgn0029157"
      ],
      "leaf": true,
      "_version_": 1696524917734899700,
      "score": 117.35552
    }
  ],
  "facet_counts": {
    "category": {
    },
    "taxon_label": {
      "Sus scrofa": 25,
      "Drosophila melanogaster": 21,
      "Homo sapiens": 18,
      "Mus musculus": 16,
      "Bos taurus": 6,
      "Saccharomyces cerevisiae S288C": 6,
      "Xenopus tropicalis": 6,
      "Danio rerio": 5,
      "Gallus gallus": 4,
      "Anolis carolinensis": 3,
      "Canis lupus familiaris": 3,
      "Felis catus": 3,
      "Macaca mulatta": 3,
      "Monodelphis domestica": 3,
      "Ornithorhynchus anatinus": 3,
      "Pan troglodytes": 3,
      "Rattus norvegicus": 3,
      "Takifugu rubripes": 3,
      "Equus caballus": 2
    }
  },
  "highlighting": {}
}

Notice that taxon_label is being returned for facets, instead of taxon (id). This is nice for displaying a list of taxon facets, but not for actually filtering by them, because the endpoint only supports filtering by taxon (id), not taxon_label.

This requires the frontend to maintain a hard-coded label-to-id mapping for taxa. That duplicates information we already have in biolink, is brittle, and is likely to get out of sync.

And yes, I can look up taxon from docs by finding the corresponding taxon_label field. However, then I would need to make sure all results are in docs so I have all the mappings, and that might go beyond the max rows [per page] param.
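To make the limitation concrete, here is a minimal sketch of that docs-based workaround. The function names are hypothetical, and the response is trimmed to only the fields shown in the example above. It reports which facet labels cannot be resolved to an id when only some results fit on the page:

```python
def label_to_id_from_docs(response):
    """Map each taxon_label found in `docs` to its taxon id."""
    mapping = {}
    for doc in response.get("docs", []):
        label = doc.get("taxon_label")
        taxon_id = doc.get("taxon")
        if label and taxon_id:
            mapping[label] = taxon_id
    return mapping


def unmapped_facet_labels(response):
    """Return facet labels that have no id because no matching doc is on this page."""
    facet_labels = response.get("facet_counts", {}).get("taxon_label", {})
    known = label_to_id_from_docs(response)
    return sorted(label for label in facet_labels if label not in known)


# Trimmed-down response: one doc on the page, three taxa in the facets.
response = {
    "docs": [
        {"taxon": "NCBITaxon:7227", "taxon_label": "Drosophila melanogaster"}
    ],
    "facet_counts": {
        "taxon_label": {
            "Sus scrofa": 25,
            "Drosophila melanogaster": 21,
            "Homo sapiens": 18,
        }
    },
}

print(unmapped_facet_labels(response))
```

Any label in that list is a facet the UI can display but cannot filter by without extra requests or a hard-coded table.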


Possible solutions:

falquaddoomi commented 2 years ago

It's not exactly what you're asking for, but would a facet structure like this work?:

"facet_counts": {
    "category": {
        "disease": 27,
        "publication": 9,
        "anatomical entity": 5,
        "cell": 5,
        "gene": 2,
        "sequence feature": 2,
        "phenotype": 1,
        "quality": 1
    },
    "taxon": {
        "NCBITaxon:9031": 1,
        "NCBITaxon:9606": 1
    },
    "taxon_label": {
        "Gallus gallus": 1,
        "Homo sapiens": 1
    },
    "_taxon_map": {
        "NCBITaxon:9031": {
            "Gallus gallus": 1
        },
        "NCBITaxon:9606": {
            "Homo sapiens": 1
        }
    }
}

Two things are different here: 1) there's a new taxon facet that groups results by taxon ID, and 2) there's a _taxon_map entry in facet_counts that groups first by taxon ID, then by taxon label, with the value being the count of results with both that ID and label. AFAIK there should be a one-to-one mapping between ID and label, so there'll always be just one child of the ID node, but in case there isn't, this structure will still work.
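As a rough illustration of how a client might consume the nested form, the sketch below resolves a displayed label back to the taxon ID(s) to filter by. `ids_for_label` is a hypothetical helper, not part of biolink or ontobio, and the input mirrors the `_taxon_map` example above:

```python
def ids_for_label(facet_counts, label):
    """Return every taxon ID whose _taxon_map entry contains the given label."""
    taxon_map = facet_counts.get("_taxon_map", {})
    return [taxon_id for taxon_id, labels in taxon_map.items() if label in labels]


facet_counts = {
    "_taxon_map": {
        "NCBITaxon:9031": {"Gallus gallus": 1},
        "NCBITaxon:9606": {"Homo sapiens": 1},
    }
}

print(ids_for_label(facet_counts, "Homo sapiens"))
```

Returning a list rather than a single ID keeps the lookup valid even if the ID-to-label mapping turns out not to be one-to-one.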

If so, I have this implemented in my fork of the ontobio library -- here's where the _taxon_map key is injected into the facet counts: https://github.com/falquaddoomi/ontobio/blob/92231d447a/ontobio/golr/golr_query.py#L603. I assume we'll have to figure out who downstream might be affected by this...maybe the best way is to submit a PR?

vincerubinetti commented 2 years ago

That's fine with me. If this is easier to implement or more consistent with how other things and data structures in biolink are implemented, I'd say go for it.

putmantime commented 2 years ago

Faisal, is the main reason you chose that structure that it supports one-to-many id-to-label mappings? I don't believe that will be the case, as we have chosen the NCBI id/label pair for a taxon.
If that's right, I think the most explicit and easily readable structure would be a list with one object per taxon, each with clear attributes: "_taxon_map": [{ "label": "Gallus gallus", "id": "NCBITaxon:9031", "count": 1 }]

But is a list of objects going to cause even more issues in this case @vincerubinetti ?
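For comparison, converting the nested map into the list-of-objects form proposed above is a small transformation. This is only a sketch with a hypothetical function name; it emits one object per (id, label) pair, so it would not lose data even if an ID carried more than one label:

```python
def taxon_map_to_list(nested_map):
    """Flatten {id: {label: count}} into [{"id", "label", "count"}, ...]."""
    return [
        {"id": taxon_id, "label": label, "count": count}
        for taxon_id, labels in nested_map.items()
        for label, count in labels.items()
    ]


nested = {
    "NCBITaxon:9031": {"Gallus gallus": 1},
    "NCBITaxon:9606": {"Homo sapiens": 1},
}

for entry in taxon_map_to_list(nested):
    print(entry)
```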

falquaddoomi commented 2 years ago

I formatted it that way partly because I wasn't sure whether more than one label might match a given taxon ID, and partly because that structure more closely matches how facet pivots are returned from Solr. If IDs and labels are in fact one-to-one, I agree that the structure you proposed is more readable, and it's a trivial change on my end.
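For reference, a Solr pivot facet on two fields comes back as a list of entries with `field`, `value`, `count`, and a nested `pivot` list. The sketch below shows how such a list could be reshaped into the nested `_taxon_map` form; the sample data is made up to match the example above, and this is not necessarily how the ontobio fork does it:

```python
def pivot_to_taxon_map(pivot_entries):
    """Reshape Solr pivot entries (taxon -> taxon_label) into {id: {label: count}}."""
    taxon_map = {}
    for entry in pivot_entries:
        children = entry.get("pivot", [])
        taxon_map[entry["value"]] = {
            child["value"]: child["count"] for child in children
        }
    return taxon_map


# Illustrative entries, shaped like a pivot facet on "taxon,taxon_label".
pivot_entries = [
    {
        "field": "taxon",
        "value": "NCBITaxon:9031",
        "count": 1,
        "pivot": [{"field": "taxon_label", "value": "Gallus gallus", "count": 1}],
    },
    {
        "field": "taxon",
        "value": "NCBITaxon:9606",
        "count": 1,
        "pivot": [{"field": "taxon_label", "value": "Homo sapiens", "count": 1}],
    },
]

print(pivot_to_taxon_map(pivot_entries))
```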

putmantime commented 2 years ago

Let me do some research and see if I can confirm that the mapping is one-to-one. I wasn't sure what the typical return type from Solr looks like, and standardizing on that might be of more value than the clarity of my proposed structure.