Endpoint for multi-entity association query

oneilsh commented 1 year ago

Now that we have a multi-entity association query on the backend, we need to expose it via the api-v3.monarchinitiative.org API :D Ref https://github.com/monarch-initiative/monarch-app/issues/270

glass-ships commented 12 months ago

See #369

oneilsh commented 11 months ago

Heya @glass-ships, @kevinschaper, would it be possible to make the API results look more like the example in #270? Here's how the result currently looks (just the beginning, requesting a limit of 5 and offset of 0 for a few entities and categories):

[
    {
      "limit": 5,
      "offset": 0,
      "total": 1,
      "id": "MONDO:0017309",
      "name": "neonatal Marfan syndrome",
      "associated_categories": [
        {
          "limit": 20,
          "offset": 0,
          "total": 1,
          "counterpart_category": "biolink:Gene",
          "items": [
            {
              "id": "uuid:c30c59d2-5d8c-11ee-9b27-2b20ed86a9d9",
              "subject": "HGNC:3603",
              "original_subject": "NCBIGene:2200",
              "subject_namespace": "HGNC",
              "subject_category": "biolink:Gene",
              "subject_closure": [],
              "subject_label": "FBN1",
              "subject_closure_label": [],
              "subject_taxon": "NCBITaxon:9606",
              "subject_taxon_label": "Homo sapiens",
              "predicate": "biolink:gene_associated_with_condition",
              "object": "MONDO:0017309",
              "original_object": "Orphanet:284979",

A few notes:

I'm not sure we need to return the requested limit and offset to the user, though maybe doing so is standard in REST, e.g. when defaults are used? If so, I would do it at the very top level only. (There also seems to be a bug: it's reported at multiple levels, as 5 as requested at the top level and as 20 below it.)
We might want to rename the offset and limit params to make it clear that those are per entity/category group
Maybe rename 'items' to associations?
I don't think we should report subject and object info in the items: that requires logic to determine whether the info I'm looking for is in the subject or the object (which depends on the direction of the association), and it repeats a lot of info. Rather, we should report just the counterpart info; we could call it 'counterpart', 'counterpart_label', etc. to make it clear. We could maybe add info about the direction: for example, if the predicate is 'biolink:gene_associated_with_condition', then the gene is the subject, so we could report something like 'counterpart_role': 'subject'.
All the other association-specific info (onset, stage, sex etc) can stay at the lowest association level
If we want to keep the metadata for the searched entity, rather than having it as subject/object info in the lowest level, it could be at the level of the entity, so 'entity_category', 'entity_label', 'entity_closure' etc.
Let's not include totals that can be computed by summing over lower-level totals to avoid confusion

So, I'm thinking a search on association/multi?entity=MONDO%3A0017309&entity=HP%3A0000973&counterpart_category=biolink%3AGene&counterpart_category=biolink%3APhenotypicFeature&limit=5&offset=0 might look something like:

{
    "offset_per_associated_category": 0,
    "limit_per_associated_category": 5,
    "entities": [
        {
          "entity": "MONDO:0017309",
          "entity_name": "neonatal Marfan syndrome",
          "original_entity": "Orphanet:284979",
          "entity_namespace": "MONDO",
          "entity_category": "biolink:Disease",
          "entity_closure": [
            "MONDO:0000001",
            "MONDO:0018230",
            ...
          ],
          "entity_label": "neonatal Marfan syndrome",
          "entity_closure_label": [
            "entity",
            "continuant",
            ...
          ],
          "entity_taxon": null,
          "entity_taxon_label": null,

          "associated_categories": [
            {
              "total": 1,
              "counterpart_category": "biolink:Gene",
              "associations": [
                {
                  "id": "uuid:c30c59d2-5d8c-11ee-9b27-2b20ed86a9d9",

                  "counterpart": "HGNC:3603",
                  "original_counterpart": "NCBIGene:2200",
                  "counterpart_namespace": "HGNC",
                  "counterpart_category": "biolink:Gene",
                  "counterpart_closure": [],
                  "counterpart_label": "FBN1",
                  "counterpart_closure_label": [],
                  "counterpart_taxon": "NCBITaxon:9606",
                  "counterpart_taxon_label": "Homo sapiens",

                  "counterpart_role": "subject",
                  "predicate": "biolink:gene_associated_with_condition",

                  "primary_knowledge_source": "infores:orphanet",
                  "aggregator_knowledge_source": [
                    "infores:monarchinitiative"
                  ],
                  "category": "biolink:CorrelatedGeneToDiseaseAssociation",
                  "negated": null,
                  "provided_by": "hpoa_gene_to_disease_edges",
                  "provided_by_link": {
                    "id": "hpoa_gene_to_disease",
                    "url": "https://monarch-initiative.github.io/monarch-ingest/Sources/hpoa/#gene_to_disease"
                  },
                  "publications": [],
                  "qualifiers": [],
                  "frequency_qualifier": null,
                  "has_evidence": [],
                  "onset_qualifier": null,
                  "sex_qualifier": null,
                  ... # other association-specific data
                  "stage_qualifier_closure": [],
                  "stage_qualifier_closure_label": []
                }
              ]
            },
            {
              "total": 43,
              "counterpart_category": "biolink:PhenotypicFeature",
              "associations": [
                ... (as above, showing only first 5)
              ]
            }
          ]
        },
        {
          "entity": "HP:0000973",
          "entity_name": "Cutis laxa (HPO)",
          ... (as above, including entity_namespace, entity_category, etc)

          "associated_categories": [
            {
              "total": 75,
              "counterpart_category": "biolink:Gene",
              "associations": [
                ... (as above, showing only 5)
              ]
            },
            {
              "total": 3,
              "counterpart_category": "biolink:PhenotypicFeature",
              "associations": [
                ... (as above, showing only 5)
              ]
            }
          ]
        }
    ]
}
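As a small check on the note about totals, here is a sketch showing that per-entity totals are recoverable from the per-category totals in the proposed shape, so they need not be returned. The `entity_totals` helper is hypothetical, and the dictionary is a trimmed copy of the proposal above, not real API output.

```python
from typing import Any


def entity_totals(response: dict[str, Any]) -> dict[str, int]:
    """Sum the per-category totals for each entity in the proposed response."""
    return {
        entity["entity"]: sum(
            group["total"] for group in entity["associated_categories"]
        )
        for entity in response["entities"]
    }


# Trimmed copy of the proposed response; only the fields needed here.
proposed = {
    "offset_per_associated_category": 0,
    "limit_per_associated_category": 5,
    "entities": [
        {
            "entity": "MONDO:0017309",
            "associated_categories": [
                {"total": 1, "counterpart_category": "biolink:Gene"},
                {"total": 43, "counterpart_category": "biolink:PhenotypicFeature"},
            ],
        },
        {
            "entity": "HP:0000973",
            "associated_categories": [
                {"total": 75, "counterpart_category": "biolink:Gene"},
                {"total": 3, "counterpart_category": "biolink:PhenotypicFeature"},
            ],
        },
    ],
}

totals = entity_totals(proposed)
print(totals)  # {'MONDO:0017309': 44, 'HP:0000973': 78}
```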

I could possibly see some other minor UX improvements in the future, e.g. closure as a list of dict instead of entity_closure and entity_closure_label as two separate lists, but those would be for another day; here I'm just looking for the data model to match the intent of directionless-association search :)
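For reference, the "closure as a list of dict" idea could look roughly like the sketch below. It assumes the two closure lists are index-aligned, which this thread does not confirm; `pair_closure` and the `EX:` identifiers are made up for illustration.

```python
def pair_closure(ids: list[str], labels: list[str]) -> list[dict[str, str]]:
    """Merge parallel closure and closure_label lists into one list of dicts.

    Assumes ids[i] and labels[i] describe the same term.
    """
    if len(ids) != len(labels):
        raise ValueError("closure and closure_label must have the same length")
    return [{"id": term_id, "label": label} for term_id, label in zip(ids, labels)]


# Hypothetical placeholder values, not real ontology terms.
closure = pair_closure(["EX:1", "EX:2"], ["first ancestor", "second ancestor"])
print(closure)
```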

glass-ships commented 11 months ago

> I'm not sure if we need to return the requested limit and offset to the user. But maybe doing so is standard in REST though, like if defaults are used? In which case I would do so at the very top level only (there seems to be a bug as well, since it's reported at multiple levels, and as 5 as requested and 20)

It being reported at multiple levels is a side effect of the LinkML data model we use: both MultiEntityAssociationResults and CategoryGroupedAssociationResults extend the Results class, which contains the required limit, offset, and total fields.
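A rough illustration of that inheritance, using plain Python dataclasses as a stand-in for the LinkML-generated classes. The class names come from the comment above; the field layout is an assumption inferred from the response shown earlier in the thread, not the actual monarch-py definitions.

```python
from dataclasses import asdict, dataclass, field, fields


@dataclass
class Results:
    # Required on the base class, so every subclass carries them too.
    limit: int
    offset: int
    total: int


@dataclass
class CategoryGroupedAssociationResults(Results):
    counterpart_category: str
    items: list = field(default_factory=list)


@dataclass
class MultiEntityAssociationResults(Results):
    id: str
    name: str
    associated_categories: list = field(default_factory=list)


inner = CategoryGroupedAssociationResults(
    limit=20, offset=0, total=1, counterpart_category="biolink:Gene"
)
outer = MultiEntityAssociationResults(
    limit=5,
    offset=0,
    total=1,
    id="MONDO:0017309",
    name="neonatal Marfan syndrome",
    associated_categories=[inner],
)

# Both levels serialize their own limit/offset/total.
serialized = asdict(outer)
print([f.name for f in fields(CategoryGroupedAssociationResults)])
print(serialized["limit"], serialized["associated_categories"][0]["limit"])
```

Because the pagination fields live on the shared base class, they cannot be dropped or renamed at one level without either changing the base or introducing classes that do not extend it.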

> We might want to rename the offset and limit params to make it clear that those are per entity/category group

Same thing here.

To address both these concerns, we could switch to creating entirely new data classes for each of these, which would deviate from the established Results model that all other monarch-py responses adhere to. For example, we'd need a secondary Entities class and a secondary Associations class, each containing slots that are similar to, but distinct from, those of their primary counterparts, since LinkML doesn't really allow "renaming" slots depending on which class they're a part of (@kevinschaper feel free to correct me if I'm wrong on that last point). But we definitely could.

> Maybe rename 'items' to associations?

This should be doable without the above mentioned changes.

> I don't think we should report subject and object info in the items, which requires logic to determine if the info I'm looking for is in the subject or the object (which depends on the direction of the association, and repeats a lot of info). Rather, we should report just the counterpart info; we could call it 'counterpart', 'counterpart_label', etc. to make it clear. We could add info about the direction maybe, for example if the predicate is 'biolink:gene_associated_with_condition', then the gene is the subject, so we could report something like 'counterpart_role': 'subject'.

This is a tricky part (for me, anyway), as I don't know the best way to generalize figuring out the "direction", unless we have a complete set of all predicates in the Monarch KG and their predefined directions.
Or maybe it's as simple as "if entity is subject, counterpart_role is object, and vice versa"?

> All the other association-specific info (onset, stage, sex etc) can stay at the lowest association level

This is another tricky part, as Associations have a lot of slots:

classes:
  Association:
    slots:
      - id
      - category
      - subject
      - original_subject
      - subject_namespace
      - subject_category
      - subject_closure
      - subject_label
      - subject_closure_label
      - subject_taxon
      - subject_taxon_label
      - predicate
      - object
      - original_object
      - object_namespace
      - object_category
      - object_closure
      - object_label
      - object_closure_label
      - object_taxon
      - object_taxon_label
      - primary_knowledge_source
      - aggregator_knowledge_source
      - negated
      - pathway
      - provided_by
      - provided_by_link
      - publications
      - qualifiers
      - has_evidence
      - evidence_count
      - frequency_qualifier
      - onset_qualifier
      - sex_qualifier
      - stage_qualifier
      - frequency_qualifier_label
      - frequency_qualifier_namespace
      - frequency_qualifier_category
      - frequency_qualifier_closure
      - frequency_qualifier_closure_label
      - onset_qualifier_label
      - onset_qualifier_namespace
      - onset_qualifier_category
      - onset_qualifier_closure
      - onset_qualifier_closure_label
      - sex_qualifier_label
      - sex_qualifier_namespace
      - sex_qualifier_category
      - sex_qualifier_closure
      - sex_qualifier_closure_label
      - stage_qualifier_label
      - stage_qualifier_namespace
      - stage_qualifier_category
      - stage_qualifier_closure
      - stage_qualifier_closure_label

So I'm not immediately sure what would be copy-paste and what would be changed, or what a good name for this secondary Association class would be. I'll mull that over a bit.
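One possible way to sort the slots into "copy as-is" and "rename", sketched in Python. It assumes every `object` slot has a `subject` twin (true of the list above) and that the pair collapses into a single `counterpart` slot; `split_slots` is a hypothetical helper and the list is a shortened subset of the Association slots.

```python
# Shortened subset of the Association slots listed above.
ASSOCIATION_SLOTS = [
    "id", "category",
    "subject", "original_subject", "subject_namespace", "subject_label",
    "predicate",
    "object", "original_object", "object_namespace", "object_label",
    "primary_knowledge_source", "negated", "onset_qualifier", "sex_qualifier",
]


def split_slots(slots: list[str]) -> tuple[list[str], list[str]]:
    """Return (unchanged slots, renamed counterpart slots)."""
    unchanged: list[str] = []
    renamed: list[str] = []
    for slot in slots:
        if "object" in slot:
            # Each object slot mirrors a subject slot, so one renamed copy suffices.
            continue
        if "subject" in slot:
            renamed.append(slot.replace("subject", "counterpart"))
        else:
            unchanged.append(slot)
    return unchanged, renamed


unchanged, renamed = split_slots(ASSOCIATION_SLOTS)
print(unchanged)
print(renamed)
```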

> If we want to keep the metadata for the searched entity, rather than having it as subject/object info in the lowest level, it could be at the level of the entity, so 'entity_category', 'entity_label', 'entity_closure' etc.

> Let's not include totals that can be computed by summing over lower-level totals to avoid confusion

These last two points may just get handled as a consequence of creating these new dataclasses to use, but the logic will definitely take some chewing on to get right.

oneilsh commented 11 months ago

Thanks for thinking it over :) I see what you are getting at - I'm asking for something that isn't a great fit for the existing data models. Creating new classes might be needed, but let me think a bit as well to see if there's a more data-modely way to organize what I'm trying for.

One easy answer:

> Or maybe it's as simple as "if entity is subject, counterpart_role is object, and vice versa"?

Yes, that's exactly what I was thinking :) But then, with a different data model this might be straightforward.
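A minimal sketch of that rule, assuming the searched entity's ID appears verbatim as the association's subject or object. It does not handle matches that come through closure, which may be how the search actually works; `to_counterpart_view` is a hypothetical helper, and the sample association is trimmed from the response shown earlier.

```python
from typing import Any


def to_counterpart_view(assoc: dict[str, Any], entity_id: str) -> dict[str, Any]:
    """Rewrite a subject/object association relative to the searched entity."""
    if assoc.get("subject") == entity_id:
        entity_side, other_side = "subject", "object"
    elif assoc.get("object") == entity_id:
        entity_side, other_side = "object", "subject"
    else:
        raise ValueError(f"{entity_id} is neither subject nor object")

    # If the entity is the object, the counterpart is the subject, and vice versa.
    view: dict[str, Any] = {"counterpart_role": other_side}
    for key, value in assoc.items():
        if entity_side in key:
            continue  # searched-entity info would be reported once, at the entity level
        view[key.replace(other_side, "counterpart")] = value
    return view


association = {
    "id": "uuid:c30c59d2-5d8c-11ee-9b27-2b20ed86a9d9",
    "subject": "HGNC:3603",
    "original_subject": "NCBIGene:2200",
    "subject_label": "FBN1",
    "predicate": "biolink:gene_associated_with_condition",
    "object": "MONDO:0017309",
    "original_object": "Orphanet:284979",
}

view = to_counterpart_view(association, "MONDO:0017309")
print(view)
```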

oneilsh commented 11 months ago

RE slots in classes - can slots be optional for some classes and required for others? I've been thinking about how to do a "light" return for common cases vs. a full-data return for everything-and-the-kitchen-sink. Consider a very straightforward use where I want to do a keyword search to get a list of Entities back: if I set full_info to True in the query I get everything, but if I set it to False I just get back ID and label. But for another type of query maybe the description slot is included no matter what.

Edit: I suppose this could be done with like an EntityMinimal and EntityFull that inherits from it...
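In plain Python terms, that idea might look like the sketch below. The `EntityMinimal` and `EntityFull` names and the `full_info` flag come from the comment above; the extra fields and the `build_entity` helper are assumptions for illustration, not an existing monarch-py API.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class EntityMinimal:
    id: str
    label: str


@dataclass
class EntityFull(EntityMinimal):
    # Extra slots are optional here; a real model would carry many more.
    description: Optional[str] = None
    category: Optional[str] = None


def build_entity(record: dict, full_info: bool) -> EntityMinimal:
    """Return the light or the full representation of one search hit."""
    if full_info:
        return EntityFull(
            id=record["id"],
            label=record["label"],
            description=record.get("description"),
            category=record.get("category"),
        )
    return EntityMinimal(id=record["id"], label=record["label"])


record = {
    "id": "MONDO:0017309",
    "label": "neonatal Marfan syndrome",
    "category": "biolink:Disease",
}
light = build_entity(record, full_info=False)
full = build_entity(record, full_info=True)
print(light)
print(full)
```

Since `EntityFull` inherits from `EntityMinimal`, callers that only need ID and label can treat both return types the same way.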

glass-ships commented 11 months ago

In the meantime I just opened a PR (#385) where I've just gone ahead and started making those required classes, and am in the process of changing up the logic

oneilsh commented 9 months ago

Asked @sagehrke to put a pin in this. My understanding of the KG data model has evolved, and I now better understand why this is a tricky ask. What I'm looking for may be better answered via the new neo4j deployment, and if what I come up with is more broadly useful, perhaps it can go to the API from there.

monarch-initiative / monarch-app

Endpoint for multi-entity association query #365