Open oneilsh opened 1 year ago
See #369
Heya @glass-ships, @kevinschaper, would it be possible to make the API results look more like the example in #270? Here's how the result currently looks (just the beginning, requesting a limit of 5 and offset of 0 for a few entities and categories):
[
  {
    "limit": 5,
    "offset": 0,
    "total": 1,
    "id": "MONDO:0017309",
    "name": "neonatal Marfan syndrome",
    "associated_categories": [
      {
        "limit": 20,
        "offset": 0,
        "total": 1,
        "counterpart_category": "biolink:Gene",
        "items": [
          {
            "id": "uuid:c30c59d2-5d8c-11ee-9b27-2b20ed86a9d9",
            "subject": "HGNC:3603",
            "original_subject": "NCBIGene:2200",
            "subject_namespace": "HGNC",
            "subject_category": "biolink:Gene",
            "subject_closure": [],
            "subject_label": "FBN1",
            "subject_closure_label": [],
            "subject_taxon": "NCBITaxon:9606",
            "subject_taxon_label": "Homo sapiens",
            "predicate": "biolink:gene_associated_with_condition",
            "object": "MONDO:0017309",
            "original_object": "Orphanet:284979",
            ...
A few notes:
We might want to rename the `offset` and `limit` params to make it clear that those are per entity/category group.
So, I'm thinking a search on `association/multi?entity=MONDO%3A0017309&entity=HP%3A0000973&counterpart_category=biolink%3AGene&counterpart_category=biolink%3APhenotypicFeature&limit=5&offset=0` might look something like:
{
  "offset_per_associated_category": 0,
  "limit_per_associated_category": 5,
  "entities": [
    {
      "entity": "MONDO:0017309",
      "entity_name": "neonatal Marfan syndrome",
      "original_entity": "Orphanet:284979",
      "entity_namespace": "MONDO",
      "entity_category": "biolink:Disease",
      "entity_closure": [
        "MONDO:0000001",
        "MONDO:0018230",
        ...
      ],
      "entity_label": "neonatal Marfan syndrome",
      "entity_closure_label": [
        "entity",
        "continuant",
        ...
      ],
      "entity_taxon": null,
      "entity_taxon_label": null,
      "associated_categories": [
        {
          "total": 1,
          "counterpart_category": "biolink:Gene",
          "associations": [
            {
              "id": "uuid:c30c59d2-5d8c-11ee-9b27-2b20ed86a9d9",
              "counterpart": "HGNC:3603",
              "original_counterpart": "NCBIGene:2200",
              "counterpart_namespace": "HGNC",
              "counterpart_category": "biolink:Gene",
              "counterpart_closure": [],
              "counterpart_label": "FBN1",
              "counterpart_closure_label": [],
              "counterpart_taxon": "NCBITaxon:9606",
              "counterpart_taxon_label": "Homo sapiens",
              "counterpart_role": "subject",
              "predicate": "biolink:gene_associated_with_condition",
              "primary_knowledge_source": "infores:orphanet",
              "aggregator_knowledge_source": [
                "infores:monarchinitiative"
              ],
              "category": "biolink:CorrelatedGeneToDiseaseAssociation",
              "negated": null,
              "provided_by": "hpoa_gene_to_disease_edges",
              "provided_by_link": {
                "id": "hpoa_gene_to_disease",
                "url": "https://monarch-initiative.github.io/monarch-ingest/Sources/hpoa/#gene_to_disease"
              },
              "publications": [],
              "qualifiers": [],
              "frequency_qualifier": null,
              "has_evidence": [],
              "onset_qualifier": null,
              "sex_qualifier": null,
              ... # other association-specific data
              "stage_qualifier_closure": [],
              "stage_qualifier_closure_label": []
            }
          ]
        },
        {
          "total": 43,
          "counterpart_category": "biolink:PhenotypicFeature",
          "associations": [
            ... (as above, showing only first 5)
          ]
        }
      ]
    },
    {
      "entity": "HP:0000973",
      "entity_name": "Cutis laxa (HPO)",
      ... (as above, including entity_namespace, entity_category, etc)
      "associated_categories": [
        {
          "total": 75,
          "counterpart_category": "biolink:Gene",
          "associations": [
            ... (as above, showing only 5)
          ]
        },
        {
          "total": 3,
          "counterpart_category": "biolink:PhenotypicFeature",
          "associations": [
            ... (as above, showing only 5)
          ]
        }
      ]
    }
  ]
}
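For reference, the repeated `entity` and `counterpart_category` parameters in the query above can be assembled with the standard library. This is only a sketch: `build_multi_query` is a hypothetical helper, not part of monarch-py, and the relative path is simply the one quoted in this issue.

```python
from urllib.parse import urlencode


def build_multi_query(entities, counterpart_categories, limit=5, offset=0):
    """Build the query string for the association/multi path quoted in this issue.

    Hypothetical helper for illustration; not a monarch-py function.
    """
    # A list of (key, value) pairs lets the same key repeat, once per value.
    params = [("entity", e) for e in entities]
    params += [("counterpart_category", c) for c in counterpart_categories]
    params += [("limit", limit), ("offset", offset)]
    # urlencode percent-encodes the ":" in CURIEs as %3A.
    return "association/multi?" + urlencode(params)


query = build_multi_query(
    ["MONDO:0017309", "HP:0000973"],
    ["biolink:Gene", "biolink:PhenotypicFeature"],
)
print(query)
```

Passing pairs rather than a dict keeps the parameter order stable and allows duplicates, which is what a multi-entity, multi-category request needs.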
I could possibly see some other minor UX improvements in the future, e.g. `closure` as a list of dict instead of `entity_closure` and `entity_closure_label` as two separate lists, but those would be for another day; here I'm just looking for the data model to match the intent of directionless-association search :)
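As a rough illustration of the "closure as a list of dict" idea, the two lists could be zipped into one. This sketch assumes the two lists are parallel (same length, same order), which this thread does not confirm; `pair_closure` and the `EX:` identifiers are made up for the example.

```python
def pair_closure(closure_ids, closure_labels):
    """Combine parallel id/label lists into a list of {"id", "label"} dicts.

    Assumption: both lists have the same length and ordering.
    """
    if len(closure_ids) != len(closure_labels):
        raise ValueError("closure ids and labels must have the same length")
    return [
        {"id": cid, "label": label}
        for cid, label in zip(closure_ids, closure_labels)
    ]


# Placeholder data; these are not real ontology terms.
entity = {
    "entity_closure": ["EX:0001", "EX:0002"],
    "entity_closure_label": ["first ancestor", "second ancestor"],
}
closure = pair_closure(entity["entity_closure"], entity["entity_closure_label"])
print(closure)
```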
> I'm not sure if we need to return the requested limit and offset to the user. But maybe doing so is standard in REST, like if defaults are used? In which case I would do so at the very top level only (there seems to be a bug as well, since it's reported at multiple levels, once as 5 as requested and once as 20)
It being reported at multiple levels is a side effect of the LinkML data model we use: both `MultiEntityAssociationResults` and `CategoryGroupedAssociationResults` extend the `Results` class, which contains required `limit`, `offset`, and `total` fields.
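A minimal sketch of that inheritance effect, using plain Python dataclasses as a stand-in for the LinkML-generated classes. The class names come from this thread; the field layout is a guess that mirrors the JSON in this issue, and the defaults on subclass fields are there only to satisfy dataclass field ordering.

```python
from dataclasses import asdict, dataclass, field
from typing import List


@dataclass
class Results:
    # Required on every subclass, which is why they show up at each level.
    limit: int
    offset: int
    total: int


@dataclass
class CategoryGroupedAssociationResults(Results):
    counterpart_category: str = ""
    items: List[dict] = field(default_factory=list)


@dataclass
class MultiEntityAssociationResults(Results):
    id: str = ""
    name: str = ""
    associated_categories: List[CategoryGroupedAssociationResults] = field(
        default_factory=list
    )


group = CategoryGroupedAssociationResults(
    limit=20, offset=0, total=1, counterpart_category="biolink:Gene"
)
result = MultiEntityAssociationResults(
    limit=5,
    offset=0,
    total=1,
    id="MONDO:0017309",
    name="neonatal Marfan syndrome",
    associated_categories=[group],
)
serialized = asdict(result)
print(serialized["limit"], serialized["associated_categories"][0]["limit"])
```

Because each level carries its own inherited `limit`, nothing forces the inner value to agree with the outer one, which matches the 5 vs. 20 mismatch seen in the current output.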
> We might want to rename the `offset` and `limit` params to make it clear that those are per entity/category group
Same thing here.
To address both these concerns, we could switch to creating entirely unique data classes for each of these that would deviate from the established `Results` model to which all other monarch-py responses adhere. For example, we'd need a secondary `Entities` class, and a secondary `Associations` class, each of which contains similar but distinctly different slots from their primary counterparts (since LinkML doesn't really allow the "renaming" of slots depending on what class they're a part of; @kevinschaper feel free to correct me if I'm wrong on that last point).
But we definitely could.
> Maybe rename 'items' to associations?
This should be doable without the above-mentioned changes.
> I don't think we should report subject and object info in the items, which requires logic to determine if the info I'm looking for is in the subject or the object (which depends on the direction of the association, and repeats a lot of info). Rather, we should report just the counterpart info; we could call it 'counterpart', 'counterpart_label', etc. to make it clear. We could maybe add info about the direction, for example if the predicate is 'biolink:gene_associated_with_condition' then the gene is the subject, so we could report something like 'counterpart_role': 'subject'.
This is a tricky part [for me, anyway], as I don't know the best way to generalize figuring out the "direction", unless we have a complete set of all predicates in the Monarch KG and their predefined directions.
Or maybe it's as simple as "if entity is subject, counterpart role is object, and vice versa"?
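The "if entity is subject, counterpart is object, and vice versa" rule could be sketched roughly as follows. `to_counterpart_view` is a hypothetical helper, not existing monarch-py code. It assumes the queried entity ID matches the association's `subject` or `object` exactly, so a match that only occurs through a closure is not handled here.

```python
def to_counterpart_view(association, entity_id):
    """Rewrite a subject/object association from the queried entity's viewpoint.

    Hypothetical sketch: assumes an exact ID match on subject or object.
    """
    if association.get("subject") == entity_id:
        entity_side, counterpart_side = "subject", "object"
    elif association.get("object") == entity_id:
        entity_side, counterpart_side = "object", "subject"
    else:
        raise ValueError(f"{entity_id} is neither subject nor object")

    view = {"counterpart_role": counterpart_side}
    for key, value in association.items():
        parts = key.split("_")
        if entity_side in parts:
            # Entity-side metadata would live at the entity level instead.
            continue
        if counterpart_side in parts:
            # e.g. subject_label -> counterpart_label,
            #      original_subject -> original_counterpart
            key = "_".join(
                "counterpart" if p == counterpart_side else p for p in parts
            )
        view[key] = value
    return view


association = {
    "id": "uuid:c30c59d2-5d8c-11ee-9b27-2b20ed86a9d9",
    "subject": "HGNC:3603",
    "original_subject": "NCBIGene:2200",
    "subject_label": "FBN1",
    "predicate": "biolink:gene_associated_with_condition",
    "object": "MONDO:0017309",
    "original_object": "Orphanet:284979",
}
view = to_counterpart_view(association, "MONDO:0017309")
print(view)
```

No predicate lookup table is needed under this rule, since the role is derived only from which side the queried entity sits on.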
> All the other association-specific info (onset, stage, sex etc) can stay at the lowest association level
This is another tricky part, as Associations have a lot of slots:
So I'm not immediately sure what would be copy-paste and what would be changed, or what a good name for this secondary Association class would be. I'll mull that over a bit.
> If we want to keep the metadata for the searched entity, rather than having it as subject/object info in the lowest level, it could be at the level of the entity, so 'entity_category', 'entity_label', 'entity_closure' etc.

> Let's not include totals that can be computed by summing over lower-level totals to avoid confusion
These last two points may just get handled as a consequence of creating these new dataclasses to use, but the logic will definitely take some chewing on to get right.
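On the totals point, a client that wants a per-entity total can derive it from the per-category totals. This is a small sketch against the proposed response shape from this issue (not the current API output); `entity_total` is a hypothetical helper and the numbers are the ones from the proposed example.

```python
def entity_total(entity):
    """Sum the per-category totals for one entity in the proposed response shape."""
    return sum(group["total"] for group in entity["associated_categories"])


# Trimmed-down version of the proposed response, keeping only the totals.
response = {
    "entities": [
        {
            "entity": "MONDO:0017309",
            "associated_categories": [
                {"total": 1, "counterpart_category": "biolink:Gene"},
                {"total": 43, "counterpart_category": "biolink:PhenotypicFeature"},
            ],
        },
        {
            "entity": "HP:0000973",
            "associated_categories": [
                {"total": 75, "counterpart_category": "biolink:Gene"},
                {"total": 3, "counterpart_category": "biolink:PhenotypicFeature"},
            ],
        },
    ]
}
totals = {e["entity"]: entity_total(e) for e in response["entities"]}
print(totals)
```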
Thanks for thinking it over :) I see what you are getting at - I'm asking for something that isn't a great fit for the existing data models. Creating new classes might be needed, but let me think a bit as well to see if there's a more data-modely way to organize what I'm trying for.
One easy answer:
> Or maybe it's as simple as "if entity is subject, counterpart role is object, and vice versa"?
Yes, that's exactly what I was thinking :) But then, with a different data model this might be straightforward.
RE slots in classes: can slots be optional for some classes and required for others? I've been thinking of how to do a "light" return for common cases vs. a full-data return for everything-and-the-kitchen-sink. Consider a very straightforward use where I want to do a keyword search to get a list of Entities back: if I set `full_info` to True in the query I get everything, but if I set it to False I just get back ID and label. But for another type of query maybe the `description` slot is included no matter what.
Edit: I suppose this could be done with like an EntityMinimal and EntityFull that inherits from it...
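One way the `EntityMinimal` / `EntityFull` idea might look, sketched with plain dataclasses rather than LinkML. Only `id`, `label`, `description`, and the `full_info` flag come from this thread; the other fields and the `serialize` helper are hypothetical.

```python
from dataclasses import asdict, dataclass, field
from typing import List, Optional


@dataclass
class EntityMinimal:
    id: str
    label: Optional[str] = None


@dataclass
class EntityFull(EntityMinimal):
    # Extra slots beyond the minimal view; names other than
    # "description" are illustrative only.
    description: Optional[str] = None
    category: Optional[str] = None
    closure: List[str] = field(default_factory=list)


def serialize(entity: EntityFull, full_info: bool) -> dict:
    """Return the full record, or only the minimal slots when full_info is False."""
    if full_info:
        return asdict(entity)
    return asdict(EntityMinimal(id=entity.id, label=entity.label))


full = EntityFull(
    id="MONDO:0017309",
    label="neonatal Marfan syndrome",
    category="biolink:Disease",
)
light = serialize(full, full_info=False)
everything = serialize(full, full_info=True)
print(light)
```

A query type that always wants `description` could use a third class between the two, at the cost of more classes to maintain.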
In the meantime I just opened a PR (#385) where I've just gone ahead and started making those required classes, and am in the process of changing up the logic
Asked @sagehrke to put a pin in this; my understanding of the KG data model has evolved and I now better understand why this is a tricky ask. What I'm looking for may be better suited to answering via the new neo4j deployment, and if what I come up with is more broadly useful, perhaps it can go to the API from there.
Now that we have a multi-entity association query on the backend, we need to expose it via the api-v3.monarchinitiative.org API :D Ref https://github.com/monarch-initiative/monarch-app/issues/270