Resolve the "grouping classes don't get returned" problem

sierra-moxon commented 7 months ago

Given the phenotypic profile here (which is a list of phenotypes for various forms of Cerebral Palsy):

HP:0001258, HP:0001371, HP:0100543, HP:0001249, HP:0002650,HP:0001347,HP:0001250,HP:0001263,HP:0001249,HP:0002650,HP:0001252,HP:0001251,HP:0007010

When I ask for diseases that match this phenotypic profile, I don't get "Cerebral Palsy" back as an answer even though cerebral palsy seems like a "grouping term" over two distinct phenotype sets (e.g. hypotonia and spasticity) which may or may not be the distinguishing factor of the subclasses of CP.

I am given the option to choose a particular disease that I want to match against, but "Cerebral Palsy" isn't in the pull-down.

This same experiment was conducted with phenotypes of Ehlers-danlos and Marfan and Marfan Related Disorder with the same results. I can't get the "grouping classes" to be returned in the phenogrid, nor can I tell it to give me the results to one of these grouping classes specifically (because they aren't available in the pull-down -- but also because they don't come back from the semsimian API endpoint that this repo wraps with TRAPI).

We did some analysis of this with @kevinschaper @RichardBruskiewich @justaddcoffee and @mbrush in slack and concluded "because the collection of phenotypes for each mondo term is coming from direct annotations only, and the grouping terms don’t have direct annotations, they can’t be expanded into a phenotype list for search, and they can’t be found as a match. This question came up on a mondo outreach call too." Kevin's huddle slides on this.

From that discussion on slack: Is there a reason why we wouldn’t want CP as a grouping term to just inherit annotations on its descendants? Should we make some kind of subset which is ok for propagation, which maybe is also a diagnosable subset? Right now the monarch-kg build process adds the direct associations to phenio.db, so it could also be that we just add additional association rows in that process so that semsimian doesn't need to deal with it. Maybe we would need to extend the schema with a boolean for whether an association is direct or not

If we can identify at least an initial set of appropriate grouping classes, should we work to roll phenotypes up?

Should this be transparent to Semsimian, or should it the algorithm approach annotations to a subtype differently?

sierra-moxon commented 7 months ago

In general, we should be able to take phenotypes from phenopackets from real-world patients diagnosed with "Cerebral Palsy" or "Marfan Syndrome" and get back those diagnosed diseases from semsimian. Right now that isn't working because of this "grouping issue" (for lack of a better word).

To get some tests in place, we need to process a bunch of Phenopackets from this repo: https://github.com/monarch-initiative/phenopacket-store/tree/main/notebooks, extracting HP terms and Mondo terms, feeding them into our semsimian API and testing whether the diagnosed disease is in the top 10 results from semsimian.

sierra-moxon commented 7 months ago

I like @kevinschaper's identification of propagating descendent annotations up the hierarchy as long as the parent is in a "diagnoseable subset"

kevinschaper commented 7 months ago

The other question that occurred to me is, are we avoiding the roll up to parent classes only for performance reasons, or are we also trying to avoid returning matches that would seem strange or nonsensical?

If it's strictly performance related, I wonder if we could allow rolling up as long as the distinct phenotype term count stays below a certain number.

RichardBruskiewich commented 7 months ago

The medical profession obviously determined what (phenotypic and genotypic) feature characterize a given disease, although this may often be fuzzy, with significant variations between affected individuals; perhaps a weighted sum across the range of all such features - from the child classes - constitutes the “diagnoseable subset”.

matentzn commented 7 months ago

This is a very cool issue! Actually, this is a real scientific, creative, non-trivial thing to solve. Thanks for opening it.

First of all I have to say that without thinking, and without caring about whether something is a grouping or not, I would like groupings to be returned one way or another - even if the grouping is a "non-diagnosable" disease as they are called sometimes (cardiovascular disorder). Now I also agree that these groupings need to be clearly visually separated from the "pure profile matches". But the exact way to get these in is very debatable.

d2p propagation. As you suggest, one way is to propagate phenotype annotations. I am not sure if this is ideal, but could be. We all know: ontological relations travel down the graph, annotations up the graph (very confusing). So you could, I guess, associate every class with all phenotypes of its descendants and say "inferred" before running the match.
Another is to just use the glass graph. You could, for example, first match the disease, and then show parents up to a certain point in a little widget right next to the concrete match. This is easy to implement, and less "misleading" as it clearly separates the genuine matches from the "parent matches". The downside is that you dont see the cute coloured squares.

The other question that occurred to me is, are we avoiding the roll up to parent classes only for performance reasons, or are we also trying to avoid returning matches that would seem strange or nonsensical?

The obvious downside for d2p propagation (other than the performance issues) is that certain diseases will get large inferred profiles. These probably wont match quite well. But - maybe not as bad in the end, as we dont want parent classes to match as well as their children. In fact, you could even penalise the profile similarity when inferred d2ps are used, but that could be hard to generalise and implement. I am not that worried about non-sensical results as long as we can label them somehow as "inferred", and manually curate in Mondo itself an exclusion list.

Restricting the roll-up to specific parent child combinations or "diagnosable" diseases will be a quite major research thing, but it could be approximated by saying: ICD10 and OMIMPS is ok by default, and for the rest we need rules.

Not sure. Are there any other ideas beside d2p roll-up and simple inclusion of parents in the display?

cmungall commented 7 months ago

The majority of this should be at the UI level

if the user selects a disease entity with subtypes, roll up the phenotypes, but make it transparent
when showing results, as a post-processing or UI step highlight diseases that are in a group

I don't think the algorithms or API should get ahead of the Mondo group or annotation. If Monarch has chosen to annotate phenotypes to a particular Mondo class (directly or indirectly e.g via OMIM) then they have decided that is the entity, and not the superclass. I think there is room for this group to improve procedures here (e.g. Bardet Biedl) but it's best not to get ahead

kevinschaper commented 7 months ago

Thinking about how to avoid this brick wall:

A small first step could be that rather than completely rejecting Cerebral Palsy because it has no direct phenotypes, we could just allow for populating phenotypes of descendants, but have a quantity limit and provide a message instead like "Please choose a more specific disease." when the count is above some number (maybe 1000? which might be painfully slow, maybe 500? the largest diseases I believe are in the ~300 range)

cmungall commented 7 months ago

this is also why it's not good to get ahead of mondo....

kevinschaper commented 7 months ago

I had my browser tab open without a refresh for a while before replying earlier today and totally missed your suggestions @cmungall.

Right now the UI owns the expansion of a disease into phenotypes, so it could minimally also show the list of specific diseases that the phenotypes came from. (Just two parallel list, not a full breakdown)

That would cover searching by and comparing non-leaf/grouping diseases pretty easily, I assume that it would take something at the algorithm level to return these higher level disease terms as a match? I also don't have a strong sense of how important this use case is.

matentzn commented 7 months ago

Github issue is terrible for this kind of discussion, as you cant really respond to individual comments.

The majority of this should be at the UI level

It seems you (@cmungall) dont see any value to match against grouping classes - fair enough. I should not get involved with “domain thinking” - my instinct would have told me that roll-ups can be value beyond subtype->maintype cases.

Does this mean we need to introduce some "metamodelling" again and connect diseases with their subtypes, make sure this informations gets into the API, and have the UI code coll up phenotypes along subtype->maintype relations (but not beyond)?

@kevinschaper

Right now the UI owns the expansion of a disease into phenotypes, so it could minimally also show the list of specific diseases that the phenotypes came from. (Just two parallel list, not a full breakdown)

What does this mean? Maybe a concrete example would be useful?

monarch-initiative / monarch-trapi-kp

Resolve the "grouping classes don't get returned" problem #6