related-sciences / nxontology-ml

Machine learning to classify ontology nodes
Apache License 2.0

Build an EFO term "precision" classification pipeline #2

Open eric-czech opened 1 year ago

eric-czech commented 1 year ago

There are several EFO term classifications that could be useful. I propose we start with trying to assign a certain precision to terms based on the following definitions:

note: More examples like this are given in https://github.com/related-sciences/nxontology-ml/issues/3.

I like this description of the task and these names/definitions more than the disease "subtype", "root" and "area" idea we had used before internally, and it better captures what I was initially trying to accomplish with that work anyhow. I'm certainly open to discussing it more though.

We can use some of the labels we already have to bootstrap this effort and I would say the next steps are:

  1. Include an LLM-derived classification of this label as a feature to be used in conjunction with ontology features to make a final classification
  2. Explore options for modeling in this task beyond applying a GBM
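
As a rough illustration of step 1, the LLM-derived label could be one-hot encoded and merged with numeric ontology features into a single feature row. This is only a sketch: the function name `build_feature_row`, the `llm_`/`onto_` prefixes, and the example features `n_descendants` and `depth` are assumptions made for illustration, not the project's actual feature set.

```python
# Precision labels follow the low/medium/high scale discussed in this thread.
PRECISION_LEVELS = ("low", "medium", "high")


def build_feature_row(llm_label, ontology_features):
    """Merge a one-hot encoded LLM label with numeric ontology features."""
    if llm_label not in PRECISION_LEVELS:
        raise ValueError(f"unknown precision label: {llm_label!r}")
    # One column per precision level, set to 1 for the LLM's choice.
    row = {f"llm_{level}": int(level == llm_label) for level in PRECISION_LEVELS}
    # Prefix ontology features so they cannot collide with the LLM columns.
    row.update({f"onto_{name}": value for name, value in ontology_features.items()})
    return row


# Made-up feature values for a single, unnamed term (hypothetical data).
row = build_feature_row("medium", {"n_descendants": 12, "depth": 4})
print(row)
```

Rows of this shape could then be passed to a GBM or to any alternative model explored under step 2.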

I'll add some more details on those steps in related issues.

TODO:

dhimmel commented 1 year ago

I also prefer the proposed disease precision scale of low/medium/high to the former scale of area/root/subtype, since it avoids any confusion with the term "root". The proposed definitions are helpful and, as noted, some examples would enrich them, as would a review of some terms that do not fit cleanly within a single category.

Since our use case at Related Sciences is ultimately for drug development, these definitions are anchored to clinical characteristics and specificity. I imagine there might be some more non-clinical characteristics of each precision level that could enrich and solidify the definitions to support broader applications.

Looping in @DnlRKorn @matentzn @zoependlington @nicolevasilevsky @d0choa in case you have feedback on whether classifying diseases based on precision would be useful and whether the low/medium/high scale and definitions make sense. This classification is something we're initially planning to perform on EFO but could extend to MONDO as well.

matentzn commented 1 year ago

@dhimmel I think from our perspective, the most significant distinction is between "true diagnosable disease" and "disease grouping"! This is sort of related to the "precision" mentioned here, but maybe needs other kinds of evidence, like "mentions in PubMed" etc.

dhimmel commented 1 year ago

> the most significant distinction is between "true diagnosable disease" and "disease grouping"

Thanks @matentzn for weighing in. I think a true diagnosable disease might be the union of the medium and high precision buckets, while disease grouping would be low. We could also create an additional 2-class outcome, predicted from the same feature set we create, which should include features such as publication mentions.
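
A minimal sketch of that mapping, assuming the tentative correspondence above holds (low is a grouping, medium and high are diagnosable diseases). The class names `disease_grouping` and `diagnosable_disease` and the function name `to_two_class` are placeholders chosen for illustration.

```python
# Tentative collapse of the 3-level precision scale into a 2-class outcome.
# The class names are hypothetical labels, not an agreed-upon vocabulary.
TWO_CLASS = {
    "low": "disease_grouping",
    "medium": "diagnosable_disease",
    "high": "diagnosable_disease",
}


def to_two_class(precision):
    """Map a low/medium/high precision label to the 2-class outcome."""
    try:
        return TWO_CLASS[precision]
    except KeyError:
        raise ValueError(f"unknown precision label: {precision!r}") from None


labels = [to_two_class(p) for p in ("low", "medium", "high")]
print(labels)
```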

Linking a related issue at https://github.com/monarch-initiative/mondo/issues/685.

Also, I notice that EFO:0000574 / lymphoma has two subsets according to the EBI OLS browser: disease_grouping and ordo_group_of_disorders. @zoependlington or others: where are these subsets defined, how are they assigned, and is there more documentation on them?

matentzn commented 1 year ago

Most of these come from Mondo and are a consequence of the metamodelling of ontologies aligned with Mondo. For example, there is a group of ORDO classes explicitly defined as groupings in ORDO, which make up that subset. For the more general disease_grouping subset, I think this was a fairly incomplete attempt to manually curate disease groupings. @nicolevasilevsky would know best!

eric-czech commented 10 months ago

Linking https://github.com/related-sciences/nxontology-ml/pull/5, which added the labeled data from EFO v3.43.0

eric-czech commented 10 months ago

Related to https://github.com/related-sciences/nxontology-ml/issues/13, I wanted to add two visuals that clearly communicate something about why we did this originally:

  1. A list of terms by precision level

[Screenshot: Screen Shot 2023-09-20 at 4 09 38 PM]

Code: random_efo_term_samples.ipynb

  2. What happens when you use this to remove the most general terms in EFO

[Screenshot: Screen Shot 2023-09-16 at 3 30 58 PM]

As a graph, EFO is very difficult to work with. Removing the most general terms (i.e. precision=low) greatly improves our ability to segregate out more clinically relevant disease hierarchies/subgraphs.

Note: "post-processing" in this case means creating a new graph with all nodes and edges for low precision terms removed.
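
The post-processing step can be sketched in plain Python as below. This is an illustrative toy, not the code behind the screenshot: the graph, the node names (`root`, `a`, `a1`, `b`, `b1`), and their precision labels are all made up, and the helper names are hypothetical. It drops every low precision node together with its edges, then counts weakly connected components to show how subgraphs separate.

```python
def drop_low_precision(edges, precision):
    """Return the nodes and edges that remain once low precision nodes are removed."""
    keep = {node for node, level in precision.items() if level != "low"}
    kept_edges = [(p, c) for p, c in edges if p in keep and c in keep]
    return keep, kept_edges


def connected_components(nodes, edges):
    """Group nodes into weakly connected components (edge direction ignored)."""
    neighbors = {node: set() for node in nodes}
    for parent, child in edges:
        neighbors[parent].add(child)
        neighbors[child].add(parent)
    seen, components = set(), []
    for start in sorted(nodes):
        if start in seen:
            continue
        stack, component = [start], set()
        while stack:
            node = stack.pop()
            if node not in component:
                component.add(node)
                stack.extend(neighbors[node] - component)
        seen |= component
        components.append(component)
    return components


# Toy graph with made-up labels: one general root above two disease branches.
precision = {"root": "low", "a": "medium", "a1": "high", "b": "medium", "b1": "high"}
edges = [("root", "a"), ("a", "a1"), ("root", "b"), ("b", "b1")]

before = connected_components(set(precision), edges)
after = connected_components(*drop_low_precision(edges, precision))
print(len(before), len(after))
```

In this toy case the single connected graph splits into two separate branches once the low precision root is removed.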

dhimmel commented 10 months ago

> A list of terms by precision level

Nice, very helpful. What do you think about recreating the classification example table, but with each row containing a low, medium, high triplet, such that the medium term is a descendant of the low term and the high precision term is a descendant of the medium term? This would exclude some terms from being selected, for example medium terms with no low ancestors.
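
One possible way to enumerate such triplets, sketched in plain Python under the assumption that the ontology is available as a parent-to-children mapping. The toy hierarchy, its labels, and the helper names `descendants` and `precision_triplets` are hypothetical.

```python
def descendants(children, node):
    """Collect all descendants of a node given a parent -> children mapping."""
    found, stack = set(), list(children.get(node, ()))
    while stack:
        current = stack.pop()
        if current not in found:
            found.add(current)
            stack.extend(children.get(current, ()))
    return found


def precision_triplets(children, precision):
    """List (low, medium, high) triplets where each term descends from the previous one."""
    triplets = []
    for low in sorted(n for n, level in precision.items() if level == "low"):
        for medium in sorted(descendants(children, low)):
            if precision.get(medium) != "medium":
                continue
            for high in sorted(descendants(children, medium)):
                if precision.get(high) == "high":
                    triplets.append((low, medium, high))
    return triplets


# Toy hierarchy with made-up labels; "orphan_medium" has no low ancestor.
children = {
    "group": ["disease"],
    "disease": ["subtype_1", "subtype_2"],
    "orphan_medium": ["subtype_3"],
}
precision = {
    "group": "low",
    "disease": "medium",
    "subtype_1": "high",
    "subtype_2": "high",
    "orphan_medium": "medium",
    "subtype_3": "high",
}

triplets = precision_triplets(children, precision)
print(triplets)
```

Here `orphan_medium` and its descendant never appear in a triplet, which is the exclusion effect described above. Rows for the table could then be sampled from the resulting list.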