Prepare the MONDO outreach presentation materials

dhimmel commented 1 year ago

On 2023-09-22, we're giving a presentation at the Mondo Workshops / Outreach Call. The current title is "Classifying EFO/MONDO diseases as areas, roots, or subtypes". Daniel Korn from Every Cure will also present, so we should plan to not exceed 25 minutes.

We can use this issue to coordinate the slides and materials.

dhimmel commented 1 year ago

Some slides / topics for the presentation

Background on the task and desired classification levels https://github.com/related-sciences/nxontology-ml/issues/2
First application is EFO for OpenTargets integration, but can be easily adapted to MONDO
Subsets are helpful but not exhaustive https://github.com/related-sciences/nxontology-ml/issues/14
Features
- ontology features
- text-derived features: LLMs / embeddings
- anything else
performance
data and repo availability
nxontology suite introduction

dhimmel commented 1 year ago

@yonromai it probably makes sense to tailor the next week of work to make sure we have an MVP dataset and analysis for the presentation.

I think the main things we're missing are:

https://github.com/related-sciences/nxontology-ml/issues/14
including gpt4 classifications as features in the model. Given that the cost is not so high, we can query all terms now, even if we think we're going to improve the prompts going forward and rerun everything.
store the classifications from the final model
with the final model, some analyses of feature importance, so we can give a general idea of which features matter most.

How does this sound? If it seems infeasible, we can restrict the scope of the MVP.

yonromai commented 1 year ago

@yonromai it probably makes sense to tailor the next week of work to make sure we have an MVP dataset and analysis for the presentation.

Definitely!

I think the main things we're missing are:

Include EFO subsets as features for model #14

This issue fell through the cracks on my end; I would probably need more context on what's involved.

including gpt4 classifications as features in the model. Given that the cost is not so high, we can query all terms now, even if we think we're going to improve the prompts going forward and rerun everything.

Sounds good.

store the classifications from the final model

Right - it's been bothering me that I have left all the data artifacts out so far. Do you use tools like Git LFS and/or DVC internally? (cc: @ravwojdyla)

with the final model, some analyses of feature importance, so we can give a general idea of which features matter most.

Sure! In terms of feature importance, does something like this (see "Example of run") work?

How does this sound? If it seems infeasible, we can restrict the scope of the MVP.

Good! Do you think you'd have some time tomorrow / early next week for a quick live catch up on this ^ ?

dhimmel commented 1 year ago

it's been bothering me that I have left all the data artifacts out so far

For this repo, we'll want to use a method where the data is easily available to a public user, preferably integrated with the repo. I like Git LFS, but the GitHub billing for LFS can be excessive. The method we use for ensembl-genes and nxontology-data is to commit the data directly to an output branch without LFS. This works if all datasets are < 100 MB, which I assume will be the case here. Kind of hacky, but it gets you version control, gratis storage, and good forkability and community accessibility.
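As a rough illustration of the size constraint, here is a minimal sketch that lists files too large to commit without LFS. The function name, the `output` directory, and the exact byte threshold are assumptions for illustration; GitHub's per-file block applies at 100 MiB.

```python
from pathlib import Path

# Assumed threshold: GitHub blocks pushes of individual files larger than 100 MiB.
GITHUB_FILE_LIMIT_BYTES = 100 * 1024 * 1024


def oversized_files(directory, limit=GITHUB_FILE_LIMIT_BYTES):
    """Return the files under `directory` whose size exceeds `limit` bytes."""
    return sorted(
        path
        for path in Path(directory).rglob("*")
        if path.is_file() and path.stat().st_size > limit
    )
```

Calling `oversized_files("output")` before committing to the output branch would return an empty list when every dataset fits, where `output` is a hypothetical directory name.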

Sure! In terms of feature importance, does something https://github.com/related-sciences/nxontology-ml/pull/7 work?

Yes that works, especially if those importance values can be aggregated, so we can show groupings of features.
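A minimal sketch of that aggregation, assuming feature names carry a group prefix separated by `__`. The naming scheme, the feature names, and the importance values are all hypothetical and not taken from the repo.

```python
from collections import defaultdict


def group_by_prefix(feature):
    """Map a feature name like 'ontology__depth' to its group 'ontology'."""
    return feature.split("__", 1)[0]


def aggregate_importance(importances, group_of=group_by_prefix):
    """Sum per-feature importance values into per-group totals, largest first."""
    totals = defaultdict(float)
    for feature, value in importances.items():
        totals[group_of(feature)] += value
    return dict(sorted(totals.items(), key=lambda item: item[1], reverse=True))


# Hypothetical importance values, for illustration only.
importances = {
    "ontology__n_descendants": 30.0,
    "ontology__depth": 10.0,
    "embedding__dim_0": 12.5,
    "embedding__dim_1": 7.5,
    "gpt4__precision_label": 40.0,
}
grouped = aggregate_importance(importances)
```

Summing is only meaningful when the underlying importance measure is additive across features, so the right aggregation depends on how the model reports importance.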

Do you think you'd have some time tomorrow

Yes, let's find a time in slack.

yonromai commented 1 year ago

For this repo, we'll want to use a method where the data is easily available to a public user, preferably integrated with the repo. I like Git LFS, but the GitHub billing for LFS can be excessive. The method we use for ensembl-genes and nxontology-data is to commit the data directly to an output branch without LFS. This works if all datasets are < 100 MB, which I assume will be the case here. Kind of hacky, but it gets you version control, gratis storage, and good forkability and community accessibility.

Right, usually I prefer to avoid putting large data files in git since (1) it makes git increasingly slow and (2) it becomes more likely to hit the 1GB GH size limit (unless erasing data file history). If the checked-in data is small and fairly static, then neither of these is an issue.

When checking in experiments, training sets, and output models (to increase reproducibility and preserve history), it becomes more of a problem. Using a solution backed by something like S3 or GCS could work long term but is definitely too much work / out of scope for now. If history/reproducibility aren't super important, it's probably okay to keep being mindful about which datasets are persisted and only check in essential datasets (like efo_otar_slim_v3.43.0_rs_classification.tsv).

=> I agree that checking the "final" model in git is the way to go forward at this point in time.

Yes that works, especially if those importance values can be aggregated, so we can show groupings of features.

Cool! Let's discuss all these points tomorrow

dhimmel commented 1 year ago

I put up a rough outline and template slide deck at https://slides.com/dhimmel/efo-disease-precision.

matentzn commented 1 year ago

@dhimmel nice to prepare the seminar in the open. Just FYI, some people in Monarch have already asked that we look at some concrete examples of diseases where it is hard to classify them as grouping (area) or proper disease. Will you mention a few examples of clear-cut subtypes, groupings, and diseases, and a few questionable ones for debate?

cmungall commented 1 year ago

Just a quick note on the difficulty of establishing ground truth here. I don't have time for a full summary, but see this as a quick literature proxy (usual caveats about AI apply):

https://www.perplexity.ai/search/Is-Parkinson-disease-G_cC1TZOQ121Q2ZWmKjCZw?s=c

eric-czech commented 1 year ago

@dhimmel one of these might be worth including: https://github.com/related-sciences/nxontology-ml/issues/2#issuecomment-1728373390

I have no strong opinion one way or the other, but +1 to having a sampling of terms by precision label somewhere after https://github.com/related-sciences/nxontology-ml/issues/30.

dhimmel commented 1 year ago

@eric-czech the one area that I don't feel fully confident in presenting is how the initial RS labels that we use for training were calculated. Would you be able to jot down a few notes on that process here? I could then copy that to a slide and then you could present that slide.

eric-czech commented 1 year ago

a few notes on that process

Certainly. The process went like this:

Start with an empty set of annotations, one for each EFO term and precision class (low, medium, high)
Repeat this loop:
- Curate terms from these heuristic functions to bootstrap some positive / negative examples by class:
  - low: Has very low intrinsic IC
  - low: Has very high aggregate Open Targets evidence
  - ¬low: Has clinical trials associated
  - medium: Has a lot of descendants with many of the same tokens in the label
    - e.g. familial hypertrophic cardiomyopathy -> hypertrophic cardiomyopathy {2,4,10,11,12,17,20,26}
  - medium: Has an OMIM PS xref and has descendants
- Add the above to the set of annotations
- Run these automatic labeling functions given the existing set of annotations:
  - high: Is a terminal node descending from an OMIM PS term
  - high: Is a terminal node descending from an existing medium precision term
  - low: Ends with neoplasm and has cancer descendants
  - ¬low: Has no descendants
- Run separate binary LightGBM models for each class and propose new terms for them
- Manually review the top predictions for each class w/o a current label
- Append new terms to the existing set of annotations
At any time, override existing annotations based on manual review

I went through this loop ~5 times and stopped once the top predictions for the low precision terms started to become more difficult to distinguish from medium precision terms.
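To make a few of these labeling functions concrete, here is a minimal sketch on a toy ontology. Everything in it is an assumption for illustration rather than the code actually used: the `CHILDREN` graph, the function names, the precedence between rules, reading "has cancer descendants" as "some descendant label contains the word cancer", and the intrinsic IC formula (a Seco-style definition, which may differ from the one applied here).

```python
import math

# Hypothetical toy ontology: each term maps to its direct children.
CHILDREN = {
    "disease": ["cancer", "cardiomyopathy"],
    "cancer": ["lung neoplasm"],
    "lung neoplasm": ["lung cancer"],
    "lung cancer": [],
    "cardiomyopathy": ["hypertrophic cardiomyopathy"],
    "hypertrophic cardiomyopathy": ["familial hypertrophic cardiomyopathy"],
    "familial hypertrophic cardiomyopathy": [],
}


def descendants(term):
    """Return all strict descendants of a term."""
    found = set()
    stack = list(CHILDREN[term])
    while stack:
        node = stack.pop()
        if node not in found:
            found.add(node)
            stack.extend(CHILDREN[node])
    return found


def intrinsic_ic(term):
    """Assumed Seco-style intrinsic IC: 1 - log(n_descendants + 1) / log(n_terms)."""
    return 1 - math.log(len(descendants(term)) + 1) / math.log(len(CHILDREN))


def auto_label(annotations):
    """Propose labels for unannotated terms from three of the automatic rules."""
    medium_terms = [t for t, label in annotations.items() if label == "medium"]
    proposed = {}
    for term in CHILDREN:
        if term in annotations:
            continue  # never overwrite an existing annotation
        desc = descendants(term)
        if not desc:
            # Terminal node: high if under a medium term, otherwise only "not low".
            under_medium = any(term in descendants(m) for m in medium_terms)
            proposed[term] = "high" if under_medium else "not low"
        elif term.endswith("neoplasm") and any("cancer" in d for d in desc):
            proposed[term] = "low"
    return proposed


labels = auto_label({"hypertrophic cardiomyopathy": "medium"})
```

In this sketch the root term gets an intrinsic IC of 0 and every leaf gets 1, so a "very low intrinsic IC" heuristic would amount to thresholding `intrinsic_ic` near 0; the threshold itself is not stated above.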

dhimmel commented 1 year ago

Thanks @yonromai and @eric-czech for the help presenting. We can leave this issue up until the recording is online and we add a link to the recording/slides to the README.

Copying the zoom chat log here from today, since there are some good suggestions that we should follow up on.

11:58:38 From Sarah G. To Everyone:
    https://www.slido.com/
11:58:47 From Nicole Vasilevsky To Everyone:
    https://app.sli.do/event/4m6g4VAFFpHsAQTWwVAjve
12:13:41 From Chris Mungall To Everyone:
    We should talk about synergies with oaklib (but not on this call!)
12:14:01 From trish To Everyone:
    Reacted to "We should talk about..." with 👍
12:14:09 From Sarah G. To Everyone:
    Reacted to "We should talk about..." with 👍
12:23:51 From Chris Mungall To Everyone:
    Some of the text-based features could be replaced by axioms (e.g. “ends with neoplasm”) … this is awesome!
12:25:23 From Chris Mungall To Everyone:
    @Nicole I don’t know if we are ready to show our design patterns (or just the tags). The meta class is the most powerful feature I’d expect
12:26:42 From Nicole Vasilevsky To Everyone:
    Here are the design patterns: https://mondo.readthedocs.io/en/latest/editors-guide/patterns/
12:26:50 From Chris Mungall To Everyone:
    I would hope our prefixes are standardized!
12:29:02 From Joe Flack To Everyone:
    are the nxontology repos private?
12:29:14 From Eric Sid To Daniel Himmelstein(Direct Message):
    Hi Daniel, great talk so far - would you be able to share your slides afterwards?
12:32:58 From Sarah G. To Everyone:
    https://www.slido.com/
12:33:04 From Sarah G. To Everyone:
    Replying to "https://www.slido.co..."

    1342945
12:33:23 From trish To Everyone:
    Replying to "are the nxontology r..."

    Some repos are public https://github.com/related-sciences
12:34:25 From Chris Mungall To Everyone:
    My vote: Metaclass >> gut >> topology
12:34:38 From Chris Mungall To Everyone:
    Gpt not gut (feeling :-)
12:34:46 From Nico To Everyone:
    Reacted to "Gpt not gut (feeling…" with 😂
12:35:01 From Sarah G. To Everyone:
    Reacted to "Gpt not gut (feeling..." with 😂
12:35:18 From Sarah G. To Everyone:
    https://www.slido.com/     1342945
12:35:24 From Gioconda Alyea To Everyone:
    topology, description, cross reference
12:44:11 From Sarah G. To Everyone:
    https://github.com/related-sciences/nxontology
12:44:19 From Nico To Everyone:
    Super interesting guys, I have to run now, will follow up
12:49:53 From Chris Mungall To Everyone:
    Yes! We have an evaluation happening if you want to join that
12:50:09 From Chris Mungall To Everyone:
    [for @Eric]
12:50:52 From Daniel Himmelstein To Eric Sid(Direct Message):
    Slides from our talk at https://slides.com/dhimmel/efo-disease-precision
12:51:14 From Eric Sid To Daniel Himmelstein(Direct Message):
    Replying to "Slides from our talk..."

    Thanks Daniel. Great work, very interesting!
12:54:11 From Eric Czech To Everyone:
    Awesome, would love to take a look and/or get more involved
12:54:15 From Eric Czech To Everyone:
    How can we do that?
13:01:16 From Daniel Himmelstein To Everyone:
    Slides from our talk https://slides.com/dhimmel/efo-disease-precision
13:02:19 From Eric Sid To Everyone:
    Great talks!
13:02:32 From Gioconda Alyea To Everyone:
    Very interesting!  Great to know about all you do and will want to keep in contact
13:03:05 From Sarah G. To Everyone:
    I will leave you in the call to chat!
13:03:09 From Sarah G. To Everyone:
    Thank you!

related-sciences / nxontology-ml

Prepare the MONDO outreach presentation materials #13