Write initial draft of HPO GPT annotations manuscript

bschilder commented 1 year ago

I'll leave it up to you to decide how you'd like to write it up, @KittyMurphy, but as an alternative to Google Docs this might be a good time to try writing a manuscript entirely in Rmarkdown or Quarto.

Existing code/results you can use as a basis for the manuscript:

KittyMurphy commented 12 months ago

Decided on Quarto. The markdown can be found here: https://github.com/neurogenomics/RareDiseasePrioritisation/tree/master/manuscript

bschilder commented 8 months ago

Just checking, @KittyMurphy: is this the most up-to-date version of the GPT annotations?

https://github.com/neurogenomics/RareDiseasePrioritisation/blob/master/gpt_annotations/gpt4_hpo_annotations.csv

# Load the HPO ontology
hpo <- HPOExplorer::get_hpo()
# Raw URL of the GPT-4 annotations file
path <- paste0(
  "https://github.com/neurogenomics/RareDiseasePrioritisation/raw/master/",
  "gpt_annotations/gpt4_hpo_annotations.csv"
)
d <- data.table::fread(path, header = TRUE)
# Number of distinct phenotypes annotated so far
length(unique(d$phenotype))

It currently contains 11159/18082 HPO phenotypes.

KittyMurphy commented 8 months ago

Thanks for the reminder, I've just uploaded the newest version @bschilder

bschilder commented 8 months ago

Thanks! Just downloaded it.

One thing I'm trying to sort out: there are 917 terms that I can't match up to HPO IDs. These could be either:

cases where the HPO term name changed over time within the HPO
cases where GPT took some small liberties with renaming the HPO term

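# Raw URL of the GPT-4 annotations file
path <- paste0(
  "https://github.com/neurogenomics/RareDiseasePrioritisation/raw/master/",
  "gpt_annotations/gpt4_hpo_annotations.csv"
)
d <- data.table::fread(path, header = TRUE)
# Map phenotype names to HPO IDs
d <- HPOExplorer::add_hpo_id(d)
# Rows whose phenotype name could not be matched to an HPO ID
d[is.na(hpo_id)]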

For example, "Reduced 3-phosphoglycerate dehydrogenase activity" is currently a term in the HPO, but "3-hydroxyacyl-CoA dehydrogenase activity" (from the GPT annotations) is not. https://hpo.jax.org/app/browse/term/HP:0034691

Another is "Weakness of facial musculature" vs. "Weakness of Facial Musculature". Since this is just a capitalisation issue, I can account for situations like this programmatically. https://hpo.jax.org/app/browse/term/HP:0030319
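As a rough illustration of handling the capitalisation case programmatically, here is a minimal base-R sketch. It is not the actual change made to the mapping functions: the toy lookup table, the `normalise_name()` helper, and the example inputs are all assumptions made for illustration.

```r
# Toy lookup table standing in for the HPO (two terms only; illustration)
hpo_terms <- data.frame(
  hpo_id   = c("HP:0030319", "HP:0001250"),
  hpo_name = c("Weakness of facial musculature", "Seizure"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: lower-case and trim so that case and stray
# whitespace do not prevent a match
normalise_name <- function(x) tolower(trimws(x))

# Example annotation names, one with different capitalisation
gpt_names <- c("Weakness of Facial Musculature", "seizure", "Not a real term")

# match() returns NA where no normalised name is found
idx <- match(normalise_name(gpt_names), normalise_name(hpo_terms$hpo_name))
matched <- data.frame(
  phenotype = gpt_names,
  hpo_id    = hpo_terms$hpo_id[idx],
  stringsAsFactors = FALSE
)
print(matched)
```

Names that differ from the HPO by more than case or whitespace still come back as `NA` and need a different strategy.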

KittyMurphy commented 8 months ago

I just looked back at one of the prompt files, and "3-hydroxyacyl-CoA dehydrogenase activity" and "White streaks/specks on enamel" are both there, so at least for these phenotypes it's a case of HPO term names changing.

bschilder commented 8 months ago

OK, good to know! These changes should all be recorded somewhere by HPO, but I'm still working out how to access deprecated IDs.

In the meantime, I may just try to use an old version of the HPO and see if that helps. Do you know which version you used?

KittyMurphy commented 8 months ago

Re: HPO version, it would be whatever was used in MultiEWCE and HPOExplorer.

I originally got the HPO terms using:

all_res <- MultiEWCE::load_example_results()

I then switched to: all_res <- HPOExplorer::make_phenos_dataframe()

Since February I've been using the .obo file from the 2024-02-08 Release. If you follow the link you can see files associated with each release which might be useful in finding old/new terms!
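One way to use the per-release files for finding old/new terms is to compare the ID-to-name tables of two releases. The following is a minimal base-R sketch, assuming each release has already been reduced to a table with `hpo_id` and `hpo_name` columns. The IDs and names below are made up for illustration and are not real HPO terms.

```r
# Toy ID -> name tables for two releases (IDs and names are made up)
old_release <- data.frame(
  hpo_id   = c("HP:9000001", "HP:9000002", "HP:9000003"),
  hpo_name = c("Old term name", "Unchanged term", "Removed term"),
  stringsAsFactors = FALSE
)
new_release <- data.frame(
  hpo_id   = c("HP:9000001", "HP:9000002", "HP:9000004"),
  hpo_name = c("New term name", "Unchanged term", "Added term"),
  stringsAsFactors = FALSE
)

# Inner join on ID keeps only terms present in both releases
both <- merge(old_release, new_release,
              by = "hpo_id", suffixes = c("_old", "_new"))

# Terms whose name changed between the two releases
renamed <- both[both$hpo_name_old != both$hpo_name_new, ]
print(renamed)
```

The `hpo_name_old` column can then serve as an extra lookup key for annotation names that no longer match the current release.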

bschilder commented 8 months ago

> Re: HPO version it would be whatever was used in MultiEWCE and HPOExplorer.

Right, but they don't use static versions anymore. They download them directly from the HPO's latest release at the time you first run the commands.

> Since February I've been using the .obo file from the 2024-02-08 Release. If you follow the link you can see files associated with each release which might be useful in finding old/new terms!

Does this mean you used other versions too? When did you collect the annotations from GPT?

bschilder commented 8 months ago

Just implemented some changes in my mapping functions so that they can ignore case, which helped a bit, but there are still 851 terms with mismatched HPO names. Sharing the remaining list here so you can focus on just these, @KittyMurphy: mismatched_hpo_names.csv.gz

KittyMurphy commented 8 months ago

So far, I've managed to get HPO IDs for 723/851 terms. Sharing here: mismatched_hpo_names_fixed.csv

To achieve this, I used the HPO .obo file versions I used for the annotations (November 2023 and February 2024). For ~300 terms I got the ID by merging directly on 'hpo_name'. For the remainder I used the 'synonym' information, also from the .obo.
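The two-stage lookup described above (match on the term name first, then fall back to synonyms) could be sketched roughly as follows in base R. This is not the code that produced mismatched_hpo_names_fixed.csv. The `parse_obo_labels()` helper, the toy .obo excerpt, and the synonym strings in it are assumptions for illustration; a real file would be read with `readLines()`.

```r
# Toy .obo excerpt (synonym strings are illustrative)
obo_lines <- c(
  "[Term]",
  "id: HP:0001250",
  "name: Seizure",
  "synonym: \"Epileptic seizure\" EXACT []",
  "synonym: \"Seizures\" EXACT []",
  "",
  "[Term]",
  "id: HP:0030319",
  "name: Weakness of facial musculature",
  "synonym: \"Facial weakness\" EXACT []"
)

# Hypothetical helper: one row per name or synonym of each [Term] stanza
parse_obo_labels <- function(lines) {
  header  <- grepl("^\\[[A-Za-z]+\\]$", lines)
  stanza  <- cumsum(header)                      # stanza index per line
  in_term <- stanza > 0 & c(NA, lines[header])[stanza + 1] == "[Term]"
  is_id   <- in_term & grepl("^id: ", lines)
  is_name <- in_term & grepl("^name: ", lines)
  is_syn  <- in_term & grepl("^synonym: \"", lines)
  ids <- character(max(stanza))
  ids[stanza[is_id]] <- sub("^id: ", "", lines[is_id])
  data.frame(
    hpo_id = c(ids[stanza[is_name]], ids[stanza[is_syn]]),
    label  = c(sub("^name: ", "", lines[is_name]),
               sub("^synonym: \"([^\"]*)\".*$", "\\1", lines[is_syn])),
    type   = c(rep("name", sum(is_name)), rep("synonym", sum(is_syn))),
    stringsAsFactors = FALSE
  )
}

labels    <- parse_obo_labels(obo_lines)
names_tbl <- labels[labels$type == "name", ]
syn_tbl   <- labels[labels$type == "synonym", ]
key       <- function(x) tolower(trimws(x))      # case-insensitive key

unmatched <- c("Seizures", "Facial weakness",
               "weakness of facial musculature", "Unknown phenotype")

# Stage 1: match on the term name
hpo_id <- names_tbl$hpo_id[match(key(unmatched), key(names_tbl$label))]
# Stage 2: for anything still missing, fall back to synonyms
via_syn <- is.na(hpo_id)
hpo_id[via_syn] <- syn_tbl$hpo_id[match(key(unmatched[via_syn]),
                                        key(syn_tbl$label))]

fixed <- data.frame(
  hpo_name   = unmatched,
  hpo_id     = hpo_id,
  matched_on = ifelse(is.na(hpo_id), NA_character_,
                      ifelse(via_syn, "synonym", "name")),
  stringsAsFactors = FALSE
)
print(fixed)
```

Recording `matched_on` makes it easy to review the synonym-based matches separately, since a synonym can be looser than an exact name match.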

bschilder commented 7 months ago

Manuscript now located here: https://github.com/neurogenomics/gpt_hpo_annotations

neurogenomics / RareDiseasePrioritisation

Write initial draft of HPO GPT annotations manuscript #31