neurogenomics / RareDiseasePrioritisation

Prioritise cell-type-specific gene targets from the Rare Disease Celltyping project.

Annotate diseases/phenotypes using chatGPT #19

Closed bschilder closed 8 months ago

bschilder commented 1 year ago

(checked boxes indicate at least an initial attempt has been made)

Annotations

Models

Related

Some of my initial attempts are documented within this R package: https://github.com/neurogenomics/gptPhD

@KittyMurphy once you have a chance please report your progress here. I'll do the same.

bschilder commented 1 year ago

@KittyMurphy please document your progress on this here

KittyMurphy commented 1 year ago

Annotating HPO phenotypes using chatGPT via gptstudio

Set up

install.packages("gptstudio")
library(gptstudio)

# Load HPO terms 
terms_dt = HPOExplorer::load_phenotype_to_genes(3)
terms_cols = list(name="Phenotype",
                  id="ID")

# Get unique terms and their IDs 
terms_dt_sub <- unique(terms_dt[, unname(unlist(terms_cols)), with = FALSE])

### Attempt #1

Here I'm using the congenital onset terms (without HPO ID) that were provided to us by Peter Robinson. Will also try:

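# congenital onset terms without HPO ID

congenital_onset <- "Syndactyly; Ventricular septal defect; Atrioventricular canal defect; Atrial septal defect; Abnormal connection of the cardiac segments; Fetal anomaly; Neural tube defect; Coloboma; Microtia; Cryptotia; Cupped ear; Cleft helix; Low-set ears; Synotia; Holoprosencephaly; Exstrophy; Abdominal wall defect; Abnormal lung lobation; Unilateral primary pulmonary dysgenesis"

# define the effects you need answers to e.g. does the phenotype cause death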

effects <- "mental retardation, death, impaired mobility, physical malformations, blindness, sensory impairments, immunodeficiency, cancer, reduced fertility."

# define the columns of the output table

table_columns <- "phenotype, mental retardation, death, impaired mobility, physical malformations, blindness, sensory impairments, immunodeficiency, cancer, reduced fertility, congenital onset, justification."

# define chatGPT prompt

question <- paste("Do:", congenital_onset, ", typically cause:", effects,
                  "Do they have congenital onset?",
                  "You must give one-word yes or no answers and give a justification for why they do or don't have congenital onset.",
                  "You must provide the output in .tsv format with columns:", table_columns)
question <- gsub("\n", "", question)

# run chatgpt 5 times for the same prompt

n <- 5
run_chatgpt <- function(q){
  all_res <- gptstudio::openai_create_chat_completion(prompt = q)
  # parse the .tsv text returned by chatGPT into a data.table
  data.table::fread(text = all_res[["choices"]]$message.content)
}

res_allPheno <- lapply(seq_len(n), function(x) run_chatgpt(question))

res_allPheno_dt <- data.table::rbindlist(res_allPheno, fill = TRUE, use.names = TRUE, idcol = "iteration")

# order alphabetically so that you can compare results across phenotypes

res_allPheno_dt <- res_allPheno_dt[order(res_allPheno_dt$phenotype), ]


Below is a subset of `res_allPheno_dt`. The answers chatGPT gives across iterations of the same prompt are not consistent, e.g. compare the mental retardation column for coloboma. A coloboma is an area of missing tissue in the eye and, from a quick Google search, does not appear to be associated with mental retardation. 

iteration | phenotype | mental retardation | death | impaired mobility | physical malformations | blindness | sensory impairments | immunodeficiency | cancer | reduced fertility | congenital onset | justification
-- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- 
1 | Atrioventricular canal defect | Yes | Yes | Yes | Yes | No | No | No | No | No | Yes | Congenital heart defect present at birth
2 | Atrioventricular canal defect | Yes, in some cases | May lead to premature death | no | May lead to growth failure, fatigue or rapid breathing | May lead to vision problems | None | None | None | No | AV canal defect is present at birth and is a congenital condition. |  
3 | Atrioventricular canal defect | Yes | Yes | No | Yes | No | No | No | No | No | Yes | Atrioventricular canal defect is a congenital heart defect in which there is an opening in the center of the heart where the walls separating the heart chambers should be.
4 | Atrioventricular canal defect | Yes | Possible | None | Physical malformations | No | No | No | No | No | Yes | Congenital onset is typical of this phenotype as it is a result of abnormal development of the heart during fetal development.
5 | Atrioventricular canal defect | Yes | Yes | No | Yes | No | No | No | No | No | Yes | It is a congenital heart defect that is present at birth.
1 | Cleft helix | No | No | No | Yes | No | No | No | No | No | Yes | Congenital ear malformation present at birth
2 | Cleft helix | No | None | None | May lead to physical malformations of the ear | None | None | None | None | Yes | Cleft helix is present at birth and is a congenital condition. |  
3 | Cleft helix | No | No | No | Yes | No | No | No | No | No | Yes | Cleft helix is a congenital anomaly characterized by a cleft or gap in the top part of the ear.
4 | Cleft helix | No | None | None | Physical malformations | No | No | No | No | No | Yes | Congenital onset is typical of this phenotype as it is a result of incomplete development of the ear during fetal development.
5 | Cleft helix | No | No | No | Yes | No | No | No | No | No | Yes | A cleft helix is a rare congenital malformation of the ear.
1 | Coloboma | Yes | No | No | Yes | Yes | Yes | No | No | No | Yes | Present at birth and can affect vision and eye structure
2 | Coloboma | No | May lead to vision problems or blindness | May depend on location on the body | None | May lead to vision problems or blindness | May lead to hearing loss or deafness | None | None | No | Coloboma is present at birth and is a congenital condition. |  
3 | Coloboma | Yes | No | No | Yes | Yes | Yes | No | No | No | Yes | Coloboma is a congenital anomaly characterized by a gap or hole in one of the structures of the eye.
4 | Coloboma | No | None | None | Physical malformations | Possible | Possible | No | No | No | Yes | Congenital onset is typical of this phenotype as it is a result of incomplete fusion of the tissues that form the eye during fetal development.
5 | Coloboma | Yes | No | No | Yes | Yes | No | No | No | No | Yes | A coloboma is a birth defect that affects the eye.
1 | Cryptotia | No | No | No | Yes | No | No | No | No | No | Yes | Congenital ear malformation present at birth
2 | Cryptotia | No | None | None | May lead to physical malformations of the ear | None | None | None | None | Yes | Cryptotia is present at birth and is a congenital condition. |  
3 | Cryptotia | No | No | No | Yes | No | No | No | No | No | Yes | Cryptotia is a congenital anomaly characterized by a hidden ear that is partially or completely covered by skin.
4 | Cryptotia | No | None | None | Physical malformations | No | No | No | No | No | Yes | Congenital onset is typical of this phenotype as it is a result of abnormal development of the ear during fetal development.
5 | Cryptotia | No | No | No | Yes | No | No | No | No | No | Yes | Cryptotia is a congenital ear deformity.
1 | Cupped ear | No | No | No | Yes | No | No | No | No | No | Yes | Congenital ear malformation present at birth
2 | Cupped ear | No | None | None | May lead to physical malformations of the ear | None | None | None | None | Yes | Cupped ear is present at birth and is a congenital condition. |  
3 | Cupped ear | No | No | No | Yes | No | No | No | No | No | Yes | Cupped ear is a congenital anomaly characterized by an ear that is shaped like a cup and protrudes outward from the side of the head.
4 | Cupped ear | No | None | None | Physical malformations | No | No | No | No | No | Yes | Congenital onset is typical of this phenotype as it is a result of abnormal development of the ear during fetal development.
5 | Cupped ear | No | No | No | Yes | No | No | No | No | No | Yes | A cupped ear is a congenital malformation.
1 | Exstrophy | Yes | No | Yes | Yes | No | No | No | No | No | Yes | Present at birth and affects bladder and pelvic development
2 | Exstrophy | No | None | None | May lead to physical malformations of the abdominal wall or pelvic organs | None | None | None | May lead to reduced fertility | Yes | Exstrophy is present at birth and is a congenital condition. |  
3 | Exstrophy | Yes | No | Yes | Yes | No | No | No | No | No | Yes | Exstrophy is a congenital anomaly characterized by a defect in the abdominal wall or bladder.
4 | Exstrophy | No | None | None | Physical malformations | No | No | No | No | No | Yes | Congenital onset is typical of this phenotype as it is a result of abnormal development of the abdominal wall during fetal development.
5 | Exstrophy | Yes | No | Yes | Yes | No | No | No | No | No | Yes | Exstrophy is a congenital abnormality where the bladd
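
One way to make the inconsistency in the table above measurable is to collapse each free-text answer to yes/no/other and count the distinct answers per phenotype. This is only a minimal base-R sketch and was not part of the original analysis: `normalise_answer` and `n_distinct_answers` are hypothetical names, treating "None" as "no" is an assumption, and the toy values are the coloboma and cleft helix "mental retardation" answers copied from the table.

```r
# Hypothetical helper: collapse a free-text answer to "yes", "no" or "other".
normalise_answer <- function(x) {
  x <- tolower(trimws(x))
  out <- rep("other", length(x))
  out[x == "yes"] <- "yes"
  out[x %in% c("no", "none")] <- "no"   # assumption: "None" is read as "no"
  out
}

# Toy data copied from the "mental retardation" column of the table above.
toy <- data.frame(
  phenotype = rep(c("Coloboma", "Cleft helix"), each = 5),
  mental_retardation = c("Yes", "No", "Yes", "No", "Yes",
                         "No", "No", "No", "No", "No")
)
toy$answer <- normalise_answer(toy$mental_retardation)

# Number of distinct normalised answers per phenotype (1 = fully consistent).
n_distinct_answers <- tapply(toy$answer, toy$phenotype,
                             function(a) length(unique(a)))
print(n_distinct_answers)
```

A count above 1 flags a phenotype whose answer changed between iterations; answers mapped to "other" (e.g. "Yes, in some cases") would need manual review.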

### Attempt #2
What if I run the prompt one phenotype at a time, with 3 iterations?

congenital_onset_split <- as.list(strsplit(congenital_onset, "; ")[[1]])

results_list <- list()

for (j in 1:3) {
  res_individualPheno <- lapply(seq_along(congenital_onset_split), function(i){
    pheno <- congenital_onset_split[[i]]
    question <- paste("Does", pheno, "typically cause:", effects,
                      "Does", pheno, "have congenital onset?",
                      "You must give one-word yes or no answers and give a justification for why it does or doesn't have congenital onset.",
                      "You must provide the output in .tsv format with columns:", table_columns)
    question <- gsub("\n", "", question)
    print(question)
    all_res <- gptstudio::openai_create_chat_completion(prompt = question)
    # parse the .tsv text returned by chatGPT into a data.table
    data.table::fread(text = all_res[["choices"]]$message.content)
  })
  results_list[[j]] <- res_individualPheno # store the result in the list
}

# flatten the list of iterations into a single list of tables
res_individualPheno_flat <- unlist(results_list, recursive = FALSE)

res_individualPheno_dt <- data.table::rbindlist(res_individualPheno_flat, fill = TRUE, use.names = TRUE, idcol = "iteration")

# order alphabetically so that you can compare results across phenotypes

res_individualPheno_dt <- res_individualPheno_dt[order(res_individualPheno_dt$phenotype), ]


Below is a subset of `res_individualPheno_dt`. I've shown the same phenotypes as for `res_allPheno_dt` for comparison. There seems to be more consistency across the iterations when you run chatGPT on each phenotype individually. 

phenotype | mental retardation | death | impaired mobility | physical malformations | blindness | sensory impairments | immunodeficiency | cancer | reduced fertility | congenital onset | justification | justification
-- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- |--
Atrioventricular canal defect | no | no | no | yes | no | no | no | no | no | yes | NA | Defect occurs during fetal development, therefore present at birth.
Atrioventricular canal defect | No | No | No | Yes | No | No | No | No | No | Yes | NA | Atrioventricular canal defect is a congenital heart defect. It is present at birth and develops as the heart forms during fetal development. 
Atrioventricular canal defect | No | No | No | Yes | No | No | No | No | No | Yes | NA | Atrioventricular canal defect is a congenital heart defect that occurs during fetal development.
Cleft helix | No | No | No | Yes | No | No | No | No | No | Yes | NA | Cleft helix is a genetic condition that is present at birth, thus indicating that it has a congenital onset.
Cleft helix | No | No | No | Yes | No | No | No | No | No | Yes | NA | Cleft helix is a genetic condition, meaning it is present at birth and caused by inherited gene mutations. It is a congenital condition.
Cleft helix | No | No | No | Yes | No | No | No | No | No | Yes | NA | Congenital onset is indicated by the presence of a physical malformation at birth, which is true for cleft helix.
Coloboma | No | No | No | Yes | Yes | Yes | No | No | No | Yes | NA | Congenital onset means present at birth, and coloboma is a congenital condition that occurs when certain structures in the eye or other parts of the body don't develop properly during fetal growth. Therefore, it has a congenital onset.
Coloboma | No | No | No | Yes | Yes | Yes | No | No | No | Yes | NA | Congenital onset refers to a condition that is present at or before birth. Coloboma is a congenital condition, as it occurs when the eye doesn't develop properly during pregnancy.
Coloboma | no | no | no | yes | yes | yes | no | no | no | yes | NA | Coloboma is a congenital birth defect that affects the eyes, and it is usually present from birth. It is caused by abnormal development of the eye during gestation.
Cryptotia | No | No | No | Yes | No | No | No | No | No | Yes | NA | Cryptotia is a congenital ear anomaly.
Cryptotia | No | No | No | Yes | No | No | No | No | No | Yes | NA | Cryptotia is a congenital ear malformation that is present at birth.
Cryptotia | No | No | No | Yes | No | No | No | No | No | Yes | NA | Cryptotia is a congenital condition, meaning it is present at or before birth.
Cupped ear | No | No | No | Yes | No | No | No | No | No | Yes | NA | Cupped ear is associated with physical malformations and is present at birth (congenital).
Cupped ear | no | no | no | yes | no | no | no | no | no | yes | NA | The development of an ear occurs during fetal development, hence the onset of cupped ear is congenital.
Cupped ear | No | No | No | Yes | No | No | No | No | No | Yes | NA | It is a congenital deformity that occurs during fetal development.
Exstrophy | No | No | Yes | Yes | No | No | No | No | Yes | Yes | NA | It is a birth defect that occurs during fetal development.
Exstrophy | No | No | Yes | Yes | No | No | No | Yes | Yes | Yes | NA | Exstrophy is a congenital anomaly that occurs during fetal development. The anterior body wall fails to properly fuse together, resulting in the exposure of internal organs.
Exstrophy | No | No | Yes | Yes | No | No | No | No | Yes | Yes | NA | Consequence of abnormal embryonic development
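
Where the iterations mostly agree, as in the table above, the repeated answers could be reduced to one consensus call per phenotype and effect. The following is a minimal base-R sketch under that assumption, not something from the original analysis; `majority_vote` is a hypothetical helper, and the two input vectors are the exstrophy "cancer" and coloboma "blindness" answers copied from the table.

```r
# Hypothetical helper: most frequent answer after trimming and lower-casing.
# Ties resolve to the alphabetically first answer, because table() sorts
# its names and which.max() returns the first maximum.
majority_vote <- function(answers) {
  counts <- table(tolower(trimws(answers)))
  names(counts)[which.max(counts)]
}

exstrophy_cancer   <- c("No", "Yes", "No")    # from the table above
coloboma_blindness <- c("Yes", "Yes", "yes")  # mixed case, same answer

print(majority_vote(exstrophy_cancer))
print(majority_vote(coloboma_blindness))
```

Lower-casing first matters here because the responses mix "Yes"/"yes" and "No"/"no", which would otherwise be counted as different answers.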

### Attempt #3 
Here I'm repeating attempt #1 with the addition of providing chatGPT with the definition of each congenital onset term. 

# make dataframe with congenital onset phenotypes and their IDs, match column names to those in hpo meta

congenital_onset_dt <- data.table::data.table(
  preferredlabel = c("Syndactyly", "Ventricular septal defect", "Atrioventricular canal defect",
                     "Atrial septal defect", "Abnormal connection of the cardiac segments",
                     "Fetal anomaly", "Neural tube defect", "Coloboma", "Microtia", "Cryptotia",
                     "Cupped ear", "Cleft helix", "Low-set ears", "Synotia", "Holoprosencephaly",
                     "Exstrophy", "Abdominal wall defect", "Abnormal lung lobation",
                     "Unilateral primary pulmonary dysgenesis"),
  HPO_ID = c("HP:0001159", "HP:0001629", "HP:0006695", "HP:0001631", "HP:0011545",
             "HP:0034057", "HP:0045005", "HP:0000589", "HP:0008551", "HP:0011252",
             "HP:0000378", "HP:0009902", "HP:0000369", "HP:0100663", "HP:0001360",
             "HP:0100548", "HP:0010866", "HP:0002101", "HP:0006549"))

# get HPO metadata table for all descendant terms of 'phenotypic abnormality'

hpo_meta <- HPOExplorer::make_phenos_dataframe("HP:0000118")

# get meta info for congenital onset phenotypes

congenital_onset_dt <- merge(congenital_onset_dt, hpo_meta)

# phenos + definition for prompt, note that some don't have a definition in the hpo_meta table

phenos <- paste(paste0(congenital_onset_dt$preferredlabel, " - ", congenital_onset_dt$definition), collapse = "; ")

phenos <- gsub("\"\"", "'", phenos)

# define chatGPT prompt

question <- paste("Do:", phenos, ", typically cause:", effects,
                  "Do they have congenital onset?",
                  "You must give one-word yes or no answers and give a justification for why they do or don't have congenital onset.",
                  "You must provide the output in .tsv format with columns:", table_columns)
question <- gsub("\n", "", question)

# run chatgpt 5 times for the same prompt

n <- 5
run_chatgpt <- function(q){
  all_res <- gptstudio::openai_create_chat_completion(prompt = q)
  # parse the .tsv text returned by chatGPT into a data.table
  data.table::fread(text = all_res[["choices"]]$message.content)
}

res_multiPheno_def <- lapply(seq_len(n), function(x) run_chatgpt(question))

res_multiPheno_def_dt <- data.table::rbindlist(res_multiPheno_def, fill = TRUE, use.names = TRUE, idcol = "iteration")

# order alphabetically so that you can compare results across phenotypes

res_multiPheno_def_dt <- res_multiPheno_def_dt[order(res_multiPheno_def_dt$phenotype), ]
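
Since some terms have no definition in the `hpo_meta` table, pasting label and definition together directly can put a literal "NA" into the prompt. A minimal base-R sketch of one way to avoid that is below; `format_pheno` is a hypothetical helper, and the definition text is a placeholder for illustration, not taken from HPO.

```r
# Hypothetical helper: "label - definition", or just "label" when the
# definition is missing (NA or an empty/whitespace-only string).
format_pheno <- function(label, definition) {
  missing_def <- is.na(definition) | !nzchar(trimws(definition))
  ifelse(missing_def, label, paste0(label, " - ", definition))
}

labels <- c("Coloboma", "Synotia")
# Placeholder definition for illustration only; the second term has none.
definitions <- c("placeholder definition", NA)

phenos_example <- paste(format_pheno(labels, definitions), collapse = "; ")
print(phenos_example)
```

Whether dropping the missing definition or flagging it explicitly in the prompt gives better answers would need to be checked against the results.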

Here is a subset of `res_multiPheno_def_dt`. Including the definition in the prompt seems to (i) improve consistency across iterations but (ii) reduce accuracy, e.g. coloboma doesn't seem to be associated with mental retardation, and atrioventricular canal defect does not 'typically' cause death if there is surgical intervention (see below the table for a more detailed answer for this phenotype from chatGPT). 

iteration | phenotype | mental retardation | death | impaired mobility | physical malformations | blindness | sensory impairments | immunodeficiency | cancer | reduced fertility | congenital onset | justification
-- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | --
1 | Atrioventricular canal defect | Yes | Yes | No | Yes | No | No | No | No | No | Yes | Present at birth (congenital).
2 | Atrioventricular canal defect | Yes | Yes | Yes | Yes | No | No | No | No | No | Yes | This condition is present at birth and affects the heart.
3 | Atrioventricular canal defect | Yes | Yes | No | Yes | No | No | No | No | No | Yes | This is a defect in the atrioventricular septum of the heart which is a congenital defect.
4 | Atrioventricular canal defect | Yes | Yes | No | Yes | No | No | No | No | No | Yes | The term refers to a congenital heart defect that is present at birth (congenital).
5 | Atrioventricular canal defect | Yes | Yes | No | Yes | No | No | No | No | No | Yes | Congenital onset is specified in the definition.
1 | Cleft helix | No | No | No | Yes | No | No | No | No | No | Yes | Present at birth (congenital).
2 | Cleft helix | No | No | No | Yes | No | No | No | No | No | Yes | This is a congenital abnormality that affects the ear.
3 | Cleft helix | No | No | No | Yes | No | No | No | No | No | Yes | Cleft helix is a defect that is present since birth.
4 | Cleft helix | No | No | No | Yes | No | No | No | No | No | Yes | The term refers to a developmental defect of the helix of the ear that is present at birth (congenital).
5 | Cleft helix | No | No | No | Yes | No | No | No | No | No | Yes | Congenital onset is specified in the definition.
1 | Coloboma | Yes | No | No | Yes | Yes | Yes | No | No | No | Yes | Develops during fetal development and is present at birth (congenital).
2 | Coloboma | Yes | No | No | Yes | Yes | No | No | No | No | Yes | This is a developmental defect that is present at birth.
3 | Coloboma | Yes | Yes | No | Yes | Yes | Yes | No | No | No | Yes | Coloboma is a developmental defect that occurs during embryonic development.
4 | Coloboma | Yes | No | No | Yes | Yes | Yes | No | No | No | Yes | The term refers to a developmental defect of the eye that is present at birth (congenital).
5 | Coloboma | Yes | No | No | Yes | Yes | Yes | No | No | No | Yes | Congenital onset is specified in the definition.
1 | Cryptotia | No | No | Yes | Yes | No | No | No | No | No | Yes | Present at birth (congenital).
2 | Cryptotia | No | No | No | Yes | No | No | No | No | No | Yes | This is a congenital abnormality that affects the ear.
3 | Cryptotia | No | No | No | Yes | No | No | No | No | No | Yes | Cryptotia is present at birth.
4 | Cryptotia | No | No | No | Yes | No | No | No | No | No | Yes | The term refers to a developmental defect of the auricle of the ear that is present at birth (congenital).
5 | Cryptotia | No | No | No | Yes | No | No | No | No | No | Yes | Congenital onset is specified in the definition.
1 | Cupped ear | No | No | No | Yes | No | No | No | No | No | Yes | Present at birth (congenital).
2 | Cupped ear | No | No | No | Yes | No | No | No | No | No | Yes | This is a congenital abnormality that affects the ear.
3 | Cupped ear | No | No | No | Yes | No | No | No | No | No | Yes | This is a defect in ear folding which occurs during embryonic development.
4 | Cupped ear | No | No | No | Yes | No | No | No | No | No | Yes | The term refers to a developmental defect of the ear that is present at birth (congenital).
5 | Cupped ear | No | No | No | Yes | No | No | No | No | No | Yes | Congenital onset is specified in the definition.
1 | Exstrophy | Yes | Yes | Yes | Yes | No | No | No | No | No | Yes | Present at birth (congenital).
2 | Exstrophy | No | No | No | Yes | No | No | No | No | No | Yes | This is a developmental defect that is present at birth.
3 | Exstrophy | No | No | Yes | Yes | No | No | No | No | No | Yes | Exstrophy is a result of developmental defects in embryonic development.
4 | Exstrophy | No | No | Yes | Yes | No | No | No | No | No | Yes | The term refers to a developmental defect of the abdominal wall that is present at birth (congenital).
5 | Exstrophy | Yes | Yes | Yes | Yes | No | No | No | No | No | Yes | Congenital onset is specified in the definition.

<img width="865" alt="Screenshot 2023-03-27 at 11 14 53 am" src="https://user-images.githubusercontent.com/56632280/227913284-6f0a6968-813e-4d1b-a9d8-5c7b098fd41b.png">

### Attempt #4 
Here I'm repeating attempt #2 with the addition of providing chatGPT with the definition of each congenital onset term. 

results_list <- list() 

for (j in 1:3) {
  res_indPheno_def <- lapply(seq_len(nrow(congenital_onset_dt)), function(i){
    pheno <- congenital_onset_dt$preferredlabel[[i]]
    definition <- congenital_onset_dt$definition[[i]]
    question <- paste("Does", pheno, "-", definition, ", typically cause:", effects,
                      "Does", pheno, "have congenital onset?",
                      "You must give one-word yes or no answers and give a justification for why it does or doesn't have congenital onset.",
                      "You must provide the output in .tsv format with columns:", table_columns)
    question <- gsub("\n", "", question)
    # drop the full stop left by definitions ending in "." (literal match, not a regex)
    question <- gsub(". , typically", ", typically", question, fixed = TRUE)
    all_res <- gptstudio::openai_create_chat_completion(prompt = question)
    # parse the .tsv text returned by chatGPT into a data.table
    data.table::fread(text = all_res[["choices"]]$message.content)
  })
  results_list[[j]] <- res_indPheno_def
}

# flatten the list of iterations into a single list of tables
res_indPheno_def_flat <- unlist(results_list, recursive = FALSE)

res_indPheno_def_dt <- data.table::rbindlist(res_indPheno_def_flat, fill = TRUE, use.names = TRUE, idcol = "iteration")

# order alphabetically so that you can compare results across phenotypes

res_indPheno_def_dt <- res_indPheno_def_dt[order(res_indPheno_def_dt$phenotype), ]



Here is a subset of `res_indPheno_def_dt`.

iteration | phenotype | mental retardation | death | impaired mobility | physical malformations | blindness | sensory impairments | immunodeficiency | cancer | reduced fertility | congenital onset | justification | justification
-- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- 
5 | Atrioventricular canal defect | No | No | No | Yes | No | No | No | No | No | Yes | Cause is a defect of the atrioventricular septum which develops during fetal development, making it congenital. | NA
24 | Atrioventricular canal defect | No | No | No | Yes | No | No | No | No | No | Yes | Atrioventricular canal defect is a congenital heart defect, meaning it is present at birth. | NA
43 | Atrioventricular canal defect | No | Yes | No | Yes | No | No | No | No | No | Yes | Atrioventricular canal defect is a congenital heart defect that is present at birth. | NA
6 | Cleft helix | No | No | No | Yes | No | No | No | No | No | Yes | Cleft helix is a congenital malformation that occurs during fetal development. | NA
25 | Cleft helix | No | No | No | Yes | No | No | No | No | No | Yes | Cleft helix is a physical malformation that is present at birth and affects the ear. | NA
44 | Cleft helix | No | No | No | Yes | No | No | No | No | No | Yes | Cleft helix is a physical malformation of the ear that is present at birth, indicating a congenital onset. | NA
7 | Coloboma | No | No | No | Yes | Yes | No | No | No | No | Yes | Coloboma is a congenital condition as it results from incomplete closure of the optic fissure during embryonic development, which occurs during the early stages of fetal development. | NA
26 | Coloboma | No | No | No | Yes | Yes | Yes | No | No | No | Yes | Coloboma is a developmental defect that is present at birth, therefore it has a congenital onset. | NA
45 | Coloboma | no | no | no | yes | yes | yes | no | no | no | yes | It is a developmental defect, meaning it occurs during fetal development and is present at birth. | NA
8 | Cryptotia | No | No | No | Yes | No | No | No | No | No | Yes | Cryptotia is a congenital condition, meaning it is present at birth. It is caused by abnormal development of the ear during fetal development. | NA
27 | Cryptotia | No | No | No | Yes | No | No | No | No | No | Yes | Cryptotia is a congenital anomaly caused by abnormal development of the auricle in utero. | NA
46 | Cryptotia | No | No | No | Yes | No | No | No | No | No | Yes | Cryptotia is a congenital anomaly that develops during fetal growth and is present at birth. | NA
9 | Cupped ear | No | No | No | Yes | No | No | No | No | No | Yes | Cupped ear is a physical malformation that is present at birth, thus it has a congenital onset. | NA
28 | Cupped ear | No | No | No | Yes | No | No | No | No | No | Yes | Cupped ear is a physical malformation that is present at birth, indicating congenital onset. | NA
47 | Cupped ear | No | No | No | Yes | No | No | No | No | No | Yes | Cupped ear is a physical malformation that is present at birth and does not develop later in life. Therefore, it has a congenital onset. | NA
10 | Exstrophy | No | No | Yes | Yes | No | No | No | No | Yes | Yes | Exstrophy is a congenital birth defect that occurs during fetal development. | NA
29 | Exstrophy | No | No | Yes | Yes | No | No | No | No | Yes | Yes | Exstrophy is a congenital abnormality, present at birth. | NA
48 | Exstrophy | No | No | Yes | Yes | No | No | No | No | Yes | Yes | Exstrophy is a congenital condition that occurs during. | NA
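
The `iteration` values in the table above (e.g. 5, 24, 43 for atrioventricular canal defect) look like positions in the flattened list of calls rather than loop iterations, which is what `idcol` would give after `unlist(..., recursive = FALSE)`. Assuming 19 phenotypes per pass, as listed in `congenital_onset_dt`, the pass number and the phenotype position can be recovered with integer arithmetic. This is a sketch under that assumption; `n_pheno`, `flat_id` and `pheno_index` are illustrative names.

```r
n_pheno <- 19                         # assumption: 19 phenotypes per pass
flat_id <- c(5, 24, 43, 10, 29, 48)   # "iteration" values from the table above

iteration   <- (flat_id - 1) %/% n_pheno + 1   # which pass of the outer loop
pheno_index <- (flat_id - 1) %% n_pheno + 1    # position within that pass

print(data.frame(flat_id, iteration, pheno_index))
```

Under this reading, the three atrioventricular canal defect rows are passes 1 to 3 of the phenotype at position 5, and the three exstrophy rows are passes 1 to 3 of the phenotype at position 10.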

@bschilder @NathanSkene 

NathanSkene commented 1 year ago

That prompt is not including the description of the phenotype, is it?

Sent from Outlook for iOShttps://aka.ms/o0ukef


From: Kitty Murphy @.> Sent: Sunday, March 26, 2023 11:55:25 AM To: neurogenomics/RareDiseasePrioritisation @.> Cc: Skene, Nathan G @.>; Mention @.> Subject: Re: [neurogenomics/RareDiseasePrioritisation] Annotate diseases/phenotypes using chatGPT (Issue #19)

This email from @.*** originates from outside Imperial. Do not click on links and attachments unless you recognise the sender. If you trust the sender, add them to your safe senders listhttps://spam.ic.ac.uk/SpamConsole/Senders.aspx to disable email stamping for this address.

Annotating HPO phenotypes using chatGPT via gptstudio Set up

install.packages("gptstudio") library(gptstudio)

Load HPO terms

terms_dt = HPOExplorer::load_phenotype_to_genes(3) terms_cols = list(name="Phenotype", id="ID")

Get unique terms and their ID's

terms_dt_sub <.- unique(terms_dt[,unname(unlist(terms_cols)), with=FALSE])

Attempt #1https://github.com/neurogenomics/RareDiseasePrioritisation/issues/1

Here I'm using the congenital onset terms (without HPO ID) that were provided to us by Peter Robinson. Will also try:

congenital onset terms without HPO ID

congenital_onset <- "Syndactyly; Ventricular septal defect; Atrioventricular canal defect; Atrial septal defect; Abnormal connection of the cardiac segments; Fetal anomaly; Neural tube defect; Coloboma; Microtia; Cryptotia; Cupped ear; Cleft helix; Low-set ears; Synotia; Holoprosencephaly; Exstrophy; Abdominal wall defect; Abnormal lung lobation; Unilateral primary pulmonary dysgenesis"

define the effects you need answers to e.g. does the phenotype cause death

effects <- "mental retardation, death, impaired mobility, physical malformations, blindness, sensory impairments, immunodeficiency, cancer, reduced fertility."

define the columns of the output table

table_columns <- "phenotype, mental retardation, death, impaired mobility, physical malformations, blindness, sensory impairments, immunodeficiency, cancer, reduced fertility, congenital onset, jusitification."

define chatGPT prompt

question = paste("Do:", congenital_onset, ", typically cause:", effects, "Do they have congenital onset?", "You must give one-word yes or no answers and give a justification for why they do or don't have congenital onset.", "You must provide the output in .tsv format with columns:", table_columns) question <- gsub("\n", "", question)

run chatgpt 5 times for the same prompt

n = 5 run_chatgpt <- function(q){ all_res <- gptstudio::openai_create_chat_completion(prompt = question) choices <- fread(all_res[["choices"]]$message.content) }

res_allPheno <- lapply(seq_len(n), function(x) run_chatgpt(1))

res_allPheno_dt <- data.table::rbindlist(res_list,fill = TRUE, use.names = TRUE, idcol = "iteration")

order alphabetically so that you can compare results across phenotypes

res_allPheno_dt <- res_allPheno_dt [order(res_allPheno_dt $phenotype), ]

Below is a subset of res_allPheno_dt. The answers chatGPT gives over iterations of the same prompt are not consistent e.g. look at mental retardation for coloboma. A coloboma is an area of missing tissue in your eye, and through a quick google search is not associated with mental retardation.

iteration | phenotype | mental retardation | death | impaired mobility | physical malformations | blindness | sensory impairments | immunodeficiency | cancer | reduced fertility | congenital onset | justification
1 | Atrioventricular canal defect | Yes | Yes | Yes | Yes | No | No | No | No | No | Yes | Congenital heart defect present at birth
2 | Atrioventricular canal defect | Yes, in some cases | May lead to premature death | no | May lead to growth failure, fatigue or rapid breathing | May lead to vision problems | None | None | None | No | AV canal defect is present at birth and is a congenital condition.
3 | Atrioventricular canal defect | Yes | Yes | No | Yes | No | No | No | No | No | Yes | Atrioventricular canal defect is a congenital heart defect in which there is an opening in the center of the heart where the walls separating the heart chambers should be.
4 | Atrioventricular canal defect | Yes | Possible | None | Physical malformations | No | No | No | No | No | Yes | Congenital onset is typical of this phenotype as it is a result of abnormal development of the heart during fetal development.
5 | Atrioventricular canal defect | Yes | Yes | No | Yes | No | No | No | No | No | Yes | It is a congenital heart defect that is present at birth.
1 | Cleft helix | No | No | No | Yes | No | No | No | No | No | Yes | Congenital ear malformation present at birth
2 | Cleft helix | No | None | None | May lead to physical malformations of the ear | None | None | None | None | Yes | Cleft helix is present at birth and is a congenital condition.
3 | Cleft helix | No | No | No | Yes | No | No | No | No | No | Yes | Cleft helix is a congenital anomaly characterized by a cleft or gap in the top part of the ear.
4 | Cleft helix | No | None | None | Physical malformations | No | No | No | No | No | Yes | Congenital onset is typical of this phenotype as it is a result of incomplete development of the ear during fetal development.
5 | Cleft helix | No | No | No | Yes | No | No | No | No | No | Yes | A cleft helix is a rare congenital malformation of the ear.
1 | Coloboma | Yes | No | No | Yes | Yes | Yes | No | No | No | Yes | Present at birth and can affect vision and eye structure
2 | Coloboma | No | May lead to vision problems or blindness | May depend on location on the body | None | May lead to vision problems or blindness | May lead to hearing loss or deafness | None | None | No | Coloboma is present at birth and is a congenital condition.
3 | Coloboma | Yes | No | No | Yes | Yes | Yes | No | No | No | Yes | Coloboma is a congenital anomaly characterized by a gap or hole in one of the structures of the eye.
4 | Coloboma | No | None | None | Physical malformations | Possible | Possible | No | No | No | Yes | Congenital onset is typical of this phenotype as it is a result of incomplete fusion of the tissues that form the eye during fetal development.
5 | Coloboma | Yes | No | No | Yes | Yes | No | No | No | No | Yes | A coloboma is a birth defect that affects the eye.
1 | Cryptotia | No | No | No | Yes | No | No | No | No | No | Yes | Congenital ear malformation present at birth
2 | Cryptotia | No | None | None | May lead to physical malformations of the ear | None | None | None | None | Yes | Cryptotia is present at birth and is a congenital condition.
3 | Cryptotia | No | No | No | Yes | No | No | No | No | No | Yes | Cryptotia is a congenital anomaly characterized by a hidden ear that is partially or completely covered by skin.
4 | Cryptotia | No | None | None | Physical malformations | No | No | No | No | No | Yes | Congenital onset is typical of this phenotype as it is a result of abnormal development of the ear during fetal development.
5 | Cryptotia | No | No | No | Yes | No | No | No | No | No | Yes | Cryptotia is a congenital ear deformity.
1 | Cupped ear | No | No | No | Yes | No | No | No | No | No | Yes | Congenital ear malformation present at birth
2 | Cupped ear | No | None | None | May lead to physical malformations of the ear | None | None | None | None | Yes | Cupped ear is present at birth and is a congenital condition.
3 | Cupped ear | No | No | No | Yes | No | No | No | No | No | Yes | Cupped ear is a congenital anomaly characterized by an ear that is shaped like a cup and protrudes outward from the side of the head.
4 | Cupped ear | No | None | None | Physical malformations | No | No | No | No | No | Yes | Congenital onset is typical of this phenotype as it is a result of abnormal development of the ear during fetal development.
5 | Cupped ear | No | No | No | Yes | No | No | No | No | No | Yes | A cupped ear is a congenital malformation.
1 | Exstrophy | Yes | No | Yes | Yes | No | No | No | No | No | Yes | Present at birth and affects bladder and pelvic development
2 | Exstrophy | No | None | None | May lead to physical malformations of the abdominal wall or pelvic organs | None | None | None | May lead to reduced fertility | Yes | Exstrophy is present at birth and is a congenital condition.
3 | Exstrophy | Yes | No | Yes | Yes | No | No | No | No | No | Yes | Exstrophy is a congenital anomaly characterized by a defect in the abdominal wall or bladder.
4 | Exstrophy | No | None | None | Physical malformations | No | No | No | No | No | Yes | Congenital onset is typical of this phenotype as it is a result of abnormal development of the abdominal wall during fetal development.
5 | Exstrophy | Yes | No | Yes | Yes | No | No | No | No | No | Yes | Exstrophy is a congenital abnormality where the bladd

Attempt #2

What if I run the prompt one phenotype at a time, with 3 iterations?

congenital_onset_split <- as.list(strsplit(congenital_onset, "; ")[[1]])

results_list <- list()

for (j in 1:3) {
  res_individualPheno <- lapply(seq_along(congenital_onset_split), function(i){
    pheno <- congenital_onset_split[[i]]
    question <- paste("Does", pheno, "typically cause:", effects,
                      "Does", pheno, "have congenital onset?",
                      "You must give one-word yes or no answers and give a justification for why it does or doesn't have congenital onset.",
                      "You must provide the output in .tsv format with columns:", table_columns)
    question <- gsub("\n", "", question)
    print(question)
    all_res <- gptstudio::openai_create_chat_completion(prompt = question)
    # parse the .tsv text returned by chatGPT into a data.table
    choices <- data.table::fread(text = all_res[["choices"]]$message.content)
    return(choices)
  })
  results_list[[j]] <- res_individualPheno # store the result in the list
}

# flatten the per-iteration lists into a single list of data.tables
res_flat <- unlist(results_list, recursive = FALSE)

# note: idcol numbers the flattened list elements (iteration x phenotype), not the iteration alone
res_individualPheno_dt <- data.table::rbindlist(res_flat, fill = TRUE, use.names = TRUE, idcol = "iteration")

# order alphabetically so that you can compare results across phenotypes
res_individualPheno_dt <- res_individualPheno_dt[order(res_individualPheno_dt$phenotype), ]

Below is a subset of res_individualPheno_dt; I've shown the same phenotypes as for res_allPheno_dt for comparison. There seems to be more consistency across the iterations when you run chatGPT on each phenotype individually.

phenotype | mental retardation | death | impaired mobility | physical malformations | blindness | sensory impairments | immunodeficiency | cancer | reduced fertility | congenital onset | justification
Atrioventricular canal defect | no | no | no | yes | no | no | no | no | no | yes | NA
Atrioventricular canal defect | No | No | No | Yes | No | No | No | No | No | Yes | NA
Atrioventricular canal defect | No | No | No | Yes | No | No | No | No | No | Yes | NA
Cleft helix | No | No | No | Yes | No | No | No | No | No | Yes | NA
Cleft helix | No | No | No | Yes | No | No | No | No | No | Yes | NA
Cleft helix | No | No | No | Yes | No | No | No | No | No | Yes | NA
Coloboma | No | No | No | Yes | Yes | Yes | No | No | No | Yes | NA
Coloboma | No | No | No | Yes | Yes | Yes | No | No | No | Yes | NA
Coloboma | no | no | no | yes | yes | yes | no | no | no | yes | NA
Cryptotia | No | No | No | Yes | No | No | No | No | No | Yes | NA
Cryptotia | No | No | No | Yes | No | No | No | No | No | Yes | NA
Cryptotia | No | No | No | Yes | No | No | No | No | No | Yes | NA
Cupped ear | No | No | No | Yes | No | No | No | No | No | Yes | NA
Cupped ear | no | no | no | yes | no | no | no | no | no | yes | NA
Cupped ear | No | No | No | Yes | No | No | No | No | No | Yes | NA
Exstrophy | No | No | Yes | Yes | No | No | No | No | Yes | Yes | NA
Exstrophy | No | No | Yes | Yes | No | No | No | Yes | Yes | Yes | NA
Exstrophy | No | No | Yes | Yes | No | No | No | No | Yes | Yes | NA
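The answers in this table mix capitalisation ("no" vs "No"), which would make identical answers look different to any exact string comparison. A minimal sketch of one way to normalise them before comparing runs; the helper name normalise_answer is hypothetical and the input is a toy vector, not the real output:

```r
# Toy answers for one phenotype across three runs (mixed case, as in the table).
answers <- c("no", "No", "No")

# Hypothetical helper: trim whitespace and standardise capitalisation,
# leaving missing values as NA.
normalise_answer <- function(x) {
  x <- trimws(as.character(x))
  out <- paste0(toupper(substr(x, 1, 1)), tolower(substring(x, 2)))
  out[is.na(x)] <- NA_character_
  out
}

normalised <- normalise_answer(answers)
n_raw <- length(unique(answers))      # distinct raw strings
n_norm <- length(unique(normalised))  # distinct answers after normalising
print(c(raw = n_raw, normalised = n_norm))
```

Applying this to every annotation column before scoring would avoid counting case differences as disagreement.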

@bschilder @NathanSkene


bschilder commented 1 year ago

Nice progress @KittyMurphy. That's interesting about the responses being more consistent when provided individually. Wondering if this has to do with informational overload like we were discussing before. Might be an aspect of chatGPT that other people have noticed and documented.

One thing that would be helpful is to come up with a function that computes consistency scores for each metric. That will give us at least some quantitative metric of performance (tho not exactly the ground truth). Something like:

dat=xlsx::read.xlsx("~/Downloads/annot.xlsx",1)
avg <- dplyr::group_by(dat, phenotype) |> dplyr::summarise( mental.retardation_consistency=1/length(unique(mental.retardation)))
avg
Screenshot 2023-03-26 at 14 34 57

After computing the within phenotype consistency, you can compute mean consistency:

mean(avg$mental.retardation_consistency)
# 0.75
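A minimal sketch of how the single-column score above could be generalised to every annotation column at once. The helper name consistency_scores and the toy data are assumptions for illustration, and base R is used instead of dplyr to keep the example self-contained:

```r
# Toy annotations: two phenotypes, three chatGPT runs each (made-up values).
dat <- data.frame(
  phenotype = rep(c("Coloboma", "Exstrophy"), each = 3),
  blindness = c("Yes", "Yes", "Yes", "No", "No", "No"),
  cancer    = c("No", "No", "No", "No", "Yes", "No"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: for each annotation column, compute
# 1 / (number of distinct answers) within each phenotype,
# then average those values across phenotypes.
consistency_scores <- function(dat, group_col = "phenotype") {
  cols <- setdiff(names(dat), group_col)
  sapply(cols, function(col) {
    per_pheno <- tapply(dat[[col]], dat[[group_col]],
                        function(x) 1 / length(unique(x)))
    mean(per_pheno)
  })
}

scores <- consistency_scores(dat)
print(scores)
```

Free-text columns such as the justification would need to be dropped before calling this, since they are expected to differ between runs.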

That prompt is not including the description of the phenotype is it?

@NathanSkene I believe this is only providing chatGPT with the name of the phenotype, not the full description of it. Thus, any other information about the disease is being pulled from the LLM itself.

NathanSkene commented 1 year ago

Good idea to get some stats on it. Could also use scoring to compare ChatGPT 3 vs 4 consistency: expect some folks will be interested.

Including the HPO description might help it get a more consistent understanding of what the phenotype is. Brian, do you know how the descriptions can be accessed programmatically?


KittyMurphy commented 1 year ago

Already working on adding the description, @bschilder I assume the best way to get this is to use the definition column in HPOExplorer::make_phenos_dataframe?

bschilder commented 1 year ago

Already working on adding the description, @bschilder I assume the best way to get this is to use the definition column in HPOExplorer::make_phenos_dataframe?

Yeah, that'll work. Or the subfunction which is more direct: HPOExplorer::add_hpo_definition()

NathanSkene commented 1 year ago

The current prompts do not include a statement for "Do not consider indirect effects". Would be worth adding this in and seeing if it makes any difference.
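A rough sketch of how the two suggestions above (adding the HPO definition, and adding a "Do not consider indirect effects" statement) could be made switchable in the prompt. The helper build_prompt, its arguments, and the placeholder definition text are all hypothetical; in practice the definition would come from HPOExplorer:

```r
# Hypothetical helper that assembles one prompt per phenotype.
# `definition` and `direct_only` are optional additions to the original prompt.
build_prompt <- function(pheno, effects, table_columns,
                         definition = NULL, direct_only = TRUE) {
  parts <- c(
    paste0("Does ", pheno, " typically cause: ", effects),
    paste("Does", pheno, "have congenital onset?"),
    if (!is.null(definition)) paste0("The phenotype is defined as: ", definition),
    if (direct_only) "Do not consider indirect effects.",
    "You must give one-word yes or no answers.",
    paste("You must provide the output in .tsv format with columns:", table_columns)
  )
  paste(parts, collapse = " ")
}

effects <- "death, blindness, cancer."
table_columns <- "phenotype, death, blindness, cancer, congenital onset."

# Placeholder definition text, not a real HPO definition.
prompt_with <- build_prompt("Coloboma", effects, table_columns,
                            definition = "<definition text from the HPO>")
prompt_without <- build_prompt("Coloboma", effects, table_columns,
                               direct_only = FALSE)
cat(prompt_with, "\n")
```

Running the same phenotypes through both variants and scoring consistency would show whether either addition makes a difference.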

bschilder commented 1 year ago

I tried out AutoGPT to see if this might be a useful avenue. Here’s what I learned:

Pros

  1. It can search the internet, via APIs or via Selenium queries. For example, if you ask it something it’s unsure about, it can read the relevant literature/databases on the topic to gain more expertise in that area.
  2. It has built-in Python code for reading/writing code or other files. This means no need to copy-and-paste output from the browser interface. Using this feature I was able to tell it to read in a series of CSVs with 100 HPO terms each (that I had created beforehand) so that each query was a manageable size that didn’t exceed the token limit.
  3. There is a dedicated Docker container to run AutoGPT. The instructions are not super straightforward (or correct) but after some troubleshooting and checking the GitHub Issues I was able to get things working. I took notes on exactly how to do this and will share.

Cons

  1. As very few people have API access to GPT4 atm, when we use AutoGPT we can only use the GPT3.5-turbo model. As you know, this is not as sophisticated a model and will do things like write lazy code that just assigns the same annotations to every phenotype, or simply do substring searches for the term “blindness” within the HPO term itself (which isn’t very useful).
  2. It requires you to have a paid OpenAI account. In the interest of time, I just entered my personal credit card details. It’s actually not too bad; after a whole day of making hundreds of queries I only racked up $1.46 in charges. But still something to be mindful of.
  3. It’s very tricky to get it to do what you actually want, and requires a lot of trial-and-error to get it close. This will hopefully be better with GPT4, but in the meantime I wasn’t able to get it to produce any kind of meaningful annotation for the HPO terms.

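For reference, a minimal sketch of how the CSVs of 100 HPO terms each (mentioned under Pros) could be produced. The term names here are toy placeholders, and the output directory and file names are assumptions:

```r
# Toy vector standing in for HPO term names
# (real terms would come from HPOExplorer).
terms <- paste("phenotype", seq_len(250))

# Split into consecutive chunks of at most 100 terms.
chunk_size <- 100
chunks <- split(terms, ceiling(seq_along(terms) / chunk_size))

# Write each chunk to its own CSV in a temporary directory.
out_dir <- file.path(tempdir(), "hpo_chunks")
dir.create(out_dir, showWarnings = FALSE)
files <- vapply(names(chunks), function(i) {
  f <- file.path(out_dir, paste0("hpo_terms_", i, ".csv"))
  write.csv(data.frame(phenotype = chunks[[i]]), f, row.names = FALSE)
  f
}, character(1))

print(lengths(chunks))
```

The chunk size is the knob to adjust if a query still exceeds the token limit.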
bschilder commented 1 year ago

Here is my favorite example of how AutoGPT can be very lazy 😅

Screenshot 2023-05-05 at 22 27 30
KittyMurphy commented 1 year ago

I have now performed a trial run to annotate phenotypes using chatGPT via Selenium. Initially we asked chatGPT to provide the output in .tsv format but I had difficulty trying to extract this from the chat interface into Python. To overcome this, I asked chatGPT to provide the output as Python code that I could then run to generate a data frame. @bschilder noted that earlier versions of GPT could sometimes be lazy when asked for code.

Here is a prompt example: "I need to annotate phenotypes as to whether they typically cause: intellectual disability, death, impaired mobility, physical malformations, blindness, sensory impairments, immunodeficiency, cancer, reduced fertility? Do they have congenital onset? You must give one-word yes or no answers. Do not consider indirect effects. You must provide the output in python code as a data frame called df with columns: phenotype, intellectual_disability, death, impaired_mobility, physical_malformations, blindness, sensory_impairments, immunodeficiency, cancer, reduced_fertility, congenital_onset, justification. These are the phenotypes: Abnormality of body height; Multicystic kidney dysplasia; Autosomal dominant inheritance; Autosomal recessive inheritance; Abnormal morphology of female internal genitalia; Functional abnormality of the bladder; Recurrent urinary tract infections; Neurogenic bladder; Urinary urgency; Hypoplasia of the uterus; Abnormality of the bladder; Bladder diverticulum"

Here is the trial run using ~100 phenotypes (note, there are ~200 because I think I appended the results twice by mistake): annot_HPO_gpt_test.csv

@NathanSkene noted that the phenotype 'Azoospermia' is not being annotated as reducing fertility. This is worrying, as a literature search on this phenotype returns: "Azoospermia is the complete absence of spermatozoa in the ejaculate. It is the most severe and one of the leading causes of male infertility. The exact pathophysiology of azoospermia is not always known. Azoospermia can be due to pre-testicular, testicular, and post-testicular causes."
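Cases like Azoospermia could be caught automatically with a small list of answers we already know. A minimal sketch, assuming a hand-curated table of expected answers; the helper spot_check, the column names, and the toy rows are hypothetical and only mimic the structure of the output file:

```r
# Toy annotations (made-up rows mimicking the structure of the output file).
annot <- data.frame(
  phenotype = c("Azoospermia", "Cupped ear"),
  reduced_fertility = c("No", "No"),
  stringsAsFactors = FALSE
)

# Hand-curated expectations; illustrative only, not a validated resource.
expected <- data.frame(
  phenotype = "Azoospermia",
  column = "reduced_fertility",
  answer = "Yes",
  stringsAsFactors = FALSE
)

# Return the expectations that the annotations fail to meet
# (including phenotypes that are missing from the annotations).
spot_check <- function(annot, expected) {
  observed <- mapply(function(p, col) {
    annot[[col]][match(p, annot$phenotype)]
  }, expected$phenotype, expected$column, USE.NAMES = FALSE)
  expected$observed <- observed
  expected[is.na(observed) | observed != expected$answer, , drop = FALSE]
}

failures <- spot_check(annot, expected)
print(failures)
```

An empty result would mean every curated expectation was met; any row returned is a candidate for prompt changes or manual review.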

Next, I want to:

bschilder commented 1 year ago

Thanks @KittyMurphy !

A couple of other ideas for reducing token usage (tho whether this helps will depend on how OpenAI counts 'tokens', which I'm still not totally clear on):

bschilder commented 1 year ago

Annotation output checks

All of the annotation validation procedures described below can be rerun with any new annotations using the new internal function: HPOExplorer:::check_annot_gpt https://github.com/neurogenomics/HPOExplorer/blob/master/R/check_annot_gpt.R

Check phenotype names

Check that chatGPT hasn't modified the phenotype names in a way that prevents us from linking them back to the input HPO terms.

d <- data.table::fread(path, key = "Phenotype")
annot <- HPOExplorer::load_phenotype_to_genes()
d$Phenotype[!d$Phenotype %in% annot$Phenotype]
# character(0)

✅ All phenotypes appear verbatim in the HPO gene annotations file.

Check annotation consistency

For phenotypes that chatGPT annotated more than once, how consistent are the Y/N annotations it gave for each?

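nm <- names(d)[!names(d) %in% c("Phenotype","Justification")]
# proportion of "Yes" answers per phenotype, for each annotation column
d_mean <- d[,lapply(.SD,function(x){mean(x=="Yes")}),.SDcols=nm, by="Phenotype"]
# fraction of phenotypes whose answers were all "Yes" (1) or all "No" (0)
d_consist <- lapply(d_mean[,-1], function(x) sum(x %in% c(0,1)) / nrow(d_mean))
d_consist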
$Intellectual_Disability
[1] 1

$Death
[1] 1

$Impaired_Mobility
[1] 1

$Physical_Malformations
[1] 1

$Blindness
[1] 1

$Sensory_Impairments
[1] 1

$Immunodeficiency
[1] 1

$Cancer
[1] 1

$Reduced_Fertility
[1] 0.7708333

$Congenital_Onset
[1] 1
mean(unlist(d_consist))
#  0.9770833

✅ At least in this small subsample, 9/10 annotation columns are 100% consistent across chatGPT runs. This results in an average consistency score of 97.7% across all annotations. "Reduced_Fertility" is one to look out for, as chatGPT does not always provide the same annotation here (77%, which may not seem too bad, but remember that the baseline is 50% as the options are binary).

Check phenotype classifications

As some of these phenotypes belong to specific branches of the HPO that should guarantee a particular annotation (e.g. all forms of blindness phenotypes cause Blindness ('Yes')), we can use this information to validate the chatGPT-provided annotations.

While we can confirm annotations that we would expect (true positives vs. false negatives), this doesn't really let us definitively say whether some phenotypes do NOT cause a given condition such as blindness (true negatives).

d$HPO_ID <- harmonise_phenotypes(phenotypes = d$Phenotype,
                                   as_hpo_ids = TRUE)
  ## Find matching HPO branches
  hpo <- get_hpo() 
  queries <- list(
    Intellectual_Disability=c("intellectual disability"),
    Impaired_Mobility=c("Abnormal central motor function",
                        "Abnormality of movement"),
    Physical_Malformations=c("malformation","morphology"),
    Blindness=c("^blindness"),
    Sensory_Impairments=c("Abnormality of vision",
                          "Abnormality of the sense of smell",
                          "Abnormality of taste sensation",
                          "Somatic sensory dysfunction",
                          "Hearing abnormality"
                          ),
    Immunodeficiency=c("Immunodeficiency"),
    Cancer=c("Neoplasm","Cancer"),
    Reduced_Fertility=c("Decreased fertility")
    ) 
  tiers <- lapply(queries, function(q){
    terms <- grep(paste(q,collapse = "|"),
         hpo$name,
         ignore.case = TRUE, value = TRUE)
    ontologyIndex::get_descendants(ontology = hpo,
                                   roots = names(terms),
                                   exclude_roots = FALSE) |>
      unique()
  })
  annot_check <- lapply(seq_len(nrow(d)), function(i){
    r <- d[i,]
    cbind(
      r[,c("Phenotype","HPO_ID")],
      lapply(stats::setNames(names(tiers),names(tiers)),
             function(x){
               if(r$HPO_ID %in% tiers[[x]]){
                 r[,x,with=FALSE][[1]]=="Yes"
               } else {
                 NA
               }
             }) |> data.table::as.data.table()
    )
  }) |> data.table::rbindlist()

### Number of rows where annotation is NA
  missing_rate <- sapply(
    annot_check[,names(tiers),with=FALSE],
    function(x){sum(is.na(x))/length(x)})
missing_rate
Intellectual_Disability       Impaired_Mobility  Physical_Malformations 
              1.0000000               1.0000000               0.4558824 
              Blindness     Sensory_Impairments        Immunodeficiency 
              1.0000000               1.0000000               1.0000000 
                 Cancer       Reduced_Fertility 
              0.9901961               0.9607843 

True positive rate

### Number of rows where the annotation was checkable and TRUE
true_pos_rate <- sapply(annot_check[,names(tiers),with=FALSE], function(x){sum(na.omit(x)==TRUE)/length(na.omit(x))})
true_pos_rate 
Intellectual_Disability       Impaired_Mobility  Physical_Malformations 
                    NaN                     NaN               0.5765766 
              Blindness     Sensory_Impairments        Immunodeficiency 
                    NaN                     NaN                     NaN 
                 Cancer       Reduced_Fertility 
              1.0000000               0.5000000 

False negative rate

### Number of rows where the annotation was checkable and FALSE
false_neg_rate <- sapply(annot_check[,names(tiers),with=FALSE], function(x){sum(na.omit(x)==FALSE)/length(na.omit(x))})
false_neg_rate
Intellectual_Disability       Impaired_Mobility  Physical_Malformations 
                    NaN                     NaN               0.4234234 
              Blindness     Sensory_Impairments        Immunodeficiency 
                    NaN                     NaN                     NaN 
                 Cancer       Reduced_Fertility 
              0.0000000               0.5000000 
KittyMurphy commented 1 year ago

I have since updated the prompt twice.

Example prompt 1.1: I need to annotate phenotypes as to whether they typically cause: intellectual disability, death, impaired mobility, physical malformations, blindness, sensory impairments, immunodeficiency, cancer, reduced fertility? Do they always have congenital onset? You must give one-word yes or no answers. Do not consider indirect effects. You must provide the output in python code as a data frame called df with columns: phenotype, intellectual_disability, death, impaired_mobility, physical_malformations, blindness, sensory_impairments, immunodeficiency, cancer, reduced_fertility, congenital_onset. Also add justification columns for each outcome. These are the phenotypes: Recurrent urinary tract infections; Neurogenic bladder; Urinary urgency

Here are the results for ~500 phenotypes: gpt_hpo_annotations.csv. The issue here was that we were getting answers other than yes or no for some of the phenotypic outcomes, e.g. 'can be', 'may be'. To get around this, we decided to add a scale for the phenotypic outcomes, so instead of yes or no answers we ask chatGPT to answer using a scale of: never, rarely, often, always. Due to limited token usage we had to drop the number of phenotypes in each prompt to two.

Example prompt 1.2: I need to annotate phenotypes as to whether they typically cause: intellectual disability, death, impaired mobility, physical malformations, blindness, sensory impairments, immunodeficiency, cancer, reduced fertility? Do they have congenital onset? To answer, use a severity scale of: never, rarely, often, always. Do not consider indirect effects. You must provide the output in python code as a data frame called df with columns: phenotype, intellectual_disability, death, impaired_mobility, physical_malformations, blindness, sensory_impairments, immunodeficiency, cancer, reduced_fertility, congenital_onset. Also add justification columns for each outcome. These are the phenotypes: Urinary urgency; Hypoplasia of the uterus

Here are the results so far: gpt_hpo_annotations_scale.csv
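A minimal sketch of how answers on the never/rarely/often/always scale could be validated and encoded once the results are loaded. The helper encode_scale and the example answers are made up for illustration; off-scale replies such as 'may be' become NA so they can be flagged:

```r
scale_levels <- c("never", "rarely", "often", "always")

# Made-up answers, including the kind of off-scale reply described above.
raw <- c("Never", "often", "may be", "ALWAYS", "rarely")

# Hypothetical helper: standardise case and whitespace, then encode
# as an ordered factor; anything not on the scale becomes NA.
encode_scale <- function(x, levels = scale_levels) {
  factor(tolower(trimws(x)), levels = levels, ordered = TRUE)
}

encoded <- encode_scale(raw)
off_scale <- raw[is.na(encoded)]   # answers to re-query or review by hand
scores <- as.integer(encoded)      # 1 = never ... 4 = always, NA = off-scale
print(scores)
print(off_scale)
```

Encoding the scale as ordered integers also makes it possible to average repeated runs instead of only checking whether they match exactly.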

Currently waiting for help from Eugene to get this set up on a remote machine so that it can run 24/7, and it will probably take ~2 weeks.

bschilder commented 1 year ago

@KittyMurphy I'm looking into some resources that might be helpful:

ChatGPT File uploader (google chrome extension) https://chrome.google.com/webstore/detail/chatgpt-file-uploader-ext/becfinhbfclcgokjlobojlnldbfillpf/

Bing Chat: Microsoft's iteration of ChatGPT: https://www.bing.com/

bschilder commented 9 months ago

Update

Stage 1

  1. [x] We first only ran GPT annotations for the 2,832 phenotypes that were significantly enriched for at least one cell type in our first round of analyses

Stage 2

  1. Then, we expanded to all 10,969 phenotypes that appeared within the HPO gene annotations file. This should be sufficient for the first Rare Disease Celltyping paper, as it allows us to prioritise all phenotypes relevant for that paper.
annot=HPOExplorer::load_phenotype_to_genes()
length(unique(annot$hpo_name))
# [1] 10969

@KittyMurphy is running the last of these now.

Stage 3

  1. Finally, we will further extend our GPT annotations to all phenotypes in the HPO, which is currently 18,057 total phenotypes. This will be used for the GPT annotations manuscript.
hpo=HPOExplorer::get_hpo()
length(unique(hpo$name))
# [1] 18057
KittyMurphy commented 9 months ago

I've actually been using the below code to get the phenotypes:

annot <- HPOExplorer::make_phenos_dataframe()
length(unique(annot$hpo_id))
# [1] 10954

I'll make sure I run the remaining 15 phenotypes that are called with HPOExplorer::load_phenotype_to_genes() but just wanted to flag the discrepancy between the two.

bschilder commented 9 months ago

@KittyMurphy make_phenos_dataframe calls load_phenotype_to_genes to get the data, so they should be the same (unless somehow certain phenotypes get filtered in the former function). https://github.com/neurogenomics/HPOExplorer/blob/master/R/make_phenos_dataframe.R

Could you check whether this discrepancy stems from:

  1. The functions themselves
  2. Different versions of HPO ontology/genes data (note, data is cached by default).
  3. Different versions of HPOExplorer
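A minimal sketch of how the two sets could be compared directly to see which phenotypes differ. The ID vectors are toy placeholders; in practice they would be the unique IDs returned by each function. Note also that the two counts above were taken on different columns (hpo_name vs hpo_id), which may itself account for part of the difference:

```r
# Toy ID vectors standing in for the output of the two functions
# (real values would be the unique IDs from each call).
ids_p2g    <- c("HP:0000001", "HP:0000002", "HP:0000003", "HP:0000004")
ids_phenos <- c("HP:0000001", "HP:0000002", "HP:0000003")

# IDs present in one set but not the other.
missing_from_phenos <- setdiff(ids_p2g, ids_phenos)
missing_from_p2g    <- setdiff(ids_phenos, ids_p2g)

print(missing_from_phenos)
print(length(missing_from_p2g))
```

Recording the installed HPOExplorer version and clearing the cached data before re-running both calls would help separate the three possible causes listed above.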