Closed bschilder closed 8 months ago
@KittyMurphy please document your progress on this here
## Annotating HPO phenotypes using chatGPT via gptstudio
### Set up
install.packages("gptstudio")
library(gptstudio)
library(data.table) # provides fread() and rbindlist(), used below
# Load HPO terms
terms_dt <- HPOExplorer::load_phenotype_to_genes(3)
terms_cols <- list(name = "Phenotype",
                   id = "ID")
# Get unique terms and their IDs
terms_dt_sub <- unique(terms_dt[, unname(unlist(terms_cols)), with = FALSE])
### Attempt #1
Here I'm using the congenital onset terms (without HPO ID) that were provided to us by Peter Robinson. Will also try:
# congenital onset terms without HPO ID
congenital_onset <- "Syndactyly;
Ventricular septal defect; Atrioventricular canal defect;
Atrial septal defect; Abnormal connection of the cardiac segments;
Fetal anomaly; Neural tube defect;
Coloboma; Microtia; Cryptotia;
Cupped ear; Cleft helix; Low-set ears;
Synotia; Holoprosencephaly; Exstrophy;
Abdominal wall defect; Abnormal lung lobation;
Unilateral primary pulmonary dysgenesis"
effects <- "mental retardation, death, impaired mobility, physical malformations, blindness, sensory impairments, immunodeficiency, cancer, reduced fertility."
table_columns <- "phenotype, mental retardation, death, impaired mobility, physical malformations, blindness, sensory impairments, immunodeficiency, cancer, reduced fertility, congenital onset, justification."
question <- paste("Do:", congenital_onset, ", typically cause:", effects,
                  "Do they have congenital onset?",
                  "You must give one-word yes or no answers and give a justification for why they do or don't have congenital onset.",
                  "You must provide the output in .tsv format with columns:", table_columns)
# Replace the line breaks inside congenital_onset with spaces
question <- gsub("\n", " ", question)
n <- 5
# Send one prompt to chatGPT and parse the TSV reply into a data.table
run_chatgpt <- function(q) {
  all_res <- gptstudio::openai_create_chat_completion(prompt = q)
  fread(text = all_res[["choices"]]$message.content)
}
res_allPheno <- lapply(seq_len(n), function(x) run_chatgpt(question))
res_allPheno_dt <- data.table::rbindlist(res_allPheno, fill = TRUE, use.names = TRUE, idcol = "iteration")
res_allPheno_dt <- res_allPheno_dt[order(res_allPheno_dt$phenotype), ]
Below is a subset of `res_allPheno_dt`. The answers chatGPT gives are not consistent across iterations of the same prompt; for example, compare the mental retardation column for coloboma. A coloboma is an area of missing tissue in the eye and, from a quick Google search, does not appear to be associated with mental retardation.
iteration | phenotype | mental retardation | death | impaired mobility | physical malformations | blindness | sensory impairments | immunodeficiency | cancer | reduced fertility | congenital onset | justification
-- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | --
1 | Atrioventricular canal defect | Yes | Yes | Yes | Yes | No | No | No | No | No | Yes | Congenital heart defect present at birth
2 | Atrioventricular canal defect | Yes, in some cases | May lead to premature death | no | May lead to growth failure, fatigue or rapid breathing | May lead to vision problems | None | None | None | No | AV canal defect is present at birth and is a congenital condition. |
3 | Atrioventricular canal defect | Yes | Yes | No | Yes | No | No | No | No | No | Yes | Atrioventricular canal defect is a congenital heart defect in which there is an opening in the center of the heart where the walls separating the heart chambers should be.
4 | Atrioventricular canal defect | Yes | Possible | None | Physical malformations | No | No | No | No | No | Yes | Congenital onset is typical of this phenotype as it is a result of abnormal development of the heart during fetal development.
5 | Atrioventricular canal defect | Yes | Yes | No | Yes | No | No | No | No | No | Yes | It is a congenital heart defect that is present at birth.
1 | Cleft helix | No | No | No | Yes | No | No | No | No | No | Yes | Congenital ear malformation present at birth
2 | Cleft helix | No | None | None | May lead to physical malformations of the ear | None | None | None | None | Yes | Cleft helix is present at birth and is a congenital condition. |
3 | Cleft helix | No | No | No | Yes | No | No | No | No | No | Yes | Cleft helix is a congenital anomaly characterized by a cleft or gap in the top part of the ear.
4 | Cleft helix | No | None | None | Physical malformations | No | No | No | No | No | Yes | Congenital onset is typical of this phenotype as it is a result of incomplete development of the ear during fetal development.
5 | Cleft helix | No | No | No | Yes | No | No | No | No | No | Yes | A cleft helix is a rare congenital malformation of the ear.
1 | Coloboma | Yes | No | No | Yes | Yes | Yes | No | No | No | Yes | Present at birth and can affect vision and eye structure
2 | Coloboma | No | May lead to vision problems or blindness | May depend on location on the body | None | May lead to vision problems or blindness | May lead to hearing loss or deafness | None | None | No | Coloboma is present at birth and is a congenital condition. |
3 | Coloboma | Yes | No | No | Yes | Yes | Yes | No | No | No | Yes | Coloboma is a congenital anomaly characterized by a gap or hole in one of the structures of the eye.
4 | Coloboma | No | None | None | Physical malformations | Possible | Possible | No | No | No | Yes | Congenital onset is typical of this phenotype as it is a result of incomplete fusion of the tissues that form the eye during fetal development.
5 | Coloboma | Yes | No | No | Yes | Yes | No | No | No | No | Yes | A coloboma is a birth defect that affects the eye.
1 | Cryptotia | No | No | No | Yes | No | No | No | No | No | Yes | Congenital ear malformation present at birth
2 | Cryptotia | No | None | None | May lead to physical malformations of the ear | None | None | None | None | Yes | Cryptotia is present at birth and is a congenital condition. |
3 | Cryptotia | No | No | No | Yes | No | No | No | No | No | Yes | Cryptotia is a congenital anomaly characterized by a hidden ear that is partially or completely covered by skin.
4 | Cryptotia | No | None | None | Physical malformations | No | No | No | No | No | Yes | Congenital onset is typical of this phenotype as it is a result of abnormal development of the ear during fetal development.
5 | Cryptotia | No | No | No | Yes | No | No | No | No | No | Yes | Cryptotia is a congenital ear deformity.
1 | Cupped ear | No | No | No | Yes | No | No | No | No | No | Yes | Congenital ear malformation present at birth
2 | Cupped ear | No | None | None | May lead to physical malformations of the ear | None | None | None | None | Yes | Cupped ear is present at birth and is a congenital condition. |
3 | Cupped ear | No | No | No | Yes | No | No | No | No | No | Yes | Cupped ear is a congenital anomaly characterized by an ear that is shaped like a cup and protrudes outward from the side of the head.
4 | Cupped ear | No | None | None | Physical malformations | No | No | No | No | No | Yes | Congenital onset is typical of this phenotype as it is a result of abnormal development of the ear during fetal development.
5 | Cupped ear | No | No | No | Yes | No | No | No | No | No | Yes | A cupped ear is a congenital malformation.
1 | Exstrophy | Yes | No | Yes | Yes | No | No | No | No | No | Yes | Present at birth and affects bladder and pelvic development
2 | Exstrophy | No | None | None | May lead to physical malformations of the abdominal wall or pelvic organs | None | None | None | May lead to reduced fertility | Yes | Exstrophy is present at birth and is a congenital condition. |
3 | Exstrophy | Yes | No | Yes | Yes | No | No | No | No | No | Yes | Exstrophy is a congenital anomaly characterized by a defect in the abdominal wall or bladder.
4 | Exstrophy | No | None | None | Physical malformations | No | No | No | No | No | Yes | Congenital onset is typical of this phenotype as it is a result of abnormal development of the abdominal wall during fetal development.
5 | Exstrophy | Yes | No | Yes | Yes | No | No | No | No | No | Yes | Exstrophy is a congenital abnormality where the bladd
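
The inconsistency in the table above could be quantified rather than judged by eye. The following is a minimal sketch, not part of the original analysis: it maps each answer to `yes`, `no`, or `other` (anything that breaks the one-word rule), then reports, for each phenotype and column, the share of iterations that gave the most common answer. The helper names `normalise_answer()` and `agreement_by_phenotype()` and the toy input are assumptions made for illustration.

```r
library(data.table)

# Map free-text answers onto "yes" / "no"; anything else becomes "other".
normalise_answer <- function(x) {
  x <- tolower(trimws(as.character(x)))
  ifelse(x %in% c("yes", "no"), x, "other")
}

# For each phenotype and answer column, compute the fraction of iterations
# that gave the most common answer (1 means all iterations agree).
agreement_by_phenotype <- function(dt, answer_cols) {
  long <- melt(dt, id.vars = "phenotype", measure.vars = answer_cols,
               variable.name = "effect", value.name = "answer")
  long[, answer := normalise_answer(answer)]
  long[, .(agreement = max(table(answer)) / .N), by = .(phenotype, effect)]
}

# Toy input mimicking three iterations for one phenotype (not real output).
toy <- data.table(
  phenotype = rep("Coloboma", 3),
  `mental retardation` = c("Yes", "No", "Yes"),
  blindness = c("Yes", "May lead to vision problems or blindness", "yes")
)
res <- agreement_by_phenotype(toy, c("mental retardation", "blindness"))
print(res)
```

Applied to `res_allPheno_dt`, the second argument would be the nine effect columns plus `congenital onset`; a low score flags phenotype/column pairs worth checking by hand.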
### Attempt #2
What if I run the prompt one phenotype at a time, with 3 iterations?
# Split into individual terms; trimws() drops the line breaks left after each ";"
congenital_onset_split <- as.list(trimws(strsplit(congenital_onset, ";")[[1]]))
results_list <- list()
for (j in 1:3) {
  res_individualPheno <- lapply(seq_along(congenital_onset_split), function(i) {
    pheno <- congenital_onset_split[[i]]
    question <- paste("Does", pheno, "typically cause:", effects,
                      "Does", pheno, "have congenital onset?",
                      "You must give one-word yes or no answers and give a justification for why it does or doesn't have congenital onset.",
                      "You must provide the output in .tsv format with columns:", table_columns)
    question <- gsub("\n", "", question)
    print(question)
    all_res <- gptstudio::openai_create_chat_completion(prompt = question)
    fread(text = all_res[["choices"]]$message.content)
  })
  results_list[[j]] <- res_individualPheno # store the result of iteration j
}
# Flatten to one table per API call; "iteration" below therefore indexes calls, not j
res_flat <- unlist(results_list, recursive = FALSE)
res_individualPheno_dt <- data.table::rbindlist(res_flat, fill = TRUE, use.names = TRUE, idcol = "iteration")
res_individualPheno_dt <- res_individualPheno_dt[order(res_individualPheno_dt$phenotype), ]
Below is a subset of `res_individualPheno_dt`; I've shown the same phenotypes as for `res_allPheno_dt` for comparison. The answers seem to be more consistent across iterations when chatGPT is run on each phenotype individually.
phenotype | mental retardation | death | impaired mobility | physical malformations | blindness | sensory impairments | immunodeficiency | cancer | reduced fertility | congenital onset | justification | justification
-- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- |--
Atrioventricular canal defect | no | no | no | yes | no | no | no | no | no | yes | NA | Defect occurs during fetal development, therefore present at birth.
Atrioventricular canal defect | No | No | No | Yes | No | No | No | No | No | Yes | NA | Atrioventricular canal defect is a congenital heart defect. It is present at birth and develops as the heart forms during fetal development.
Atrioventricular canal defect | No | No | No | Yes | No | No | No | No | No | Yes | NA | Atrioventricular canal defect is a congenital heart defect that occurs during fetal development.
Cleft helix | No | No | No | Yes | No | No | No | No | No | Yes | NA | Cleft helix is a genetic condition that is present at birth, thus indicating that it has a congenital onset.
Cleft helix | No | No | No | Yes | No | No | No | No | No | Yes | NA | Cleft helix is a genetic condition, meaning it is present at birth and caused by inherited gene mutations. It is a congenital condition.
Cleft helix | No | No | No | Yes | No | No | No | No | No | Yes | NA | Congenital onset is indicated by the presence of a physical malformation at birth, which is true for cleft helix.
Coloboma | No | No | No | Yes | Yes | Yes | No | No | No | Yes | NA | Congenital onset means present at birth, and coloboma is a congenital condition that occurs when certain structures in the eye or other parts of the body don't develop properly during fetal growth. Therefore, it has a congenital onset.
Coloboma | No | No | No | Yes | Yes | Yes | No | No | No | Yes | NA | Congenital onset refers to a condition that is present at or before birth. Coloboma is a congenital condition, as it occurs when the eye doesn't develop properly during pregnancy.
Coloboma | no | no | no | yes | yes | yes | no | no | no | yes | NA | Coloboma is a congenital birth defect that affects the eyes, and it is usually present from birth. It is caused by abnormal development of the eye during gestation.
Cryptotia | No | No | No | Yes | No | No | No | No | No | Yes | NA | Cryptotia is a congenital ear anomaly.
Cryptotia | No | No | No | Yes | No | No | No | No | No | Yes | NA | Cryptotia is a congenital ear malformation that is present at birth.
Cryptotia | No | No | No | Yes | No | No | No | No | No | Yes | NA | Cryptotia is a congenital condition, meaning it is present at or before birth.
Cupped ear | No | No | No | Yes | No | No | No | No | No | Yes | NA | Cupped ear is associated with physical malformations and is present at birth (congenital).
Cupped ear | no | no | no | yes | no | no | no | no | no | yes | NA | The development of an ear occurs during fetal development, hence the onset of cupped ear is congenital.
Cupped ear | No | No | No | Yes | No | No | No | No | No | Yes | NA | It is a congenital deformity that occurs during fetal development.
Exstrophy | No | No | Yes | Yes | No | No | No | No | Yes | Yes | NA | It is a birth defect that occurs during fetal development.
Exstrophy | No | No | Yes | Yes | No | No | No | Yes | Yes | Yes | NA | Exstrophy is a congenital anomaly that occurs during fetal development. The anterior body wall fails to properly fuse together, resulting in the exposure of internal organs.
Exstrophy | No | No | Yes | Yes | No | No | No | No | Yes | Yes | NA | Consequence of abnormal embryonic development
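
Some replies in the first attempt did not follow the requested layout (missing cells, free-text answers), and such replies can distort the combined table. A defensive parser is one way to catch them before `rbindlist()`. The sketch below is an assumption, not the code used above: `parse_tsv_reply()` is a hypothetical helper that returns `NULL` when a reply does not contain exactly the expected columns, and the reply strings are made-up examples.

```r
library(data.table)

# Parse one model reply as TSV. Returns NULL (with a warning) if parsing fails
# or the columns differ from the expected set, so the reply can be dropped.
parse_tsv_reply <- function(reply_text, expected_cols) {
  dt <- tryCatch(
    # The trailing newline makes fread() treat the string as data, never as a path.
    fread(text = paste0(reply_text, "\n"), sep = "\t", header = TRUE),
    error = function(e) NULL
  )
  if (is.null(dt) || !setequal(tolower(names(dt)), tolower(expected_cols))) {
    warning("Reply did not match the expected TSV layout; skipping.")
    return(NULL)
  }
  setnames(dt, tolower(names(dt)))
  dt
}

expected <- c("phenotype", "death", "congenital onset", "justification")

# Made-up replies: one well-formed, one with missing columns.
good <- "phenotype\tdeath\tcongenital onset\tjustification\nColoboma\tNo\tYes\tPresent at birth."
bad <- "phenotype\tdeath\nColoboma\tNo"

parsed <- Filter(Negate(is.null), list(
  parse_tsv_reply(good, expected),
  suppressWarnings(parse_tsv_reply(bad, expected))
))
combined <- rbindlist(parsed, idcol = "reply")
print(combined)
```

Dropping malformed replies trades coverage for cleanliness; re-sending the prompt for a dropped reply would be an alternative.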
### Attempt #3
Here I'm repeating attempt #1 with the addition of providing chatGPT with the definition of each congenital onset term.
congenital_onset_dt <- data.table(
  preferredlabel = c("Syndactyly", "Ventricular septal defect", "Atrioventricular canal defect", "Atrial septal defect", "Abnormal connection of the cardiac segments", "Fetal anomaly", "Neural tube defect", "Coloboma", "Microtia", "Cryptotia", "Cupped ear", "Cleft helix", "Low-set ears", "Synotia", "Holoprosencephaly", "Exstrophy", "Abdominal wall defect", "Abnormal lung lobation", "Unilateral primary pulmonary dysgenesis"),
  HPO_ID = c("HP:0001159", "HP:0001629", "HP:0006695", "HP:0001631", "HP:0011545", "HP:0034057", "HP:0045005", "HP:0000589", "HP:0008551", "HP:0011252", "HP:0000378", "HP:0009902", "HP:0000369", "HP:0100663", "HP:0001360", "HP:0100548", "HP:0010866", "HP:0002101", "HP:0006549")
)
hpo_meta <- HPOExplorer::make_phenos_dataframe("HP:0000118")
congenital_onset_dt <- merge(congenital_onset_dt, hpo_meta)
phenos <- paste( paste0(congenital_onset_dt[[1]], " - ",congenital_onset_dt[[7]]), collapse="; " )
phenos <- gsub("\"\"","'", phenos)
question <- paste("Do:", phenos, ", typically cause:", effects,
                  "Do they have congenital onset?",
                  "You must give one-word yes or no answers and give a justification for why they do or don't have congenital onset.",
                  "You must provide the output in .tsv format with columns:", table_columns)
question <- gsub("\n", "", question)
n <- 5
# Send one prompt to chatGPT and parse the TSV reply into a data.table
run_chatgpt <- function(q) {
  all_res <- gptstudio::openai_create_chat_completion(prompt = q)
  fread(text = all_res[["choices"]]$message.content)
}
res_multiPheno_def <- lapply(seq_len(n), function(x) run_chatgpt(question))
res_multiPheno_def_dt <- data.table::rbindlist(res_multiPheno_def,fill = TRUE, use.names = TRUE, idcol = "iteration")
res_multiPheno_def_dt <- res_multiPheno_def_dt[order(res_multiPheno_def_dt$phenotype), ]
Here is a subset of `res_multiPheno_def_dt`. Including the definition in the prompt seems to (i) improve consistency in the results but (ii) reduce accuracy: for example, coloboma doesn't seem to be associated with mental retardation, and Atrioventricular canal defect does not 'typically' cause these outcomes if there is surgical intervention (see below the table for a more detailed answer from chatGPT for this phenotype).
iteration | phenotype | mental retardation | death | impaired mobility | physical malformations | blindness | sensory impairments | immunodeficiency | cancer | reduced fertility | congenital onset | justification
-- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | --
1 | Atrioventricular canal defect | Yes | Yes | No | Yes | No | No | No | No | No | Yes | Present at birth (congenital).
2 | Atrioventricular canal defect | Yes | Yes | Yes | Yes | No | No | No | No | No | Yes | This condition is present at birth and affects the heart.
3 | Atrioventricular canal defect | Yes | Yes | No | Yes | No | No | No | No | No | Yes | This is a defect in the atrioventricular septum of the heart which is a congenital defect.
4 | Atrioventricular canal defect | Yes | Yes | No | Yes | No | No | No | No | No | Yes | The term refers to a congenital heart defect that is present at birth (congenital).
5 | Atrioventricular canal defect | Yes | Yes | No | Yes | No | No | No | No | No | Yes | Congenital onset is specified in the definition.
1 | Cleft helix | No | No | No | Yes | No | No | No | No | No | Yes | Present at birth (congenital).
2 | Cleft helix | No | No | No | Yes | No | No | No | No | No | Yes | This is a congenital abnormality that affects the ear.
3 | Cleft helix | No | No | No | Yes | No | No | No | No | No | Yes | Cleft helix is a defect that is present since birth.
4 | Cleft helix | No | No | No | Yes | No | No | No | No | No | Yes | The term refers to a developmental defect of the helix of the ear that is present at birth (congenital).
5 | Cleft helix | No | No | No | Yes | No | No | No | No | No | Yes | Congenital onset is specified in the definition.
1 | Coloboma | Yes | No | No | Yes | Yes | Yes | No | No | No | Yes | Develops during fetal development and is present at birth (congenital).
2 | Coloboma | Yes | No | No | Yes | Yes | No | No | No | No | Yes | This is a developmental defect that is present at birth.
3 | Coloboma | Yes | Yes | No | Yes | Yes | Yes | No | No | No | Yes | Coloboma is a developmental defect that occurs during embryonic development.
4 | Coloboma | Yes | No | No | Yes | Yes | Yes | No | No | No | Yes | The term refers to a developmental defect of the eye that is present at birth (congenital).
5 | Coloboma | Yes | No | No | Yes | Yes | Yes | No | No | No | Yes | Congenital onset is specified in the definition.
1 | Cryptotia | No | No | Yes | Yes | No | No | No | No | No | Yes | Present at birth (congenital).
2 | Cryptotia | No | No | No | Yes | No | No | No | No | No | Yes | This is a congenital abnormality that affects the ear.
3 | Cryptotia | No | No | No | Yes | No | No | No | No | No | Yes | Cryptotia is present at birth.
4 | Cryptotia | No | No | No | Yes | No | No | No | No | No | Yes | The term refers to a developmental defect of the auricle of the ear that is present at birth (congenital).
5 | Cryptotia | No | No | No | Yes | No | No | No | No | No | Yes | Congenital onset is specified in the definition.
1 | Cupped ear | No | No | No | Yes | No | No | No | No | No | Yes | Present at birth (congenital).
2 | Cupped ear | No | No | No | Yes | No | No | No | No | No | Yes | This is a congenital abnormality that affects the ear.
3 | Cupped ear | No | No | No | Yes | No | No | No | No | No | Yes | This is a defect in ear folding which occurs during embryonic development.
4 | Cupped ear | No | No | No | Yes | No | No | No | No | No | Yes | The term refers to a developmental defect of the ear that is present at birth (congenital).
5 | Cupped ear | No | No | No | Yes | No | No | No | No | No | Yes | Congenital onset is specified in the definition.
1 | Exstrophy | Yes | Yes | Yes | Yes | No | No | No | No | No | Yes | Present at birth (congenital).
2 | Exstrophy | No | No | No | Yes | No | No | No | No | No | Yes | This is a developmental defect that is present at birth.
3 | Exstrophy | No | No | Yes | Yes | No | No | No | No | No | Yes | Exstrophy is a result of developmental defects in embryonic development.
4 | Exstrophy | No | No | Yes | Yes | No | No | No | No | No | Yes | The term refers to a developmental defect of the abdominal wall that is present at birth (congenital).
5 | Exstrophy | Yes | Yes | Yes | Yes | No | No | No | No | No | Yes | Congenital onset is specified in the definition.
<img width="865" alt="Screenshot 2023-03-27 at 11 14 53 am" src="https://user-images.githubusercontent.com/56632280/227913284-6f0a6968-813e-4d1b-a9d8-5c7b098fd41b.png">
### Attempt #4
Here I'm repeating attempt #2 with the addition of providing chatGPT with the definition of each congenital onset term.
results_list <- list()
for (j in 1:3) {
  res_indPheno_def <- lapply(seq_len(nrow(congenital_onset_dt)), function(i) {
    pheno <- congenital_onset_dt$preferredlabel[[i]]
    definition <- congenital_onset_dt$definition[[i]]
    question <- paste("Does", pheno, "-", definition, ", typically cause:", effects,
                      "Does", pheno, "have congenital onset?",
                      "You must give one-word yes or no answers and give a justification for why it does or doesn't have congenital onset.",
                      "You must provide the output in .tsv format with columns:", table_columns)
    question <- gsub("\n", "", question)
    # fixed = TRUE so that "." matches a literal full stop rather than any character
    question <- gsub(". , typically", ", typically", question, fixed = TRUE)
    all_res <- gptstudio::openai_create_chat_completion(prompt = question)
    fread(text = all_res[["choices"]]$message.content)
  })
  results_list[[j]] <- res_indPheno_def
}
# Flatten to one table per API call; "iteration" below therefore indexes calls, not j
res_flat <- unlist(results_list, recursive = FALSE)
res_indPheno_def_dt <- data.table::rbindlist(res_flat, fill = TRUE, use.names = TRUE, idcol = "iteration")
res_indPheno_def_dt <- res_indPheno_def_dt[order(res_indPheno_def_dt$phenotype), ]
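
The per-phenotype attempts build almost the same prompt, with or without a definition. One option is to put that logic in a single function. This is only a sketch under assumptions: `build_question()` is a hypothetical helper, and the shortened `effects`, `table_columns` and definition below are placeholders, not the strings or HPO text used in this issue.

```r
# Build the annotation prompt for one phenotype, optionally with its definition.
build_question <- function(pheno, effects, table_columns, definition = NULL) {
  subject <- pheno
  if (!is.null(definition)) {
    # Strip a trailing full stop so the definition joins cleanly to ", typically cause:".
    subject <- paste0(pheno, " - ", sub("\\.\\s*$", "", definition), ",")
  }
  question <- paste(
    "Does", subject, "typically cause:", effects,
    "Does", pheno, "have congenital onset?",
    "You must give one-word yes or no answers and give a justification",
    "for why it does or doesn't have congenital onset.",
    "You must provide the output in .tsv format with columns:", table_columns
  )
  # Collapse any line breaks or repeated spaces into single spaces.
  gsub("\\s+", " ", question)
}

# Placeholder inputs for illustration only.
effects_demo <- "death, blindness."
columns_demo <- "phenotype, death, blindness, congenital onset, justification."

q_plain <- build_question("Coloboma", effects_demo, columns_demo)
q_def <- build_question("Coloboma", effects_demo, columns_demo,
                        definition = "A placeholder definition of the term.")
print(q_plain)
print(q_def)
```

Keeping the prompt in one place means a wording change applies to every attempt, which makes the attempts easier to compare.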
Here is a subset of `res_indPheno_def_dt`.
iteration | phenotype | mental retardation | death | impaired mobility | physical malformations | blindness | sensory impairments | immunodeficiency | cancer | reduced fertility | congenital onset | justification | justification
-- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | --
5 | Atrioventricular canal defect | No | No | No | Yes | No | No | No | No | No | Yes | Cause is a defect of the atrioventricular septum which develops during fetal development, making it congenital. | NA
24 | Atrioventricular canal defect | No | No | No | Yes | No | No | No | No | No | Yes | Atrioventricular canal defect is a congenital heart defect, meaning it is present at birth. | NA
43 | Atrioventricular canal defect | No | Yes | No | Yes | No | No | No | No | No | Yes | Atrioventricular canal defect is a congenital heart defect that is present at birth. | NA
6 | Cleft helix | No | No | No | Yes | No | No | No | No | No | Yes | Cleft helix is a congenital malformation that occurs during fetal development. | NA
25 | Cleft helix | No | No | No | Yes | No | No | No | No | No | Yes | Cleft helix is a physical malformation that is present at birth and affects the ear. | NA
44 | Cleft helix | No | No | No | Yes | No | No | No | No | No | Yes | Cleft helix is a physical malformation of the ear that is present at birth, indicating a congenital onset. | NA
7 | Coloboma | No | No | No | Yes | Yes | No | No | No | No | Yes | Coloboma is a congenital condition as it results from incomplete closure of the optic fissure during embryonic development, which occurs during the early stages of fetal development. | NA
26 | Coloboma | No | No | No | Yes | Yes | Yes | No | No | No | Yes | Coloboma is a developmental defect that is present at birth, therefore it has a congenital onset. | NA
45 | Coloboma | no | no | no | yes | yes | yes | no | no | no | yes | It is a developmental defect, meaning it occurs during fetal development and is present at birth. | NA
8 | Cryptotia | No | No | No | Yes | No | No | No | No | No | Yes | Cryptotia is a congenital condition, meaning it is present at birth. It is caused by abnormal development of the ear during fetal development. | NA
27 | Cryptotia | No | No | No | Yes | No | No | No | No | No | Yes | Cryptotia is a congenital anomaly caused by abnormal development of the auricle in utero. | NA
46 | Cryptotia | No | No | No | Yes | No | No | No | No | No | Yes | Cryptotia is a congenital anomaly that develops during fetal growth and is present at birth. | NA
9 | Cupped ear | No | No | No | Yes | No | No | No | No | No | Yes | Cupped ear is a physical malformation that is present at birth, thus it has a congenital onset. | NA
28 | Cupped ear | No | No | No | Yes | No | No | No | No | No | Yes | Cupped ear is a physical malformation that is present at birth, indicating congenital onset. | NA
47 | Cupped ear | No | No | No | Yes | No | No | No | No | No | Yes | Cupped ear is a physical malformation that is present at birth and does not develop later in life. Therefore, it has a congenital onset. | NA
10 | Exstrophy | No | No | Yes | Yes | No | No | No | No | Yes | Yes | Exstrophy is a congenital birth defect that occurs during fetal development. | NA
29 | Exstrophy | No | No | Yes | Yes | No | No | No | No | Yes | Yes | Exstrophy is a congenital abnormality, present at birth. | NA
48 | Exstrophy | No | No | Yes | Yes | No | No | No | No | Yes | Yes | Exstrophy is a congenital condition that occurs during. | NA
@bschilder @NathanSkene
That prompt is not including the description of the phenotype, is it?
From: Kitty Murphy
Sent: Sunday, March 26, 2023 11:55:25 AM
To: neurogenomics/RareDiseasePrioritisation
Cc: Skene, Nathan G; Mention
Subject: Re: [neurogenomics/RareDiseasePrioritisation] Annotate diseases/phenotypes using chatGPT (Issue #19)
This email from @.*** originates from outside Imperial. Do not click on links and attachments unless you recognise the sender. If you trust the sender, add them to your safe senders listhttps://spam.ic.ac.uk/SpamConsole/Senders.aspx to disable email stamping for this address.
Annotating HPO phenotypes using chatGPT via gptstudio Set up
install.packages("gptstudio") library(gptstudio)
terms_dt = HPOExplorer::load_phenotype_to_genes(3) terms_cols = list(name="Phenotype", id="ID")
terms_dt_sub <.- unique(terms_dt[,unname(unlist(terms_cols)), with=FALSE])
Attempt #1https://github.com/neurogenomics/RareDiseasePrioritisation/issues/1
Here I'm using the congenital onset terms (without HPO ID) that were provided to us by Peter Robinson. Will also try:
congenital_onset <- "Syndactyly; Ventricular septal defect; Atrioventricular canal defect; Atrial septal defect; Abnormal connection of the cardiac segments; Fetal anomaly; Neural tube defect; Coloboma; Microtia; Cryptotia; Cupped ear; Cleft helix; Low-set ears; Synotia; Holoprosencephaly; Exstrophy; Abdominal wall defect; Abnormal lung lobation; Unilateral primary pulmonary dysgenesis"
effects <- "mental retardation, death, impaired mobility, physical malformations, blindness, sensory impairments, immunodeficiency, cancer, reduced fertility."
table_columns <- "phenotype, mental retardation, death, impaired mobility, physical malformations, blindness, sensory impairments, immunodeficiency, cancer, reduced fertility, congenital onset, jusitification."
question = paste("Do:", congenital_onset, ", typically cause:", effects, "Do they have congenital onset?", "You must give one-word yes or no answers and give a justification for why they do or don't have congenital onset.", "You must provide the output in .tsv format with columns:", table_columns) question <- gsub("\n", "", question)
n = 5 run_chatgpt <- function(q){ all_res <- gptstudio::openai_create_chat_completion(prompt = question) choices <- fread(all_res[["choices"]]$message.content) }
res_allPheno <- lapply(seq_len(n), function(x) run_chatgpt(1))
res_allPheno_dt <- data.table::rbindlist(res_list,fill = TRUE, use.names = TRUE, idcol = "iteration")
res_allPheno_dt <- res_allPheno_dt [order(res_allPheno_dt $phenotype), ]
Below is a subset of res_allPheno_dt. The answers chatGPT gives over iterations of the same prompt are not consistent e.g. look at mental retardation for coloboma. A coloboma is an area of missing tissue in your eye, and through a quick google search is not associated with mental retardation.
iteration phenotype mental retardation death impaired mobility physical malformations blindness sensory impairments immunodeficiency cancer reduced fertility congenital onset justification 1 Atrioventricular canal defect Yes Yes Yes Yes No No No No No Yes Congenital heart defect present at birth 2 Atrioventricular canal defect Yes, in some cases May lead to premature death no May lead to growth failure, fatigue or rapid breathing May lead to vision problems None None None No AV canal defect is present at birth and is a congenital condition. 3 Atrioventricular canal defect Yes Yes No Yes No No No No No Yes Atrioventricular canal defect is a congenital heart defect in which there is an opening in the center of the heart where the walls separating the heart chambers should be. 4 Atrioventricular canal defect Yes Possible None Physical malformations No No No No No Yes Congenital onset is typical of this phenotype as it is a result of abnormal development of the heart during fetal development. 5 Atrioventricular canal defect Yes Yes No Yes No No No No No Yes It is a congenital heart defect that is present at birth. 1 Cleft helix No No No Yes No No No No No Yes Congenital ear malformation present at birth 2 Cleft helix No None None May lead to physical malformations of the ear None None None None Yes Cleft helix is present at birth and is a congenital condition. 3 Cleft helix No No No Yes No No No No No Yes Cleft helix is a congenital anomaly characterized by a cleft or gap in the top part of the ear. 4 Cleft helix No None None Physical malformations No No No No No Yes Congenital onset is typical of this phenotype as it is a result of incomplete development of the ear during fetal development. 5 Cleft helix No No No Yes No No No No No Yes A cleft helix is a rare congenital malformation of the ear. 
1 Coloboma Yes No No Yes Yes Yes No No No Yes Present at birth and can affect vision and eye structure 2 Coloboma No May lead to vision problems or blindness May depend on location on the body None May lead to vision problems or blindness May lead to hearing loss or deafness None None No Coloboma is present at birth and is a congenital condition. 3 Coloboma Yes No No Yes Yes Yes No No No Yes Coloboma is a congenital anomaly characterized by a gap or hole in one of the structures of the eye. 4 Coloboma No None None Physical malformations Possible Possible No No No Yes Congenital onset is typical of this phenotype as it is a result of incomplete fusion of the tissues that form the eye during fetal development. 5 Coloboma Yes No No Yes Yes No No No No Yes A coloboma is a birth defect that affects the eye. 1 Cryptotia No No No Yes No No No No No Yes Congenital ear malformation present at birth 2 Cryptotia No None None May lead to physical malformations of the ear None None None None Yes Cryptotia is present at birth and is a congenital condition. 3 Cryptotia No No No Yes No No No No No Yes Cryptotia is a congenital anomaly characterized by a hidden ear that is partially or completely covered by skin. 4 Cryptotia No None None Physical malformations No No No No No Yes Congenital onset is typical of this phenotype as it is a result of abnormal development of the ear during fetal development. 5 Cryptotia No No No Yes No No No No No Yes Cryptotia is a congenital ear deformity. 1 Cupped ear No No No Yes No No No No No Yes Congenital ear malformation present at birth 2 Cupped ear No None None May lead to physical malformations of the ear None None None None Yes Cupped ear is present at birth and is a congenital condition. 3 Cupped ear No No No Yes No No No No No Yes Cupped ear is a congenital anomaly characterized by an ear that is shaped like a cup and protrudes outward from the side of the head. 
4 Cupped ear No None None Physical malformations No No No No No Yes Congenital onset is typical of this phenotype as it is a result of abnormal development of the ear during fetal development. 5 Cupped ear No No No Yes No No No No No Yes A cupped ear is a congenital malformation. 1 Exstrophy Yes No Yes Yes No No No No No Yes Present at birth and affects bladder and pelvic development 2 Exstrophy No None None May lead to physical malformations of the abdominal wall or pelvic organs None None None May lead to reduced fertility Yes Exstrophy is present at birth and is a congenital condition. 3 Exstrophy Yes No Yes Yes No No No No No Yes Exstrophy is a congenital anomaly characterized by a defect in the abdominal wall or bladder. 4 Exstrophy No None None Physical malformations No No No No No Yes Congenital onset is typical of this phenotype as it is a result of abnormal development of the abdominal wall during fetal development. 5 Exstrophy Yes No Yes Yes No No No No No Yes Exstrophy is a congenital abnormality where the bladd Attempt #2https://github.com/neurogenomics/RareDiseasePrioritisation/issues/2
What if I run the prompt one phenotype at a time, with 3 iterations?
# Split on ";" plus any trailing whitespace, since the input string contains line breaks
congenital_onset_split <- trimws(strsplit(congenital_onset, ";\\s*")[[1]])
results_list <- list()
for (j in 1:3) {
  res_individualPheno <- lapply(congenital_onset_split, function(pheno){
    question <- paste("Does", pheno, "typically cause:", effects,
                      "Does", pheno, "have congenital onset?",
                      "You must give one-word yes or no answers and give a justification for why it does or doesn't have congenital onset.",
                      "You must provide the output in .tsv format with columns:", table_columns)
    question <- gsub("\n", "", question)
    print(question)
    all_res <- gptstudio::openai_create_chat_completion(prompt = question)
    # Parse the .tsv text returned by the model into a table
    data.table::fread(text = all_res[["choices"]]$message.content)
  })
  # Combine the per-phenotype tables and store them under this iteration
  results_list[[j]] <- data.table::rbindlist(res_individualPheno, fill = TRUE, use.names = TRUE)
}
# Stack the iterations; "iteration" records which run each row came from
res_individualPheno_dt <- data.table::rbindlist(results_list, fill = TRUE, use.names = TRUE, idcol = "iteration")
res_individualPheno_dt <- res_individualPheno_dt[order(res_individualPheno_dt$phenotype), ]
Below is a subset of res_individualPheno_dt, showing the same phenotypes as for res_allPheno_dt for comparison. There seems to be more consistency across the iterations when ChatGPT is run on each phenotype individually.
phenotype | mental retardation | death | impaired mobility | physical malformations | blindness | sensory impairments | immunodeficiency | cancer | reduced fertility | congenital onset | justification
Atrioventricular canal defect | no | no | no | yes | no | no | no | no | no | yes | NA
Atrioventricular canal defect | No | No | No | Yes | No | No | No | No | No | Yes | NA
Atrioventricular canal defect | No | No | No | Yes | No | No | No | No | No | Yes | NA
Cleft helix | No | No | No | Yes | No | No | No | No | No | Yes | NA
Cleft helix | No | No | No | Yes | No | No | No | No | No | Yes | NA
Cleft helix | No | No | No | Yes | No | No | No | No | No | Yes | NA
Coloboma | No | No | No | Yes | Yes | Yes | No | No | No | Yes | NA
Coloboma | No | No | No | Yes | Yes | Yes | No | No | No | Yes | NA
Coloboma | no | no | no | yes | yes | yes | no | no | no | yes | NA
Cryptotia | No | No | No | Yes | No | No | No | No | No | Yes | NA
Cryptotia | No | No | No | Yes | No | No | No | No | No | Yes | NA
Cryptotia | No | No | No | Yes | No | No | No | No | No | Yes | NA
Cupped ear | No | No | No | Yes | No | No | No | No | No | Yes | NA
Cupped ear | no | no | no | yes | no | no | no | no | no | yes | NA
Cupped ear | No | No | No | Yes | No | No | No | No | No | Yes | NA
Exstrophy | No | No | Yes | Yes | No | No | No | No | Yes | Yes | NA
Exstrophy | No | No | Yes | Yes | No | No | No | Yes | Yes | Yes | NA
Exstrophy | No | No | Yes | Yes | No | No | No | No | Yes | Yes | NA
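Some iterations in the table above return lower-case answers ("no" rather than "No"), so a case-sensitive comparison would count them as disagreements. A minimal sketch of normalising the answers before comparing iterations is shown here; `normalise_answer` is a hypothetical helper, and the toy rows simply copy the blindness column for Coloboma and Cupped ear.

```r
# Toy rows copied from the blindness column of the table above
res <- data.frame(
  phenotype = c("Coloboma", "Coloboma", "Coloboma",
                "Cupped ear", "Cupped ear", "Cupped ear"),
  blindness = c("Yes", "Yes", "yes", "No", "no", "No"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: trim whitespace and capitalise only the first letter
normalise_answer <- function(x) {
  x <- tolower(trimws(x))
  paste0(toupper(substr(x, 1, 1)), substr(x, 2, nchar(x)))
}

res$blindness <- normalise_answer(res$blindness)

# Number of distinct answers per phenotype (1 means all iterations agree)
n_distinct <- tapply(res$blindness, res$phenotype,
                     function(x) length(unique(x)))
n_distinct
```

Without the normalisation step, both phenotypes here would show two distinct answers even though the iterations agree in substance.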
@bschilder @NathanSkene
Nice progress @KittyMurphy. It's interesting that the responses are more consistent when the phenotypes are provided individually. I wonder if this has to do with the informational overload we were discussing before. It might be an aspect of ChatGPT that other people have noticed and documented.
One thing that would be helpful is to come up with a function that computes consistency scores for each metric. That will give us at least some quantitative measure of performance (tho not exactly the ground truth). Something like:
dat=xlsx::read.xlsx("~/Downloads/annot.xlsx",1)
avg <- dplyr::group_by(dat, phenotype) |> dplyr::summarise( mental.retardation_consistency=1/length(unique(mental.retardation)))
avg
After computing the within phenotype consistency, you can compute mean consistency:
mean(avg$mental.retardation_consistency)
# 0.75
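The same score could be computed for every annotation column at once. The following is only a sketch in base R: `consistency_scores` is a hypothetical helper (not part of any package mentioned here) and the toy data are invented for illustration.

```r
# Hypothetical helper generalising the 1/length(unique(x)) score to several columns
consistency_scores <- function(dat, id_col = "phenotype", metric_cols) {
  per_pheno <- lapply(metric_cols, function(m) {
    # Within-phenotype consistency: 1 if all runs agree, 0.5 if two answers, ...
    tapply(dat[[m]], dat[[id_col]], function(x) 1 / length(unique(x)))
  })
  names(per_pheno) <- metric_cols
  # Mean within-phenotype consistency for each metric
  vapply(per_pheno, mean, numeric(1))
}

# Toy data (invented): phenotype B gets two different answers for "death"
dat <- data.frame(
  phenotype = c("A", "A", "B", "B"),
  death     = c("Yes", "Yes", "No", "Yes"),
  cancer    = c("No", "No", "No", "No")
)
scores <- consistency_scores(dat, metric_cols = c("death", "cancer"))
scores
```

Note that this score only measures agreement between runs; a phenotype can be annotated consistently and still be wrong.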
That prompt is not including the description of the phenotype, is it?
@NathanSkene I believe this is only providing ChatGPT with the name of the phenotype, not the full description of it. Thus, any other information about the disease is being pulled from the LLM itself.
Good idea to get some stats on it. Could also use the scoring to compare ChatGPT 3 vs 4 consistency; I expect some folks will be interested.
Including the HPO description might help it get a more consistent understanding of what the phenotype is. Brian, do you know how the descriptions can be accessed programmatically?
Already working on adding the description. @bschilder, I assume the best way to get this is to use the definition column in HPOExplorer::make_phenos_dataframe?
Yeah, that'll work. Or the subfunction which is more direct:
HPOExplorer::add_hpo_definition()
The current prompts do not include a statement for "Do not consider indirect effects". Would be worth adding this in and seeing if it makes any difference.
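One way to test this would be to build the prompt from optional parts, so the same code can produce prompts with and without the extra statement (and with or without a phenotype definition). This is only a sketch: `build_prompt` and its argument names are hypothetical, and the definition string is a placeholder rather than real HPO text.

```r
# Hypothetical prompt builder; argument names are illustrative only
build_prompt <- function(pheno, effects, table_columns,
                         definition = NULL, direct_only = TRUE) {
  parts <- c(
    paste("Does", pheno, "typically cause:", effects),
    paste("Does", pheno, "have congenital onset?"),
    # Optional: include the phenotype definition when one is supplied
    if (!is.null(definition)) paste("Definition of the phenotype:", definition),
    "You must give one-word yes or no answers.",
    # Optional: the statement suggested above
    if (direct_only) "Do not consider indirect effects.",
    paste("You must provide the output in .tsv format with columns:", table_columns)
  )
  paste(parts, collapse = " ")
}

prompt_direct <- build_prompt(
  pheno = "Coloboma",
  effects = "blindness, cancer.",
  table_columns = "phenotype, blindness, cancer, congenital onset, justification.",
  definition = "<definition text goes here>"  # placeholder, not real HPO text
)
prompt_any <- build_prompt("Coloboma", "blindness, cancer.",
                           "phenotype, blindness, cancer, congenital onset, justification.",
                           direct_only = FALSE)
```

Running both variants on the same phenotypes would then show whether the statement changes the annotations.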
I tried out AutoGPT to see if this might be a useful avenue. Here’s what I learned:
Here is my favorite example of how AutoGPT can be very lazy 😅
I have now performed a trial run annotating phenotypes using ChatGPT via Selenium. Initially we asked GPT to provide the output in .tsv format, but I had difficulty extracting this from the chat interface into Python. To overcome this, I asked GPT to provide the output as Python code that I could then run to generate a data frame. @bschilder noted that earlier versions of GPT could sometimes be lazy when asked for code.
Here is a prompt example: "I need to annotate phenotypes as to whether they typically cause: intellectual disability, death, impaired mobility, physical malformations, blindness, sensory impairments, immunodeficiency, cancer, reduced fertility? Do they have congenital onset? You must give one-word yes or no answers. Do not consider indirect effects. You must provide the output in python code as a data frame called df with columns: phenotype, intellectual_disability, death, impaired_mobility, physical_malformations, blindness, sensory_impairments, immunodeficiency, cancer, reduced_fertility, congenital_onset, justification. These are the phenotypes: Abnormality of body height; Multicystic kidney dysplasia; Autosomal dominant inheritance; Autosomal recessive inheritance; Abnormal morphology of female internal genitalia; Functional abnormality of the bladder; Recurrent urinary tract infections; Neurogenic bladder; Urinary urgency; Hypoplasia of the uterus; Abnormality of the bladder; Bladder diverticulum"
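The phenotype list at the end of the prompt could be generated by splitting the full set of names into fixed-size batches. A minimal sketch follows, assuming a batch size chosen by the user; `batch_phenotypes` is a hypothetical helper and the phenotype names are taken from the example prompt above.

```r
# Hypothetical helper: split phenotype names into batches and join each with "; "
batch_phenotypes <- function(phenos, batch_size = 12) {
  # Assign consecutive names to groups 1, 1, ..., 2, 2, ...
  groups <- ceiling(seq_along(phenos) / batch_size)
  vapply(split(phenos, groups), paste, character(1), collapse = "; ")
}

phenos <- c("Neurogenic bladder", "Urinary urgency", "Hypoplasia of the uterus",
            "Abnormality of the bladder", "Bladder diverticulum")
batches <- batch_phenotypes(phenos, batch_size = 2)
batches
```

Each element of `batches` can then be appended after "These are the phenotypes:" in a separate prompt.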
Here is the trial run using ~100 phenotypes (note, there are ~200 because I think I appended the results twice by mistake): annot_HPO_gpt_test.csv
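If the results really were appended twice, exact duplicate rows could be dropped before analysis. A small sketch on invented toy data (the real file's column names may differ):

```r
# Toy annotations (invented); the second row is an exact copy of the first
annot <- data.frame(
  phenotype         = c("A", "A", "B", "A"),
  reduced_fertility = c("No", "No", "Yes", "Yes")
)

# Keep the first occurrence of each fully identical row
annot_dedup <- annot[!duplicated(annot), ]
nrow(annot_dedup)
```

This also removes rows from genuinely separate runs that happened to agree, so if repeated runs are needed for consistency checks it would be safer to store a run identifier column and deduplicate on that.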
@NathanSkene noted that the phenotype 'Azoospermia' is not being annotated as reducing fertility. This is worrying, as a literature search for this phenotype returns: "Azoospermia is the complete absence of spermatozoa in the ejaculate. It is the most severe and one of the leading causes of male infertility. The exact pathophysiology of azoospermia is not always known. Azoospermia can be due to pre-testicular, testicular, and post-testicular causes."
Next, I want to:
Thanks @KittyMurphy !
A couple of other ideas for reducing token usage (tho whether this helps will depend on how OpenAI counts 'tokens', which I'm still not totally clear on):
All of the annotation validation procedures described below can be rerun with any new annotations using the new internal function: HPOExplorer:::check_annot_gpt
https://github.com/neurogenomics/HPOExplorer/blob/master/R/check_annot_gpt.R
Check that ChatGPT hasn't modified the phenotype names in a way that prevents us from linking them back to the input HPO terms.
d <- data.table::fread(path, key = "Phenotype")
annot <- HPOExplorer::load_phenotype_to_genes()
d$Phenotype[!d$Phenotype %in% annot$Phenotype]
# character(0)
✅ All phenotypes appear verbatim in the HPO gene annotations file.
For phenotypes that ChatGPT annotated more than once, how consistent are the Y/N annotations it gave for each?
nm <- names(d)[!names(d) %in% c("Phenotype","Justification")]
d_mean <- d[,lapply(.SD,function(x){mean(x=="Yes")}),.SDcols=nm, by="Phenotype"]
# Proportion of phenotypes whose runs were unanimous (mean "Yes" rate of exactly 0 or 1)
d_consist <- lapply(d_mean[,-1], function(x)sum(x%in%c(0,1))/nrow(d_mean))
d_consist
$Intellectual_Disability
[1] 1
$Death
[1] 1
$Impaired_Mobility
[1] 1
$Physical_Malformations
[1] 1
$Blindness
[1] 1
$Sensory_Impairments
[1] 1
$Immunodeficiency
[1] 1
$Cancer
[1] 1
$Reduced_Fertility
[1] 0.7708333
$Congenital_Onset
[1] 1
mean(unlist(d_consist))
# 0.9770833
✅ At least in this small subsample, 9/10 annotation columns are 100% consistent across ChatGPT runs, giving an average consistency score of 97.7% across all annotations. "Reduced_Fertility" is one to look out for, as it does not always receive the same annotation here (77%, which may not seem too bad, but remember that the baseline is 50% as the options are binary).
As some of these phenotypes belong to specific branches of the HPO that should guarantee a particular annotation (e.g. all forms of blindness phenotypes cause Blindness: 'Yes'), we can use this information to validate the ChatGPT-provided annotations.
While we can confirm annotations that we would expect (true positives vs. false negatives), this doesn't really let us definitively say whether some phenotypes do NOT cause a given condition such as blindness (true negatives).
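As a rough check on the baseline: if each run gave a uniformly random, independent yes/no answer (an assumption made only for this illustration), the chance that k runs of the same phenotype all agree would be 2 * 0.5^k. That equals 50% for two runs and falls as more runs are added. `chance_agreement` is a hypothetical helper.

```r
# Probability that k independent, uniformly random yes/no answers all agree:
# either all "Yes" (0.5^k) or all "No" (0.5^k)
chance_agreement <- function(k) 2 * 0.5^k

chance_agreement(2:4)
```

So the 50% figure corresponds to phenotypes annotated exactly twice; for phenotypes annotated more often, the chance level under this assumption would be lower.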
d$HPO_ID <- harmonise_phenotypes(phenotypes = d$Phenotype,
as_hpo_ids = TRUE)
## Find matching HPO branches
hpo <- get_hpo()
queries <- list(
  Intellectual_Disability=c("intellectual disability"),
  Impaired_Mobility=c("Abnormal central motor function",
                      "Abnormality of movement"),
  Physical_Malformations=c("malformation","morphology"),
  Blindness=c("^blindness"),
  Sensory_Impairments=c("Abnormality of vision",
                        "Abnormality of the sense of smell",
                        "Abnormality of taste sensation",
                        "Somatic sensory dysfunction",
                        "Hearing abnormality"
  ),
  Immunodeficiency=c("Immunodeficiency"),
  Cancer=c("Neoplasm","Cancer"),
  Reduced_Fertility=c("Decreased fertility")
)
## For each annotation, collect the matching HPO terms and all of their descendants
tiers <- lapply(queries, function(q){
  terms <- grep(paste(q,collapse = "|"),
                hpo$name,
                ignore.case = TRUE, value = TRUE)
  ontologyIndex::get_descendants(ontology = hpo,
                                 roots = names(terms),
                                 exclude_roots = FALSE) |>
    unique()
})
## For each row, check the annotation only where the phenotype falls in the relevant branch
annot_check <- lapply(seq_len(nrow(d)), function(i){
  r <- d[i,]
  cbind(
    r[,c("Phenotype","HPO_ID")],
    lapply(stats::setNames(names(tiers),names(tiers)),
           function(x){
             if(r$HPO_ID %in% tiers[[x]]){
               r[,x,with=FALSE][[1]]=="Yes"
             } else {
               NA
             }
           }) |> data.table::as.data.table()
  )
}) |> data.table::rbindlist()
### Proportion of rows where the annotation could not be checked (NA)
missing_rate <- sapply(
annot_check[,names(tiers),with=FALSE],
function(x){sum(is.na(x))/length(x)})
missing_rate
Intellectual_Disability Impaired_Mobility Physical_Malformations
1.0000000 1.0000000 0.4558824
Blindness Sensory_Impairments Immunodeficiency
1.0000000 1.0000000 1.0000000
Cancer Reduced_Fertility
0.9901961 0.9607843
### Proportion of checkable rows where the annotation matched the expectation (TRUE)
true_pos_rate <- sapply(annot_check[,names(tiers),with=FALSE], function(x){sum(na.omit(x)==TRUE)/length(na.omit(x))})
true_pos_rate
Intellectual_Disability Impaired_Mobility Physical_Malformations
NaN NaN 0.5765766
Blindness Sensory_Impairments Immunodeficiency
NaN NaN NaN
Cancer Reduced_Fertility
1.0000000 0.5000000
### Proportion of checkable rows where the annotation did not match the expectation (FALSE)
false_neg_rate <- sapply(annot_check[,names(tiers),with=FALSE], function(x){sum(na.omit(x)==FALSE)/length(na.omit(x))})
false_neg_rate
Intellectual_Disability Impaired_Mobility Physical_Malformations
NaN NaN 0.4234234
Blindness Sensory_Impairments Immunodeficiency
NaN NaN NaN
Cancer Reduced_Fertility
0.0000000 0.5000000
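Because most columns have a missing rate close to 1, the rates above rest on very few checkable rows (for example, a true positive rate of 1.0 for Cancer alongside a missing rate of 0.99). It may help to report the number of checkable rows next to each rate. A minimal sketch on an invented stand-in for annot_check; `summarise_checks` is a hypothetical helper.

```r
# Toy stand-in for annot_check (invented values): TRUE/FALSE where checkable, NA otherwise
annot_check_toy <- data.frame(
  Cancer            = c(TRUE, NA, TRUE),
  Reduced_Fertility = c(TRUE, FALSE, NA),
  Blindness         = c(NA, NA, NA)
)

# Hypothetical helper: count checkable rows and compute the rates for one column
summarise_checks <- function(x) {
  checkable <- x[!is.na(x)]
  c(n_checkable   = length(checkable),
    missing_rate  = mean(is.na(x)),
    true_pos_rate = if (length(checkable)) mean(checkable) else NA_real_)
}

# One row per annotation column
check_summary <- t(sapply(annot_check_toy, summarise_checks))
check_summary
```

Returning NA rather than NaN for columns with no checkable rows also makes it explicit that nothing was tested there.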
I have since updated the prompt twice.
Example prompt 1.1: I need to annotate phenotypes as to whether they typically cause: intellectual disability, death, impaired mobility, physical malformations, blindness, sensory impairments, immunodeficiency, cancer, reduced fertility? Do they always have congenital onset? You must give one-word yes or no answers. Do not consider indirect effects. You must provide the output in python code as a data frame called df with columns: phenotype, intellectual_disability, death, impaired_mobility, physical_malformations, blindness, sensory_impairments, immunodeficiency, cancer, reduced_fertility, congenital_onset. Also add justification columns for each outcome. These are the phenotypes: Recurrent urinary tract infections; Neurogenic bladder; Urinary urgency
Here are the results for ~500 phenotypes: gpt_hpo_annotations.csv. The issue here was that for some of the phenotypic outcomes we were getting answers other than yes or no, e.g. 'can be', 'may be'. To get around this, we decided to add a scale for the phenotypic outcomes: instead of yes or no answers, we ask ChatGPT to answer using a scale of never, rarely, often, always. Because of token limits, we had to drop the number of phenotypes in each prompt to two.
Example prompt 1.2: I need to annotate phenotypes as to whether they typically cause: intellectual disability, death, impaired mobility, physical malformations, blindness, sensory impairments, immunodeficiency, cancer, reduced fertility? Do they have congenital onset? To answer, use a severity scale of: never, rarely, often, always. Do not consider indirect effects. You must provide the output in python code as a data frame called df with columns: phenotype, intellectual_disability, death, impaired_mobility, physical_malformations, blindness, sensory_impairments, immunodeficiency, cancer, reduced_fertility, congenital_onset. Also add justification columns for each outcome. These are the phenotypes: Urinary urgency; Hypoplasia of the uterus
Here are the results so far: gpt_hpo_annotations_scale.csv
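For downstream analysis, the four-level answers could be stored as an ordered factor, which also flags any answer that falls outside the scale. A sketch under the assumption that answers come back as free text; `to_scale` is a hypothetical helper and the example answers are invented.

```r
scale_levels <- c("never", "rarely", "often", "always")

# Hypothetical helper: map free-text answers onto the four-level scale;
# anything not on the scale (e.g. "can be") becomes NA
to_scale <- function(x) {
  factor(tolower(trimws(x)), levels = scale_levels, ordered = TRUE)
}

answers <- c("Often", "never", "can be", "Always ")
scaled <- to_scale(answers)

# Numeric coding: 0 = never ... 3 = always; NA marks off-scale answers
scores <- as.integer(scaled) - 1L
scores
```

Counting the NA values per column would give a quick measure of how often the model still ignores the requested scale.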
Currently waiting for help from Eugene to get this set up on a remote machine so that it can run 24/7, and it will probably take ~2 weeks.
@KittyMurphy I'm looking into some resources that might be helpful:
ChatGPT File uploader (google chrome extension) https://chrome.google.com/webstore/detail/chatgpt-file-uploader-ext/becfinhbfclcgokjlobojlnldbfillpf/
Bing Chat: Microsoft's iteration of ChatGPT: https://www.bing.com/
annot=HPOExplorer::load_phenotype_to_genes()
length(unique(annot$hpo_name))
# [1] 10969
@KittyMurphy is running the last of these now.
hpo=HPOExplorer::get_hpo()
length(unique(hpo$name))
# [1] 18057
I've actually been using the below code to get the phenotypes:
annot <- HPOExplorer::make_phenos_dataframe()
length(unique(annot$hpo_id))
[1] 10954
I'll make sure I run the remaining 15 phenotypes that are called with HPOExplorer::load_phenotype_to_genes(), but just wanted to flag the discrepancy between the two.
@KittyMurphy make_phenos_dataframe calls load_phenotype_to_genes to get the data, so they should be the same (unless somehow certain phenotypes get filtered in the former function).
https://github.com/neurogenomics/HPOExplorer/blob/master/R/make_phenos_dataframe.R
Could you check whether this discrepancy stems from :
HPOExplorer
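One way to pin down which phenotypes account for the difference is to compare the two sets directly. Note also that the counts above were taken on different columns (unique hpo_name in one case, unique hpo_id in the other), which may contribute to the gap. The sketch below uses invented IDs as stand-ins for the values returned by the two functions.

```r
# Invented stand-ins for the phenotype identifiers returned by each function
ids_p2g    <- c("HP:1", "HP:2", "HP:3", "HP:4")  # e.g. from load_phenotype_to_genes()
ids_phenos <- c("HP:1", "HP:2", "HP:4")          # e.g. from make_phenos_dataframe()

# Identifiers present in one set but not the other
only_in_p2g    <- setdiff(ids_p2g, ids_phenos)
only_in_phenos <- setdiff(ids_phenos, ids_p2g)

only_in_p2g
```

Comparing like with like (IDs against IDs, or names against names) should show whether the 15 missing phenotypes are filtered out or simply counted differently.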
(checked boxes indicate at least an initial attempt has been made)
Annotations
Models
Related
Some of my initial attempts are documented within this R package: https://github.com/neurogenomics/gptPhD
@KittyMurphy once you have a chance please report your progress here. I'll do the same.