Closed bschilder closed 1 year ago
MultiEWCE::prioritise_targets
Suggestions by @NathanSkene
If we want to make the jump to suggesting that the confidence scores extend to phenotypes, then it would be good to show that higher-confidence genes have higher-confidence associations with their celltypes than lower-confidence genes do. We'll have fewer high-confidence genes than low-confidence ones, so we'd need to do the comparison using equal numbers of each. We could run EWCE on subsampled low-confidence genes (for one trait) and test whether the average cell type enrichment is lower for lower-confidence genes. Or we could try some kind of genome/phenome-wide test for whether confidence ~ specificity.
[x] Add these evidence scores as a new argument in `MultiEWCE::prioritise_targets`. Done.
[x] Test out how many associations remain after filtering at each evidence level. The numbers will change once we change the prior filtering steps based on phenotype metadata (currently being done by @KittyMurphy via chatGPT), but I can at least begin to get a sense of how many associations are left after filtering at each evidence level.
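# Load the HPO gene annotations and attach the evidence scores
phenos <- load_phenotype_to_genes()
phenos2 <- add_evidence(phenos = phenos,
                        default_score = NULL)
# Check that all three aggregated score columns were added
testthat::expect_true(
  all(c("evidence_score_min",
        "evidence_score_max",
        "evidence_score_mean") %in% names(phenos2))
)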
# Count what remains after filtering at each evidence level (NA = no filter)
rep <- lapply(c(NA, seq(0, 6)), function(x){
  vars <- c("evidence_score_min", "evidence_score_max", "evidence_score_mean")
  lapply(stats::setNames(vars, vars),
         function(v){
           p <- if(is.na(x)){
             phenos2
           } else {
             phenos2[get(v) >= x, ]
           }
           data.table::data.table(
             evidence_score_filter = toString(x),
             rows = nrow(p),
             diseases = length(unique(p$DatabaseID)),
             phenotypes = length(unique(p$HPO_ID)),
             genes = length(unique(p$Gene))
           )
         }) |> data.table::rbindlist(use.names = TRUE,
                                     idcol = "evidence_score_variable")
}) |> data.table::rbindlist()
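# Reshape to long format: one row per score variable, filter level and metric
rep_melt <- data.table::melt.data.table(
  rep,
  measure.vars = names(rep)[-seq_len(2)],
  variable.name = "metric",
  value.name = "count")
# Keep the filter levels in their original order (NA first, then 0 to 6)
rep_melt$evidence_score_filter <- factor(rep_melt$evidence_score_filter,
                                         levels = unique(rep_melt$evidence_score_filter),
                                         ordered = TRUE)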
ggplot2::ggplot(rep_melt,
                ggplot2::aes(x = evidence_score_filter,
                             y = count,
                             # explicit group so the smoother spans the discrete x levels
                             group = evidence_score_variable,
                             shape = evidence_score_variable,
                             color = evidence_score_variable)) +
  ggplot2::geom_smooth(se = FALSE, na.rm = FALSE) +
  ggplot2::geom_point(size = 3, alpha = .8, na.rm = FALSE) +
  ggplot2::facet_wrap(facets = metric ~ .,
                      scales = "free") +
  ggplot2::theme_bw()
Filtering HPO annotations at each evidence level (or greater) shows that there is a nice smooth decrease in the number of associations/diseases/phenotypes/genes (as opposed to the vast majority of the data dropping off precipitously at some threshold, which would indicate that the evidence annotations are skewed).
At the most stringent evidence level (6) we still have quite a few phenotypes/diseases/genes, which I think bodes well for prioritising gene therapy targets (we'll have to wait until we have the chatGPT annotations to see how this bears out in the prioritisation pipeline).
- [x] Test for relationships between specificity and evidence strength per gene per cell-type.
We don't have the new set of enrichment results yet (still waiting for the HPO annotations to be fixed https://github.com/neurogenomics/HPOExplorer/issues/37).
But in the meantime, I can at least begin to look into this using our old results.
I investigated the relationship between evidence score and specificity (within each celltype) with a genome/phenome-wide strategy that uses all of our results.
See here for full Rmarkdown report code. I chose not to render it as an html because the data is rather large, instead using it more as a way of documenting the code for these analyses: https://github.com/neurogenomics/RareDiseasePrioritisation/blob/master/reports/evidence_scores.Rmd
Our phenotype EWCE results are obviously at the level of phenotype, while the evidence scores are purely at the disease level. This means you have multiple evidence scores within a given celltype-phenotype association.
# Load the phenotype-level results, then attach genes, evidence scores and CTD metrics
res <- MultiEWCE::load_example_results("Descartes_All_Results_extras.rds")
res <- HPOExplorer::add_genes(res, allow.cartesian = TRUE)
res <- HPOExplorer::add_evidence(phenos = res)
ctd_out <- MultiEWCE::add_ctd(results = res)
res <- ctd_out$results
# Average each metric within every phenotype x celltype combination
df <- res[, list(specificity = mean(specificity),
                 mean_exp = mean(mean_exp),
                 evidence_score_min = mean(evidence_score_min),
                 evidence_score_mean = mean(evidence_score_mean),
                 evidence_score_max = mean(evidence_score_max)),
          by = c("HPO_ID", "CellType")]
I computed mean evidence score within each celltype-phenotype combination and checked for correlation, but it seems there is no relationship:
> cor(df$evidence_score_mean,
+ df$mean_exp)
[1] -0.02194124
> cor(df$evidence_score_mean,
+ df$specificity)
[1] -0.01300368
This is confirmed when visualizing as a scatter plot:
library(ggplot2)
ggplot(df, aes(x = evidence_score_mean, y = specificity)) +
  geom_point(alpha = .1) +
  # geom_density2d_filled(adjust=.5) +
  geom_smooth()
And again after log-transforming both axes (to try and improve visibility):
library(ggplot2)
ggplot(df, aes(x = log(evidence_score_mean), y = log(specificity))) +
  geom_point(alpha = .1) +
  # geom_density2d_filled(adjust=.5) +
  geom_smooth()
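Since the evidence scores are ordinal, a rank-based correlation with a p-value may be a useful complement to the Pearson `cor()` calls above. The following is only a sketch on synthetic data: `toy` is a made-up stand-in for the aggregated `df`, and none of the values come from the real results.

```r
# Synthetic stand-in for the aggregated `df` (values are made up).
set.seed(1)
toy <- data.frame(
  evidence_score_mean = sample(0:6, 500, replace = TRUE),
  specificity = runif(500)
)

# Spearman rank correlation; exact = FALSE because integer scores produce ties.
ct <- stats::cor.test(toy$evidence_score_mean,
                      toy$specificity,
                      method = "spearman",
                      exact = FALSE)
print(ct$estimate)  # Spearman's rho
print(ct$p.value)
```

With the real data, `toy` would be replaced by `df`, keeping the same column names.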
We could also try a different strategy, and look at the symptom enrichment results (Note: we only have the Fisher's exact enrichment test results atm, waiting on the updated HPO gene annotations to rerun with EWCE).
res <- MultiEWCE::load_example_results("Descartes_All_Results_extras.symptoms.full_join.rds")
res <- HPOExplorer::phenos_to_granges(phenos = res,
                                      as_datatable = TRUE)
# res <- HPOExplorer::add_genes(res, allow.cartesian = TRUE)
res <- HPOExplorer::add_evidence(phenos = res)
ctd_out <- MultiEWCE::add_ctd(results = res)
res <- ctd_out$results
Then, I aggregate the celltype enrichment results.
df <- res[, list(specificity = mean(specificity),
                 mean_exp = mean(mean_exp),
                 evidence_score_min = mean(evidence_score_min),
                 evidence_score_mean = mean(evidence_score_mean),
                 evidence_score_max = mean(evidence_score_max)),
          by = c("HPO_ID", "CellType")]
Again, there seems to be no relationship between specificity/expression and evidence scores.
> cor(df$evidence_score_mean,
+ df$mean_exp)
[1] -0.006639101
> cor(df$evidence_score_mean,
+ df$specificity)
[1] -0.01384008
Based on the analyses so far, I can't find any evidence of a relationship between celltype specificity (for a given celltype that is enriched in the set of phenotypes that make up a disease) and gene-disease association evidence scores.
While a good first pass, the above strategies may or may not be the best to investigate this question of whether there's a relationship between specificity and evidence scores. In both approaches, there is a mismatch between the level at which the evidence scores are annotated (gene-disease) and our enrichment tests (phenotype-level and symptom-level).
I suppose we could do EWCE tests at the disease level and see whether the genes driving the enrichment (i.e. those with high specificity in that particular celltype) also have high evidence scores. But I'm not sure this would really help us to make any sort of argument about our phenotype-level results.
We could run EWCE on subsampled low confidence genes (for one trait), and test whether the average cell type enrichment is lower for lower confidence genes?
This previous suggestion by @NathanSkene might be another option, but I'll have to think a bit more about how I would implement this.
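One possible shape for the size-matched comparison is sketched below. It is not an EWCE run: mean specificity is used as a simple stand-in for the enrichment statistic, and the `genes` table, its columns, and the group sizes are all synthetic assumptions.

```r
set.seed(42)
# Synthetic gene-level table; `confidence` and `specificity` are made-up columns.
genes <- data.frame(
  gene = paste0("gene", seq_len(300)),
  confidence = c(rep("high", 50), rep("low", 250)),
  specificity = runif(300)
)
high <- genes$specificity[genes$confidence == "high"]
low  <- genes$specificity[genes$confidence == "low"]

# Observed statistic for the (smaller) high-confidence set.
obs <- mean(high)

# Null distribution: mean specificity of low-confidence subsamples,
# each drawn to the same size as the high-confidence set.
n_iter <- 1000
null_means <- replicate(n_iter, mean(sample(low, size = length(high))))

# One-sided empirical p-value: how often a size-matched low-confidence
# subsample is at least as specific as the high-confidence set.
p_emp <- (sum(null_means >= obs) + 1) / (n_iter + 1)
print(p_emp)
```

In a real analysis, the `mean()` calls would presumably be swapped for an EWCE enrichment run on each subsample for a single trait, which is considerably more expensive per iteration.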
The issue
The HPO includes lots of gene-disease/phenotype associations. But these do not contain any information about the current strength of the evidence supporting these associations.
A solution
Peter Robinson alerted me to GenCC which does exactly this. These were produced through manual annotation by experts across academia, medicine, and industry via systematic Delphi studies (which are the gold-standard in these kinds of meta-analyses): https://search.thegencc.org/download
I've gone ahead and integrated this information into the HPO annotations via a new function:
HPOExplorer::add_evidence
Assessments
Classification types
These evidence scores can be added by multiple users/companies, so I codified the evidence classifications into numeric scores and aggregated them per gene-disease association.
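As a rough illustration of that codify-then-aggregate step, here is a small base R sketch. The label-to-score mapping and the `submissions` table are assumptions made for the example; the actual mapping used by `HPOExplorer::add_evidence` may differ.

```r
# Hypothetical mapping from classification labels to numeric scores (assumed values).
score_map <- c("Definitive" = 6, "Strong" = 5, "Moderate" = 4,
               "Supportive" = 3, "Limited" = 2)

# Synthetic submissions: several submitters may classify the same gene-disease pair.
submissions <- data.frame(
  disease = c("D1", "D1", "D1", "D2", "D2"),
  gene = c("G1", "G1", "G2", "G1", "G1"),
  classification = c("Definitive", "Moderate", "Supportive", "Limited", "Strong"),
  stringsAsFactors = FALSE
)
submissions$score <- unname(score_map[submissions$classification])

# Aggregate to one row per gene-disease association: min, max and mean score.
agg <- aggregate(score ~ disease + gene, data = submissions,
                 FUN = function(s) c(min = min(s), max = max(s), mean = mean(s)))
agg <- do.call(data.frame, agg)  # flatten the matrix column into three columns
names(agg) <- c("disease", "gene",
                "evidence_score_min", "evidence_score_max", "evidence_score_mean")
print(agg)
```

Keeping the min, max and mean side by side lets a later filter choose how strict to be when submitters disagree.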
We can see the distribution of evidence scores across all HPO annotations:
Most annotations are "Supportive". I can't find an explanation of this term on their site, but it is classified as being just below "Moderate". https://search.thegencc.org/
HPO coverage
The GenCC annotations cover a surprisingly large proportion of the HPO annotations. This means they can be quite useful to us for filtering the potential gene therapy targets:
Limitations
Note, since these evidence scores are only provided at the disease-level, we can't assign these scores to specific phenotypes associated with those diseases. This means having to assign the same evidence score to each gene-phenotype association within a given gene-disease association.
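To make this limitation concrete, the sketch below joins disease-level scores onto gene-phenotype rows. All table and column names here are hypothetical toy data, not the real HPO or GenCC schemas.

```r
# Disease-level evidence scores, one row per gene-disease association (toy data).
gene_disease <- data.frame(
  disease = c("D1", "D1"),
  gene = c("G1", "G2"),
  evidence_score_mean = c(5, 3)
)

# Gene-phenotype annotations nested within diseases (toy data).
gene_phenotype <- data.frame(
  disease = c("D1", "D1", "D1", "D2"),
  gene = c("G1", "G1", "G2", "G3"),
  phenotype = c("P1", "P2", "P1", "P3")
)

# Left join: every phenotype row inherits the score of its gene-disease pair;
# pairs without a score (here D2/G3) get NA.
annotated <- merge(gene_phenotype, gene_disease,
                   by = c("disease", "gene"), all.x = TRUE)
print(annotated)
```

Both phenotypes of the D1/G1 pair end up with the same score, which is exactly the lack of phenotype-level resolution described above.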
But since most phenotypes are associated with a disease via a couple of genes, the resulting number of combinations isn't too large. This combinatorics issue was actually due to some diseases/gene symbols having multiple unique entries in the following columns: "disease_curie", "disease_original_curie", "disease_title", "gene_curie"
There were also a couple of duplicate entries per "DiseaseID" in the `annot <- load_phenotype_to_genes(3)` and `add_disease()` outputs. Removing these columns from the `agg_by` arg resolved the issue. Still though, these evidence scores are limited in the sense that they are not phenotype-specific (only disease-specific).