How can we get data on the prevalence of a particular phenotype?

bschilder commented 1 year ago

"Frequency" is a column in the phenotype-to-disease annotations file from HPO. https://hpo-annotation-qc.readthedocs.io/en/latest/annotationFormat.html

Frequency: There are three allowed options for this field. (A) A term ID from the HPO sub-ontology below the term "Frequency" (HP:0040279) (since December 2016; before that it was a mixture of values). The terms for frequency are in alignment with Orphanet. (B) A count of patients affected within a cohort. For instance, 7/13 would indicate that 7 of the 13 patients with the specified disease were found to have the phenotypic abnormality referred to by the HPO term in question in the study referred to by the DB_Reference. (C) A percentage value such as 17%.
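As a rough illustration of the three formats above, here is a minimal base-R sketch that converts raw Frequency strings into percentage ranges. The helper name `parse_frequency` and the lookup table `hpo_freq_ranges` are hypothetical (they are not HPOExplorer functions), and the percentage bounds assigned to each HPO frequency term are my reading of the HPO/Orphanet definitions, so treat them as assumptions.

```r
# Assumed percentage ranges for terms under "Frequency" (HP:0040279).
hpo_freq_ranges <- data.frame(
  id   = c("HP:0040280", "HP:0040281", "HP:0040282",
           "HP:0040283", "HP:0040284", "HP:0040285"),
  name = c("Obligate", "Very frequent", "Frequent",
           "Occasional", "Very rare", "Excluded"),
  min  = c(100, 80, 30, 5, 1, 0),
  max  = c(100, 99, 79, 29, 4, 0),
  stringsAsFactors = FALSE
)

# Hypothetical helper: convert raw Frequency strings to min/max/mean percentages.
parse_frequency <- function(x) {
  x <- trimws(as.character(x))
  out <- data.frame(raw = x, min = NA_real_, max = NA_real_,
                    stringsAsFactors = FALSE)

  # (A) HPO frequency term IDs: look up the assumed range.
  idx <- match(x, hpo_freq_ranges$id)
  is_term <- !is.na(idx)
  out$min[is_term] <- hpo_freq_ranges$min[idx[is_term]]
  out$max[is_term] <- hpo_freq_ranges$max[idx[is_term]]

  # (B) Patient counts written as "m/n": convert to a single percentage.
  is_frac <- grepl("^[0-9]+/[0-9]+$", x)
  parts <- strsplit(x[is_frac], "/", fixed = TRUE)
  frac <- vapply(parts,
                 function(p) 100 * as.numeric(p[1]) / as.numeric(p[2]),
                 numeric(1))
  out$min[is_frac] <- frac
  out$max[is_frac] <- frac

  # (C) Percentages written as "17%": strip the sign.
  is_pct <- grepl("^[0-9.]+%$", x)
  pct <- as.numeric(sub("%$", "", x[is_pct]))
  out$min[is_pct] <- pct
  out$max[is_pct] <- pct

  # Midpoint of the range; stays NA for empty or unrecognised values.
  out$mean <- (out$min + out$max) / 2
  out
}

parsed <- parse_frequency(c("HP:0040283", "7/13", "17%", ""))
print(parsed)
```

Empty or unrecognised entries are left as `NA` rather than guessed, since the column may be empty in the real file.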

Here are some example values: [Screenshot: example Frequency values]

But this describes how often patients with a given disease also have the HPO phenotype. To get the overall prevalence of a phenotype in the wider population, we'd have to gather data from another resource.

bschilder commented 1 year ago

I think "FrequencyHPO" means something a bit different in this file:

 g2p <- HPOExplorer::load_phenotype_to_genes("genes_to_phenotype.txt")

After translating this column from IDs to names, it shows: [Screenshot: FrequencyHPO values as term names]

Can't find much documentation on this on HPO's site, but did find this: https://hpo-annotation-qc.readthedocs.io/en/latest/smallfile.html

frequency. This column can be one of three formats: A valid HPO term from the frequency subontology, a fractional expression m/n (e.g., 4/7 meaning that 4 of 7 individuals in the cited study had the disease and the feature in question, while the feature was ruled out in the remaining 3 of 7 individuals); or a percentage value such as 47%. This column may be empty.

These frequencies appear to be gene-specific, as aggregating them by Phenotype shows multiple frequencies per Phenotype. In other words, I'm interpreting these frequencies as "how frequently is a mutation in this gene associated with this phenotype?"
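The check described above (more than one frequency value per phenotype) can be sketched in base R as follows. The data frame is a toy stand-in for the gene-to-phenotype table: its column names and all of its values are invented for illustration and are not taken from the real file.

```r
# Toy stand-in for the gene-to-phenotype table (hypothetical columns and values).
g2p_toy <- data.frame(
  Phenotype    = c("Seizure", "Seizure", "Seizure", "Ataxia"),
  Gene         = c("GENE_A", "GENE_B", "GENE_C", "GENE_A"),
  FrequencyHPO = c("HP:0040281", "HP:0040283", "HP:0040281", "HP:0040282"),
  stringsAsFactors = FALSE
)

# Number of distinct frequency values observed for each phenotype.
n_freqs <- tapply(g2p_toy$FrequencyHPO, g2p_toy$Phenotype,
                  function(f) length(unique(f)))
print(n_freqs)

# Phenotypes with more than one distinct frequency suggest that the
# frequency is attached to the gene-phenotype pair, not the phenotype alone.
multi_freq <- names(n_freqs)[n_freqs > 1]
print(multi_freq)
```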

So this is still useful for prioritising putative gene targets, as genes with mutations that occur in a larger % of the disease population will have a bigger impact (and are more financially feasible for pharma companies). [Screenshot: gene-level frequencies]

I've parsed this further to get frequency ranges.

[Screenshot: parsed frequency ranges]

I can also aggregate the gene frequencies to the phenotype level, though I'm not sure exactly what this would tell us. Perhaps something like "% of the time that any known genes are associated with the phenotype". [Screenshot: gene frequencies aggregated by phenotype]

bschilder commented 1 year ago

Another way to get phenotype prevalence is from the HPO annotations file:

annot <- load_phenotype_to_genes("phenotype.hpoa")

[Screenshot: HPO annotations file with Frequency column]

In general, this tells us how frequently a phenotype occurs within a cohort of individuals with a given disease. So if we compute the mean frequency per phenotype, it tells us "within all known diseases where this phenotype occurs, what is the average frequency of this phenotype?"
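A minimal sketch of that per-phenotype average, assuming the Frequency column has already been converted to a numeric percentage. The data frame, its column names (`DiseaseID`, `PhenotypeID`, `freq_pct`), and its values are all hypothetical.

```r
# Toy disease-to-phenotype annotations with a numeric frequency (in %).
annot_toy <- data.frame(
  DiseaseID   = c("DISEASE_1", "DISEASE_2", "DISEASE_3", "DISEASE_1", "DISEASE_2"),
  PhenotypeID = c("PHENO_1", "PHENO_1", "PHENO_1", "PHENO_2", "PHENO_2"),
  freq_pct    = c(90, 50, 10, 20, 40),
  stringsAsFactors = FALSE
)

# Mean frequency of each phenotype across the diseases it is annotated to.
pheno_mean <- aggregate(freq_pct ~ PhenotypeID, data = annot_toy, FUN = mean)
print(pheno_mean)
```

Note that this is an unweighted mean over diseases: a rare disease and a common one count equally, so the result is still not a population-level prevalence.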

This gives us a roughly normal distribution of phenotype frequency within diseases.

[Screenshot: distribution of mean phenotype frequency within diseases]

bschilder commented 1 year ago

To save time, I've stored the parsed phenotype frequencies as a built-in dataset in HPOExplorer: hpo_frequency

Also, I've added 2 new functions to add frequency-related info to a given dataframe of HPO phenotypes:

add_gene_frequency: frequency of genes within a given phenotype.
add_pheno_frequency: frequency of a phenotype within diseases.

neurogenomics / RareDiseasePrioritisation
