monarch-initiative / phenoCompare

Phenotype Compare
BSD 3-Clause "New" or "Revised" License

similarity calculation for sets of phenotypes #22

Open hannahblau opened 5 years ago

hannahblau commented 5 years ago

Which similarity function should we implement for comparing sets of HPO terms in the gsppc (gene set phenotypic profile comparison) R notebook? The first version used Jaccard similarity. We could re-implement one of the similarity functions from: Yue Deng, Lin Gao, Bingbo Wang, Xingli Guo. "HPOSim: An R Package for Phenotypic Similarity Measure and Enrichment Analysis Based on the Human Phenotype Ontology." PLOS ONE, 9 February 2015.

However, I don't think this paper has a good metric for comparing sets qua sets. They propose computing the pairwise similarity of every pair of elements (one from the first set, one from the second set) and then choosing the max similarity or the mean or...
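To make the contrast concrete, here is a minimal Python sketch of the two styles of set comparison mentioned above: Jaccard similarity on the sets themselves, and an aggregate (max or mean) over all pairwise term similarities. The `exact_match` term similarity is only a placeholder assumption, and the term ids are illustrative; neither is taken from HPOSim or from the gsppc notebook.

```python
from itertools import product
from statistics import mean


def jaccard(a, b):
    """Jaccard similarity: |A intersect B| / |A union B| (0.0 if both are empty)."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def pairwise_aggregate(a, b, term_sim, agg=max):
    """Score every (x, y) pair with term_sim, then reduce with agg (max, mean, ...)."""
    scores = [term_sim(x, y) for x, y in product(a, b)]
    return agg(scores) if scores else 0.0


def exact_match(x, y):
    """Placeholder term similarity (assumption): 1.0 for identical terms, else 0.0."""
    return 1.0 if x == y else 0.0


# Illustrative term ids only.
set_a = {"HP:0001250", "HP:0001263", "HP:0000252"}
set_b = {"HP:0001250", "HP:0004322"}

jaccard_score = jaccard(set_a, set_b)
max_score = pairwise_aggregate(set_a, set_b, exact_match, max)
mean_score = pairwise_aggregate(set_a, set_b, exact_match, mean)
```

With one shared term, the max aggregate is already 1.0 even though the sets differ in most of their members, which illustrates why a max over pairs says little about the sets as sets.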

hannahblau commented 5 years ago

Another choice would be to redefine what sets of phenotypes we choose to compare. The first version included all phenotypes reported in case studies plus all the HPO ancestors of those phenotypes. Maybe we want to keep only the original phenotypes, or limit how far up the hierarchy we are willing to go. Only add the parents of the original term to the set of phenotypes? Parents and grandparents?
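A rough Python sketch of limiting how far up the hierarchy the expansion goes. The `parents` dictionary and the single-letter term names are hypothetical stand-ins for the is_a links one would read from hp.obo; `max_levels=0` keeps only the original terms, `1` adds parents, `2` adds parents and grandparents, and `None` adds all ancestors as in the first version.

```python
def expand_with_ancestors(terms, parents, max_levels=None):
    """Return terms plus their ancestors, climbing at most max_levels levels.

    parents maps a term id to an iterable of its direct parent ids.
    max_levels=None climbs all the way to the root.
    """
    expanded = set(terms)
    frontier = set(terms)
    level = 0
    while frontier and (max_levels is None or level < max_levels):
        next_frontier = set()
        for term in frontier:
            next_frontier.update(parents.get(term, ()))
        # Only keep terms that have not been seen yet.
        frontier = next_frontier - expanded
        expanded |= frontier
        level += 1
    return expanded


# Hypothetical four-level chain: D is_a C is_a B is_a A (root).
toy_parents = {"D": ["C"], "C": ["B"], "B": ["A"], "A": []}

only_original = expand_with_ancestors({"D"}, toy_parents, max_levels=0)
with_parents = expand_with_ancestors({"D"}, toy_parents, max_levels=1)
with_grandparents = expand_with_ancestors({"D"}, toy_parents, max_levels=2)
all_ancestors = expand_with_ancestors({"D"}, toy_parents)
```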

pnrobinson commented 5 years ago

After looking at the new annotations, I think that we could do the following:

  1. Define the target set as the union of all of the terms in the GPI disease definitions (i.e., the main HPO files).
  2. Take the information content of each of the terms based on their frequency in the entire HPO database
  3. For each of the terms in the GPI-target-protein diseases, find the best match (most informative common ancestor) in the first set. Record its information content, and take the sum of the ICs for all terms. Then, take the sum of the ICs for all of the GPI-target-protein diseases.
  4. For the randomization, we could either define the target set randomly (i.e., use random diseases instead of GPI diseases) or the other set randomly.
  5. This way of comparing the sets will take the specificity of the matches into account much more than the Jaccard approach.
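
One possible reading of steps 2 and 3 above, as a Python sketch. It assumes IC(t) = -ln(fraction of diseases annotated with t or one of its descendants), and it scores each query term by the information content of its most informative common ancestor (MICA) with any term in the target set. The toy ontology, the disease annotations, and all function names are hypothetical; the real inputs would come from hp.obo and the HPO annotation files.

```python
import math


def ancestors(term, parents):
    """Return the term itself plus all of its ancestors."""
    seen = {term}
    stack = [term]
    while stack:
        for parent in parents.get(stack.pop(), ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def information_content(annotations, parents):
    """IC(t) = -ln(diseases annotated with t or a descendant / all diseases)."""
    n_diseases = len(annotations)
    counts = {}
    for terms in annotations.values():
        induced = set()
        for term in terms:
            induced |= ancestors(term, parents)
        for term in induced:
            counts[term] = counts.get(term, 0) + 1
    return {t: -math.log(c / n_diseases) for t, c in counts.items()}


def best_match_ic(term, target_set, parents, ic):
    """IC of the most informative common ancestor of term and any target term."""
    term_ancestors = ancestors(term, parents)
    best = 0.0
    for target in target_set:
        for common in term_ancestors & ancestors(target, parents):
            best = max(best, ic.get(common, 0.0))
    return best


def summed_best_match_ic(query_terms, target_set, parents, ic):
    """Sum of best-match ICs over all query terms."""
    return sum(best_match_ic(t, target_set, parents, ic) for t in query_terms)


# Hypothetical toy ontology: root -> A, B; A -> A1, A2.
toy_parents = {"A": ["root"], "B": ["root"], "A1": ["A"], "A2": ["A"]}
# Hypothetical disease annotations.
toy_annotations = {"d1": {"A1"}, "d2": {"A2"}, "d3": {"B"}, "d4": {"A1", "B"}}

ic = information_content(toy_annotations, toy_parents)
score = summed_best_match_ic({"A2", "B"}, {"A1", "B"}, toy_parents, ic)
```

Here `A2` only matches the target set through the fairly common ancestor `A`, so it contributes less than the exact match on `B`. For the randomization in step 4, the same function could be applied to randomly drawn target or query sets.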

LCCarmody commented 5 years ago

Random interjection, but I think this may have an effect on the results. Is there an argument to remove certain annotations for comparison? Specifically, I have noticed that "Autosomal recessive" or "Autosomal dominant" or "early onset" are listed as annotations. While this is important information for the disease, is it reasonable to use these as phenotypes?

hannahblau commented 5 years ago

I noticed those "phenotypes" myself but did not know what to do with them. Currently there is no filtering on the phenotypes: they are drawn from the annotation files available at http://compbio.charite.de/jenkins/job/hpo.annotations.monthly/lastSuccessfulBuild/artifact/annotation/ and I did not think I should second-guess what is listed as an annotation, even though some of them are really modes of inheritance. I could filter these out by locating the common ancestor of the modes of inheritance and discarding any descendant of that HPO term. It would be an extra step, but I think not a difficult one.

LCCarmody commented 5 years ago

If Peter agrees, I think it would be a good idea to filter out any annotations that are not under "Phenotypic Abnormality". I think that would eliminate all of these annotations. @pnrobinson What do you think?

pnrobinson commented 5 years ago

Leigh's idea is good, we should just use terms under "Phenotypic Abnormality".
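A minimal Python sketch of that filter, assuming the is_a links from hp.obo have already been loaded into a `parents` dictionary. HP:0000118 is the HPO id for "Phenotypic abnormality"; the small hierarchy below is a hand-written, simplified fragment for illustration only, and the function names are hypothetical.

```python
PHENOTYPIC_ABNORMALITY = "HP:0000118"


def is_under(term, root, parents):
    """True if term is root or has root among its ancestors."""
    stack, seen = [term], set()
    while stack:
        current = stack.pop()
        if current == root:
            return True
        if current in seen:
            continue
        seen.add(current)
        stack.extend(parents.get(current, ()))
    return False


def keep_phenotypic_abnormalities(terms, parents):
    """Drop every term that is not under Phenotypic abnormality."""
    return {t for t in terms if is_under(t, PHENOTYPIC_ABNORMALITY, parents)}


# Simplified, hand-written fragment (assumption); read real links from hp.obo.
toy_parents = {
    "HP:0000118": ["HP:0000001"],  # Phenotypic abnormality -> All
    "HP:0000005": ["HP:0000001"],  # Mode of inheritance -> All
    "HP:0000007": ["HP:0000005"],  # Autosomal recessive inheritance
    "HP:0000006": ["HP:0000005"],  # Autosomal dominant inheritance
    "HP:0001250": ["HP:0000118"],  # a phenotype term, attached directly for brevity
}

annotated = {"HP:0000007", "HP:0000006", "HP:0001250"}
filtered = keep_phenotypic_abnormalities(annotated, toy_parents)
```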

hannahblau commented 5 years ago

Per discussion with @pnrobinson, the file format is:

  1. col1 = OMIM id of a disease
  2. col2 = EntrezGene id of the corresponding disease gene (there may be multiple entries for one disease)
  3. col3 = whether the entry is a disease ("phenotype") or something else (e.g., gene)

We are only interested in the phenotype rows:

103780 125 phenotype

The file comes from the NCBI website (MedGen):

private final static String MIM2GENE_MEDGEN_URL = "ftp://ftp.ncbi.nlm.nih.gov/gene/DATA/mim2gene_medgen";

Parse the lists of GPI and GPI target genes and use the mim2gene_medgen file to get the corresponding disease ids. Use the OMIM website to check correctness.

  1. Get the phenotype.hpoa from here http://compbio.charite.de/jenkins/job/hpo.annotations.2018/
  2. Get the latest hp.obo from here https://hpo.jax.org/app/download/ontology
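
The gene-to-disease lookup described above could be sketched in Python roughly as follows. It assumes the file is tab-separated with the three columns listed earlier and that header lines start with `#`; the function name and the second sample row are made up for illustration.

```python
import csv
from collections import defaultdict


def gene_to_omim(lines, gene_ids):
    """Map each EntrezGene id of interest to the OMIM ids of its phenotype rows."""
    mapping = defaultdict(set)
    for row in csv.reader(lines, delimiter="\t"):
        # Skip blank lines, header/comment lines, and truncated rows.
        if not row or row[0].startswith("#") or len(row) < 3:
            continue
        mim_id, gene_id, entry_type = row[0], row[1], row[2]
        if entry_type == "phenotype" and gene_id in gene_ids:
            mapping[gene_id].add("OMIM:" + mim_id)
    return dict(mapping)


sample_lines = [
    "103780\t125\tphenotype",  # row quoted in this thread
    "999999\t125\tgene",       # made-up non-phenotype row, ignored
]

diseases_by_gene = gene_to_omim(sample_lines, {"125"})
```

In practice `lines` would be an open file handle on the downloaded mim2gene_medgen file, and `gene_ids` the EntrezGene ids of the GPI pathway and GPI-anchored genes.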

hannahblau commented 5 years ago

@pnrobinson @LCCarmody Three genes from the GPI pathway/anchored groups have associated OMIM entries listed in the mim2gene_medgen file, but the OMIM ids do not appear in phenotype.hpoa. They are:

| Gene | Group | Disease |
|---|---|---|
| PIGS (ENTREZ:94005) | Pathway | OMIM:618143 |
| ART4 (ENTREZ:420) | Anchored | OMIM:616060 |
| SEMA7A (ENTREZ:8482) | Anchored | OMIM:614745 |

Here are the corresponding lines in the mim2gene_medgen file:

618143            94005  phenotype       GeneMap       NULL    -
616060            420      phenotype       GeneMap       C1292294        nondisease
614745            8482    phenotype       GeneMap       C3553633        nondisease

I decided to ignore any line of mim2gene_medgen that contains "nondisease" in the sixth column. If we ever expand our interest in mim2gene_medgen beyond the GPI genes, we might have to handle the other values that occur in the sixth column:

$ cut -f6 mim2gene_medgen | sort | uniq
-
Comment
nondisease
nondisease; QTL 2
nondisease; QTL 2; susceptibility
nondisease; nondisease; QTL 2
nondisease; question
nondisease; susceptibility
question
susceptibility
susceptibility; QTL 1
susceptibility; modifier
susceptibility; modifier; question
susceptibility; question
susceptibility; somatic
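
A small Python sketch of the "nondisease" rule, assuming tab-separated lines with at least six columns and treating the sixth column as a semicolon-separated list of flags, as the values above suggest. The function names are hypothetical; the sample rows are the three lines quoted earlier in this comment.

```python
def is_nondisease(comment):
    """True if the sixth-column value carries the 'nondisease' flag."""
    flags = [flag.strip() for flag in comment.split(";")]
    return "nondisease" in flags


def disease_rows(lines):
    """Yield the fields of phenotype rows that are not flagged as nondisease."""
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        # Skip header/comment lines and rows without a sixth column.
        if line.startswith("#") or len(fields) < 6:
            continue
        if fields[2] == "phenotype" and not is_nondisease(fields[5]):
            yield fields


sample_lines = [
    "618143\t94005\tphenotype\tGeneMap\tNULL\t-",
    "616060\t420\tphenotype\tGeneMap\tC1292294\tnondisease",
    "614745\t8482\tphenotype\tGeneMap\tC3553633\tnondisease",
]

kept_mim_ids = [fields[0] for fields in disease_rows(sample_lines)]
```

Splitting on `;` rather than testing for a substring keeps the door open for treating the other flags (susceptibility, question, modifier, and so on) individually later.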