Open hannahblau opened 5 years ago
Another choice would be to redefine what sets of phenotypes we choose to compare. The first version included all phenotypes reported in case studies plus all the HPO ancestors of those phenotypes. Maybe we want to keep only the original phenotypes, or limit how far up the hierarchy we are willing to go. Only add the parents of the original term to the set of phenotypes? Parents and grandparents?
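To make the "how far up the hierarchy" options concrete, here is a minimal sketch of expanding a phenotype set with ancestors up to a chosen number of levels. The `PARENTS` map is a made-up toy hierarchy, not real HPO data, and `expand_with_ancestors` is a hypothetical helper name. `max_levels=0` keeps only the original terms, `1` adds parents, `2` adds parents and grandparents, and `None` adds all ancestors as in the first version.

```python
from collections import deque

# Hypothetical toy hierarchy (child -> parents); NOT real HPO data.
PARENTS = {
    "D": ["C"],
    "C": ["B"],
    "B": ["A"],
    "A": [],
}

def expand_with_ancestors(terms, parents, max_levels=None):
    """Return terms plus their ancestors, at most max_levels steps up.

    max_levels=None means no limit (all ancestors).
    """
    expanded = set(terms)
    # Breadth-first walk, so each ancestor is first reached at its
    # smallest distance from any of the original terms.
    queue = deque((t, 0) for t in terms)
    while queue:
        term, level = queue.popleft()
        if max_levels is not None and level >= max_levels:
            continue
        for parent in parents.get(term, []):
            if parent not in expanded:
                expanded.add(parent)
                queue.append((parent, level + 1))
    return expanded
```

Running the comparison once per setting would show how sensitive the results are to the choice.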
After looking at the new annotations, I think that we could do the following
Random interjection, but I think this may have an effect on the results. Is there an argument to remove certain annotations for comparison? Specifically, I have noticed that "Autosomal recessive" or "Autosomal dominant" or "early onset" are listed as annotations. While this is important information for the disease, is it reasonable to use these as phenotypes?
I noticed those "phenotypes" myself but did not know what to do with them. Currently there is no filtering on the phenotypes. They are drawn from the annotation files available at http://compbio.charite.de/jenkins/job/hpo.annotations.monthly/lastSuccessfulBuild/artifact/annotation/ and I did not think I should second-guess what's listed as an annotation, even though some of them are really modes of inheritance. I could filter these out by locating the common ancestor of the modes of inheritance and then tossing out any descendant of that HPO term. It would be an extra step, but I think not a difficult one.
If Peter agrees, I think it would be a good idea to filter out any annotations that are not under "Phenotypic Abnormality". I think that would eliminate all of these annotations. @pnrobinson What do you think?
Leigh's idea is good, we should just use terms under "Phenotypic Abnormality".
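A minimal sketch of that filter, assuming we have a child-to-parents map for the ontology. The HPO IDs below are real term IDs (HP:0000118 is "Phenotypic abnormality", HP:0000005 is "Mode of inheritance"), but the edges are simplified for illustration and do not reproduce the real hierarchy. `is_under` and `keep_phenotypic` are hypothetical helper names.

```python
PHENOTYPIC_ABNORMALITY = "HP:0000118"

# Simplified toy edges (child -> parents); the real HPO has more levels.
PARENTS = {
    "HP:0000118": ["HP:0000001"],  # Phenotypic abnormality -> All
    "HP:0000005": ["HP:0000001"],  # Mode of inheritance -> All
    "HP:0000007": ["HP:0000005"],  # Autosomal recessive inheritance
    "HP:0000006": ["HP:0000005"],  # Autosomal dominant inheritance
    "HP:0001250": ["HP:0000118"],  # Seizure (intermediate terms omitted)
}

def is_under(term, root, parents):
    """True if term equals root or has root among its ancestors."""
    stack, seen = [term], set()
    while stack:
        current = stack.pop()
        if current == root:
            return True
        if current in seen:
            continue
        seen.add(current)
        stack.extend(parents.get(current, []))
    return False

def keep_phenotypic(terms, parents, root=PHENOTYPIC_ABNORMALITY):
    """Keep only terms in the subtree rooted at root."""
    return {t for t in terms if is_under(t, root, parents)}
```

Keeping the subtree under one root avoids having to enumerate every non-phenotype branch (inheritance, onset, and so on) separately.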
Per discussion with @pnrobinson, the file format is: col1 = OMIM id of a disease; col2 = EntrezGene id of the corresponding disease gene (there may be multiple entries for one disease); col3 = whether the entry is a disease ("phenotype") or something else (e.g., gene). We are only interested in the phenotype rows:
103780 125 phenotype
The file is on the NCBI FTP site (MedGen): `private final static String MIM2GENE_MEDGEN_URL = "ftp://ftp.ncbi.nlm.nih.gov/gene/DATA/mim2gene_medgen";`
Parse the lists of GPI pathway genes and GPI target genes, and use the `mim2gene_medgen` file to get the corresponding disease ids. Use the OMIM website to check correctness.
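A rough sketch of that lookup, assuming the file is tab-separated with the column order described above. The sample rows are copied from this thread and the header row is included so the `#` skip is exercised. `gene_to_omim` is a hypothetical helper name, and gene ids are handled as strings.

```python
from collections import defaultdict

# Sample rows quoted in this thread, assumed to be tab-separated.
SAMPLE_LINES = [
    "#MIM number\tGeneID\ttype\tSource\tMedGenCUI\tComment",
    "103780\t125\tphenotype",
    "618143\t94005\tphenotype\tGeneMap\tNULL\t-",
    "616060\t420\tphenotype\tGeneMap\tC1292294\tnondisease",
    "614745\t8482\tphenotype\tGeneMap\tC3553633\tnondisease",
]

def gene_to_omim(lines, gene_ids):
    """Map each requested Entrez gene id to its OMIM phenotype ids."""
    wanted = {str(g) for g in gene_ids}
    result = defaultdict(set)
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and the header
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 3:
            continue  # malformed row
        mim, gene, row_type = fields[0], fields[1], fields[2]
        # Only "phenotype" rows describe diseases.
        if row_type == "phenotype" and gene in wanted:
            result[gene].add("OMIM:" + mim)
    return dict(result)
```

The values are sets because one gene can be linked to several diseases.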
@pnrobinson @LCCarmody Three genes from the GPI pathway/anchored groups have associated OMIM entries listed in the `mim2gene_medgen` file, but the OMIM ids do not appear in phenotype.hpoa. They are:
Gene | Group | Disease |
---|---|---|
PIGS (ENTREZ:94005) | Pathway | OMIM:618143 |
ART4 (ENTREZ:420) | Anchored | OMIM:616060 |
SEMA7A (ENTREZ:8482) | Anchored | OMIM:614745 |
Here are the corresponding lines in the `mim2gene_medgen` file:
618143 94005 phenotype GeneMap NULL -
616060 420 phenotype GeneMap C1292294 nondisease
614745 8482 phenotype GeneMap C3553633 nondisease
Decided to ignore any line of `mim2gene_medgen` that contains "nondisease" in the sixth column. If we ever expand our interest in `mim2gene_medgen` beyond the GPI genes, we might have to worry about the other values that appear in the sixth column:
$ cut -f6 mim2gene_medgen | sort | uniq
-
Comment
nondisease
nondisease; QTL 2
nondisease; QTL 2; susceptibility
nondisease; nondisease; QTL 2
nondisease; question
nondisease; susceptibility
question
susceptibility
susceptibility; QTL 1
susceptibility; modifier
susceptibility; modifier; question
susceptibility; question
susceptibility; somatic
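Since the sixth column can hold several semicolon-separated values, one way to apply the "nondisease" rule is to split the column into tags and test membership. This is only a sketch under the assumption that the file is tab-separated. `comment_tags` and `is_nondisease` are hypothetical helper names.

```python
def comment_tags(comment):
    """Split a sixth-column value such as 'nondisease; QTL 2' into tags."""
    parts = (part.strip() for part in comment.split(";"))
    # '-' is the placeholder for "no comment", so it is not a tag.
    return {part for part in parts if part and part != "-"}

def is_nondisease(line):
    """True if the row's sixth column carries the 'nondisease' tag."""
    fields = line.rstrip("\n").split("\t")
    return len(fields) >= 6 and "nondisease" in comment_tags(fields[5])

# Two rows quoted earlier in this thread.
rows = [
    "618143\t94005\tphenotype\tGeneMap\tNULL\t-",
    "616060\t420\tphenotype\tGeneMap\tC1292294\tnondisease",
]
kept = [row for row in rows if not is_nondisease(row)]
```

Working with tags rather than the raw string would make it straightforward to add rules later for values such as "susceptibility" or "question".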
What similarity function should we implement for comparing sets of HPO terms in the gsppc (gene set phenotypic profile comparison) R notebook? The first version used Jaccard similarity. We could re-implement one of the similarity functions from: Yue Deng, Lin Gao, Bingbo Wang, Xingli Guo. "HPOSim: An R Package for Phenotypic Similarity Measure and Enrichment Analysis Based on the Human Phenotype Ontology." PLOS ONE, 9 February 2015.
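For reference, Jaccard similarity of two term sets is the size of their intersection divided by the size of their union. A minimal sketch follows, in Python rather than the notebook's R. Returning 0.0 for two empty sets is a convention chosen here, not something taken from the notebook.

```python
def jaccard(a, b):
    """Jaccard similarity |A & B| / |A | B| of two collections of terms."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        return 0.0  # convention chosen here for two empty sets
    return len(a & b) / len(union)
```

For example, `jaccard({"A", "B", "C"}, {"B", "C", "D"})` shares two of four distinct terms and gives 0.5.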
However, I don't think this paper has a good metric for comparing sets qua sets. They propose computing the pairwise similarity of every pair of elements (one from the first set, one from the second set) and then choosing the max similarity or the mean or...
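To illustrate the pairwise approach, here is a small sketch that scores every cross-set pair and then aggregates by max or by mean. The term-level function `exact_match` is a deliberately trivial stand-in (1.0 for identical terms, 0.0 otherwise), not one of the HPOSim measures. All names here are hypothetical.

```python
from itertools import product

def exact_match(t1, t2):
    """Trivial stand-in for a term-level similarity; NOT an HPOSim measure."""
    return 1.0 if t1 == t2 else 0.0

def pairwise_scores(set1, set2, term_sim):
    """Similarity of every (term from set1, term from set2) pair."""
    return [term_sim(a, b) for a, b in product(set1, set2)]

def max_similarity(set1, set2, term_sim):
    """Set similarity as the best single pair."""
    return max(pairwise_scores(set1, set2, term_sim))

def mean_similarity(set1, set2, term_sim):
    """Set similarity as the average over all pairs."""
    scores = pairwise_scores(set1, set2, term_sim)
    return sum(scores) / len(scores)
```

With `{"A", "B"}` and `{"B", "C"}`, the max is 1.0 because of the single shared term, while the all-pairs mean is 0.25. Neither number reflects the overlap of the two sets as a whole, which is the concern raised above.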