monarch-initiative / phenoCompare

Phenotype Compare
BSD 3-Clause "New" or "Revised" License
1 stars 1 forks source link

GPI comparison #10

Closed pnrobinson closed 6 years ago

pnrobinson commented 6 years ago
  1. Goal -- take the set of all "GPI anchor genes" and compare the associated phenotypes with those of the "GPI anchored gene/proteins"
  2. Use the Jaccard-similarioty metric between sets of ontology terms.
  3. To get the HPO terms that are associated with genes, do not take the phenotypes from our biocurated patients, instead take them from this list
    http://compbio.charite.de/jenkins/job/hpo.annotations.monthly/lastSuccessfulBuild/artifact/annotation/ALL_SOURCES_ALL_FREQUENCIES_genes_to_phenotype.txt
  4. Use a simulation to determine whether the Jaccard similarity is significant
  5. Get "random" protein-coding genes from the gene_info ftp://ftp.ncbi.nih.gov/gene/DATA/ and calculate the similarity between the GPI anchor genes and these sets N number of times...the p value is the number of times the similarity was as least as strong divided by N.
pnrobinson commented 6 years ago

gene_info example

9606    646197  LOC646197   -   -   -   8   8q21.11 heat shock protein 90kDa alpha family class B member 1 pseudogene   pseudo  -   -   -   heat shock protein 90kDa alpha (cytosolic), class B member 1 pseudogene 20170408    -

9606=taxon id for human (first column) 646197=gene id LOC646197=gene symbol heat shock protein 90kDa alpha family class B member 1 pseudogene =long description pseudo=gene type

Extract the genes we need:

$ zgrep ^9606 gene_info.gz  | grep protein-coding | cut -f1,2,3 > human_protein_coding_genes.tsv
robinp@ldg-jgm004:~/data/ncbi$ wc -l human_protein_coding_genes.tsv 
20456 human_protein_coding_genes.tsv
pnrobinson commented 6 years ago

Do we need genes_to_phenotypes.txt OR phenotype_to_genes.txt

see: http://human-phenotype-ontology.github.io/faq.html

pnrobinson commented 6 years ago

New function in phenol

public static Set<TermId> getAncestorTerms(
      Ontology<? extends Term, ? extends Relationship> ontology,
      Set<TermId> children,
      boolean includeOriginalTerm) {
    ImmutableSet.Builder<TermId> builder = new ImmutableSet.Builder<>();
    if (includeOriginalTerm) builder.addAll(children);
    Stack<TermId> stack = new Stack<>();
    Set<TermId> parents = getParentTerms(ontology, children, false);
    for (TermId t : parents) stack.push(t);
    while (!stack.empty()) {
      TermId tid = stack.pop();
      builder.add(tid);
      Set<TermId> prnts = getParentTerms(ontology, tid, false);
      for (TermId t : prnts) stack.push(t);
    }
    return builder.build();
  }
hannahblau commented 6 years ago

ftp://ftp.ncbi.nih.gov/gene/DATA/GENE_INFO/Mammalia/Homo_sapiens.gene_info.gz has subset of gene_info for 9606, same format as gene_info.gz

choose set of genes at random but what size set? size of the (as yet to be determined) GPI anchored gene/proteins?

if I'm getting the list of associated HPO terms from ALL_SOURCES_ALL_FREQUENCIES_genes_to_phenotype.txt, I can then find all the ancestors of those phenotypes using the getAncestorTerms method of phenol. Or, get info from ALL_SOURCES_ALL_FREQUENCIES_phenotype_to_genes.txt instead, which already takes account of the ontology. Would have to parse the entire file, then select out the lines relating to the genes of interest and collect all the HPO terms from those lines.

hannahblau commented 6 years ago

PGAP5 does not appear in the NCBI list of human protein-coding genes because it is a synonym for MPPE1, id 65258. https://www.ncbi.nlm.nih.gov/gene/?term=PGAP5%5Bsym%5D

hannahblau commented 6 years ago

Implemented in R notebook see folder gsppc