Closed by pnrobinson 6 years ago
gene_info example
9606 646197 LOC646197 - - - 8 8q21.11 heat shock protein 90kDa alpha family class B member 1 pseudogene pseudo - - - heat shock protein 90kDa alpha (cytosolic), class B member 1 pseudogene 20170408 -
9606 = taxon id for human (first column)
646197 = gene id
LOC646197 = gene symbol
heat shock protein 90kDa alpha family class B member 1 pseudogene = long description
pseudo = gene type
Extract the genes we need:
$ zgrep ^9606 gene_info.gz | grep protein-coding | cut -f1,2,3 > human_protein_coding_genes.tsv
robinp@ldg-jgm004:~/data/ncbi$ wc -l human_protein_coding_genes.tsv
20456 human_protein_coding_genes.tsv
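Note that `zgrep ^9606` also matches any taxon id that merely begins with 9606, and `grep protein-coding` matches that string anywhere in the line, not only in the gene type column. A stricter, column-based selection could look like the sketch below. It assumes gene type is the 10th tab-separated column, as in the example line above; the class and method names are hypothetical and not part of any existing tool, and the count it produces may differ from the 20456 reported above.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.GZIPInputStream;

/** Sketch only: column-based selection of genes from gene_info lines. */
final class GeneInfoFilter {
  // Assumed 0-based column positions: taxon id, gene id, symbol, gene type.
  static final int TAX_ID = 0, GENE_ID = 1, SYMBOL = 2, TYPE_OF_GENE = 9;

  /** Returns "taxon<TAB>gene id<TAB>symbol" for lines whose taxon and gene type match exactly. */
  static List<String> select(Iterable<String> lines, String taxId, String geneType) {
    List<String> out = new ArrayList<>();
    for (String line : lines) {
      if (line.startsWith("#")) continue;          // skip the header line
      String[] f = line.split("\t", -1);
      if (f.length <= TYPE_OF_GENE) continue;      // skip malformed lines
      if (f[TAX_ID].equals(taxId) && f[TYPE_OF_GENE].equals(geneType)) {
        out.add(String.join("\t", f[TAX_ID], f[GENE_ID], f[SYMBOL]));
      }
    }
    return out;
  }

  /** Convenience wrapper that streams a gzipped file through select(). */
  static List<String> selectFromGzip(Path gz, String taxId, String geneType) throws IOException {
    try (BufferedReader r = new BufferedReader(new InputStreamReader(
        new GZIPInputStream(Files.newInputStream(gz)), StandardCharsets.UTF_8))) {
      return select(r.lines()::iterator, taxId, geneType);
    }
  }
}
```

For example, `GeneInfoFilter.selectFromGzip(Paths.get("gene_info.gz"), "9606", "protein-coding")` would return the same three columns that the `cut -f1,2,3` step writes out.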
Do we need genes_to_phenotype.txt or phenotype_to_genes.txt?
New function in phenol:
public static Set<TermId> getAncestorTerms(
    Ontology<? extends Term, ? extends Relationship> ontology,
    Set<TermId> children,
    boolean includeOriginalTerm) {
  ImmutableSet.Builder<TermId> builder = new ImmutableSet.Builder<>();
  // Optionally include the query terms themselves in the result.
  if (includeOriginalTerm) builder.addAll(children);
  // Depth-first walk towards the root, seeded with the direct parents.
  Stack<TermId> stack = new Stack<>();
  Set<TermId> parents = getParentTerms(ontology, children, false);
  for (TermId t : parents) stack.push(t);
  while (!stack.empty()) {
    TermId tid = stack.pop();
    builder.add(tid); // duplicates are ignored by the set builder
    Set<TermId> prnts = getParentTerms(ontology, tid, false);
    for (TermId t : prnts) stack.push(t);
  }
  return builder.build();
}
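The traversal above may push the same ancestor several times when terms share parents, which is harmless for the result but repeats work. As a rough illustration only, here is the same walk on a plain `Map` of term id to parent ids, with a visited set so each ancestor is expanded once. It does not use the phenol API; `AncestorSketch`, `ancestors` and the `parentsOf` map are hypothetical names for this sketch.

```java
import java.util.ArrayDeque;
import java.util.Collections;
import java.util.Deque;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

/** Sketch only: ancestor collection on a toy child-to-parents map. */
final class AncestorSketch {
  static Set<String> ancestors(Map<String, Set<String>> parentsOf,
                               Set<String> children,
                               boolean includeOriginalTerm) {
    Set<String> found = new HashSet<>();               // ancestors seen so far
    Deque<String> stack = new ArrayDeque<>(children);  // terms still to expand
    while (!stack.isEmpty()) {
      String term = stack.pop();
      for (String parent : parentsOf.getOrDefault(term, Collections.emptySet())) {
        // add() returns false for an ancestor that is already known,
        // so each ancestor is pushed (and later expanded) only once.
        if (found.add(parent)) stack.push(parent);
      }
    }
    if (includeOriginalTerm) found.addAll(children);
    return found;
  }
}
```

The visited set also guarantees termination if the map accidentally contains a cycle, which a well-formed ontology should not.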
ftp://ftp.ncbi.nih.gov/gene/DATA/GENE_INFO/Mammalia/Homo_sapiens.gene_info.gz contains the subset of gene_info for taxon 9606 (human), in the same format as gene_info.gz.
Choose a set of genes at random, but of what size? The same size as the (as yet undetermined) set of GPI-anchored genes/proteins?
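Whatever size is settled on, drawing the random set could be sketched roughly as follows. The set size is left as a parameter because it is still an open question here; the fixed seed is an assumption added so that a draw can be repeated, and `RandomGeneSet` is a hypothetical name.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

/** Sketch only: draw a reproducible random subset of genes without replacement. */
final class RandomGeneSet {
  static <T> List<T> sample(List<T> genes, int size, long seed) {
    if (size < 0 || size > genes.size()) {
      throw new IllegalArgumentException("size must be between 0 and " + genes.size());
    }
    List<T> copy = new ArrayList<>(genes);          // leave the input list untouched
    Collections.shuffle(copy, new Random(seed));    // same seed, same order
    return new ArrayList<>(copy.subList(0, size));  // first 'size' genes of the shuffle
  }
}
```

Repeating the draw with different seeds would give a set of random gene sets of equal size to compare against the genes of interest.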
If I get the list of associated HPO terms from ALL_SOURCES_ALL_FREQUENCIES_genes_to_phenotype.txt, I can then find all the ancestors of those phenotypes using the getAncestorTerms method of phenol. Alternatively, I could get the information from ALL_SOURCES_ALL_FREQUENCIES_phenotype_to_genes.txt, which already takes the ontology into account. That would mean parsing the entire file, selecting the lines relating to the genes of interest, and collecting all the HPO terms from those lines.
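Either way, the parsing step is the same: read the file, keep the lines for the genes of interest, and collect their HPO term ids. A minimal sketch follows, assuming a tab-separated file with `#` comment lines. The column positions are passed in rather than hard-coded because the layout of the two files is an assumption here and should be checked against each file's header; `GeneTermCollector` is a hypothetical name.

```java
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

/** Sketch only: collect HPO term ids per gene from annotation file lines. */
final class GeneTermCollector {
  /**
   * @param geneCol 0-based column holding the gene id (assumed layout)
   * @param termCol 0-based column holding the HPO term id (assumed layout)
   */
  static Map<String, Set<String>> collect(Iterable<String> lines, int geneCol, int termCol,
                                          Set<String> genesOfInterest) {
    Map<String, Set<String>> termsByGene = new TreeMap<>();
    for (String line : lines) {
      if (line.isEmpty() || line.startsWith("#")) continue;    // skip header/comments
      String[] f = line.split("\t", -1);
      if (f.length <= Math.max(geneCol, termCol)) continue;    // skip short lines
      if (!genesOfInterest.contains(f[geneCol])) continue;     // not a gene we want
      termsByGene.computeIfAbsent(f[geneCol], k -> new TreeSet<>()).add(f[termCol]);
    }
    return termsByGene;
  }
}
```

If genes_to_phenotype.txt turns out to be laid out as gene id, gene symbol, term name, term id, the call would be `collect(lines, 0, 3, genes)`; for phenotype_to_genes.txt the two indices would simply be swapped to wherever that file keeps them.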
PGAP5 does not appear in the NCBI list of human protein-coding genes because it is a synonym for MPPE1, id 65258. https://www.ncbi.nlm.nih.gov/gene/?term=PGAP5%5Bsym%5D
Implemented in an R notebook; see folder gsppc.
http://compbio.charite.de/jenkins/job/hpo.annotations.monthly/lastSuccessfulBuild/artifact/annotation/ALL_SOURCES_ALL_FREQUENCIES_genes_to_phenotype.txt