schae234 closed 6 years ago
OK. I've put a few lists of gene IDs in the res/ folder.
unknown-horse.txt: horse genes with no known mouse or human orthologs in the Ensembl database
horse-no-cow-exon-seq.txt: exon sequences of horse genes with no known cow orthologs
Am I on the right track with this? Or am I doing it wrong...
Ensembl peptide sequences: ftp://ftp.ensembl.org/pub/release-84/fasta/equus_caballus/pep/
Cat the "all" and "abinitio" files together to make one big FASTA file.
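If you would rather do the concatenation from Python than from the shell, here is a minimal sketch. The function names `open_maybe_gzip` and `concat_fasta` are made up for illustration, and the sketch assumes the downloaded files may still be gzip-compressed (ending in `.gz`).

```python
import gzip
from pathlib import Path


def open_maybe_gzip(path, mode="rb"):
    """Open a file, transparently decompressing it if its name ends in .gz."""
    path = Path(path)
    if path.suffix == ".gz":
        return gzip.open(path, mode)
    return open(path, mode)


def concat_fasta(inputs, output):
    """Concatenate several FASTA files into one uncompressed FASTA file."""
    with open(output, "wb") as out:
        for path in inputs:
            last = b"\n"
            with open_maybe_gzip(path) as handle:
                for line in handle:
                    out.write(line)
                    last = line
            # Make sure the next file's first header starts on its own line.
            if not last.endswith(b"\n"):
                out.write(b"\n")


# Example call (file names are placeholders, substitute the real downloads):
# concat_fasta(["horse.pep.all.fa.gz", "horse.pep.abinitio.fa.gz"], "horse.pep.combined.fa")
```

The newline check only matters when a file does not end with a line break. Without it, the first header of the next file would be glued onto the last sequence line.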
Use cases for the Lololog package
The functional unit we use in genetics is the gene. The foundation of molecular biology is the idea that DNA is transcribed into RNA (the copies are called transcripts), which is then translated into proteins. Proteins are the units that make up a cell, and they perform different tasks depending on their physical composition.
Mutations in a gene's sequence can lead to slightly different physical arrangements in the protein, and those changes can eventually alter how the protein functions. Most of the time, mutations do nothing! Genetics attempts to find the specific mutation(s) responsible for specific differences in physical characteristics, e.g. which mutations cause lactose intolerance?
Use Case 1: using known gene function in other species to describe unknown function
Despite large differences between species, many gene functions have been shown to be highly conserved across species, especially for "core" biological processes. For example, plants all perform photosynthesis, and they largely share the same cellular structures for the task because they all evolved from a common ancestor that performed photosynthesis. Each individual species still has the genes it inherited many generations ago, and we can use this common heritage to help identify the function of genes in poorly studied species by using annotations from well-studied species. For example:
Since we know that a gene's sequence is very tightly linked to its function, we can reasonably infer that the gene in the horse probably has an impact on eye color because it is exactly the same as the human gene! Both species probably inherited the original gene from a common ancestor millions of years ago! Genes in different species that evolved from the same gene in a common ancestor are called orthologs.
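To make the idea of borrowing annotations concrete, here is a rough sketch of how known functions could be carried across an ortholog table. This is not the package's API. The function name `transfer_annotations`, the shape of the inputs, and the gene names in the example are all placeholders invented for illustration.

```python
from collections import defaultdict


def transfer_annotations(orthologs, annotations):
    """Give each target-species gene the annotations of its orthologs.

    orthologs:   iterable of (target_gene, reference_gene) pairs
    annotations: dict mapping reference_gene -> iterable of functional terms
    """
    inferred = defaultdict(set)
    for target_gene, reference_gene in orthologs:
        inferred[target_gene].update(annotations.get(reference_gene, ()))
    # Drop target genes whose orthologs carried no annotation at all.
    return {gene: sorted(terms) for gene, terms in inferred.items() if terms}


# Placeholder identifiers, not real Ensembl IDs.
pairs = [("horse_gene_1", "human_gene_1"), ("horse_gene_2", "human_gene_2")]
known = {"human_gene_1": ["eye color"]}
print(transfer_annotations(pairs, known))
```

Anything inferred this way is only a hypothesis about the horse gene's function. It is a starting point for annotation, not a confirmed result.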
Using this Python package, we want to be able to cover the following use case:
The ortholog problem is a well-studied problem. Before we come up with a potentially new way to find orthologs, we should leverage information that is already out there. One famous database is called OrthoMCL[1]. Unfortunately, OrthoMCL does not use the same gene identifiers that we do in the lab. We use 'Ensembl' gene names, which are specified in this General Feature Format (GFF) file.
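As a starting point for comparing identifiers, here is a minimal sketch that pulls gene IDs out of a GFF file. It assumes GFF3 layout: nine tab-separated columns, the feature type in column 3, and `key=value` attributes in column 9 with the gene identifier stored under `ID`. It also assumes the ID may carry a `gene:` prefix. The function name `gene_ids_from_gff` is hypothetical, so check both assumptions against the actual file.

```python
def gene_ids_from_gff(lines, feature_type="gene"):
    """Collect the ID attribute of every feature of the given type."""
    ids = set()
    for line in lines:
        # Skip blank lines and comment/directive lines.
        if not line.strip() or line.startswith("#"):
            continue
        columns = line.rstrip("\n").split("\t")
        if len(columns) != 9 or columns[2] != feature_type:
            continue
        attributes = dict(
            field.split("=", 1) for field in columns[8].split(";") if "=" in field
        )
        gene_id = attributes.get("ID")
        if gene_id:
            # Assumed convention: IDs may look like "gene:<identifier>".
            ids.add(gene_id.split(":", 1)[-1])
    return ids


# Example call (the path is a placeholder):
# with open("horse.gff3") as handle:
#     horse_ids = gene_ids_from_gff(handle)
```

With the Ensembl IDs in a set, a plain set intersection against whatever identifiers OrthoMCL reports would show quickly how much, if anything, overlaps before building a mapping between the two.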
[1]: OrthoMCL: Identification of Ortholog Groups for Eukaryotic Genomes