schae234 / lololog

A python library for finding orthologous genes amongst species
MIT License

Mapping unknown Equus genes to known Human genes #1

Closed schae234 closed 6 years ago

schae234 commented 8 years ago

Use cases for the Lololog package

The functional unit we use in genetics is the gene. The central dogma of molecular biology holds that DNA is transcribed into RNA (the resulting molecules are called transcripts), which is then translated into proteins. Proteins are the units that make up a cell, and they perform different tasks based on their literal physical composition.

Mutations in a gene's sequence can lead to slightly different physical arrangements in its protein, which can in turn change how the protein functions. Most of the time, mutations do nothing! Genetics attempts to find the specific mutation(s) that are responsible for specific differences in physical characteristics, e.g. which mutations cause lactose intolerance?

Use Case 1: using known gene function in other species to describe unknown function

Despite very different makeup between species, many gene functions have been shown to be highly conserved across species, especially for "core" biological processes. For example, plants all perform photosynthesis and largely share the same cellular structure to perform the task, because they all evolved from a common ancestor that performed photosynthesis. Each individual species still has the genes it inherited many generations ago, and we can use this common heritage to help identify the function of genes in poorly studied species by using annotations from well-studied species. For example:

> Horse Gene abc123 : unknown function:
acgcagagcgagatagacgcggagatcgagcatcggctagcgcagctagctagcgtagaggatcgatca

> Human Gene def456 : responsible for eye color:
acgcagagcgagatagacgcggagatcgagcatcggctagcgcagctagctagcgtagaggatcgatca

Since we know that a gene's sequence is very tightly linked to its function, we can reasonably infer that the horse gene probably has an impact on eye color, because its sequence is exactly the same as the human gene's! Both species probably inherited the original gene from a common ancestor millions of years ago. Genes that are present in two species and evolved from the same gene in a common ancestor are called orthologs.
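As a minimal sketch of the comparison above, the snippet below computes percent identity between the two toy sequences from the example. The function name `percent_identity` is made up for illustration and is not part of lololog. It assumes the sequences are already the same length; real genes would need to be aligned first.

```python
def percent_identity(seq_a: str, seq_b: str) -> float:
    """Return the percentage of positions at which two equal-length sequences match."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be the same length (align them first)")
    if not seq_a:
        raise ValueError("sequences must not be empty")
    # Compare position by position, ignoring upper/lower case.
    matches = sum(a == b for a, b in zip(seq_a.lower(), seq_b.lower()))
    return 100.0 * matches / len(seq_a)


# Toy sequences copied from the example above.
horse_abc123 = "acgcagagcgagatagacgcggagatcgagcatcggctagcgcagctagctagcgtagaggatcgatca"
human_def456 = "acgcagagcgagatagacgcggagatcgagcatcggctagcgcagctagctagcgtagaggatcgatca"

print(percent_identity(horse_abc123, human_def456))  # 100.0
```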

Using this python package, we want to be able to cover the following use case:

Greg is interested in horse disease. Through lab experiments he has found a mutation that he thinks is responsible for a muscle disease. Unfortunately, the horse genome is only studied by a few labs throughout the world, so not very many experiments have been done to verify gene function experimentally. Greg knows that on a core, tissue level, horse muscle is not that different from human muscle. The same biological pathways and processes dictate what proteins make up muscle and also how muscles contract and move. Knowing this, Greg is interested in genes that have been experimentally characterized in other, 'model', species. Greg downloads and installs the lololog package. He starts up the command line interface and queries the horse database. He enters the horse gene name, abc123, and retrieves the name of the human gene, def456, based on their sequence similarity. He also finds several mouse genes that have high sequence similarity. Greg pulls up scientific literature on these genes to determine if their function is similar to that seen in horses. Greg is happy and decides to take everyone in the lab out for smoothies :banana: :cherries: :strawberry: :green_apple:
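One way to picture Greg's query is as a ranked similarity search. The sketch below is only an illustration, not lololog's actual interface: `best_matches`, the `min_ratio` cutoff, the mouse gene name `ghi789` and the `unrelated` entry are all made up, and `difflib.SequenceMatcher` from the standard library stands in for a real sequence-similarity tool.

```python
from difflib import SequenceMatcher


def best_matches(query_seq, candidates, min_ratio=0.8):
    """Rank candidate genes by similarity to the query sequence.

    candidates maps a gene name to its sequence. Only genes whose
    similarity ratio is at least min_ratio are returned, best first.
    """
    hits = []
    for gene_id, seq in candidates.items():
        # ratio() is 1.0 for identical strings and approaches 0.0 as they diverge.
        ratio = SequenceMatcher(None, query_seq, seq, autojunk=False).ratio()
        if ratio >= min_ratio:
            hits.append((gene_id, ratio))
    return sorted(hits, key=lambda hit: hit[1], reverse=True)


# Toy data: the horse query, an identical human gene, a hypothetical mouse
# gene differing at one position, and an unrelated sequence.
horse_abc123 = "acgcagagcgagatagacgcggagatcgagcatcggctagcgcagctagctagcgtagaggatcgatca"
candidates = {
    "def456": horse_abc123,             # human gene from the example
    "ghi789": horse_abc123[:-1] + "g",  # hypothetical mouse gene
    "unrelated": "t" * 40,              # hypothetical non-match
}

for gene_id, ratio in best_matches(horse_abc123, candidates):
    print(gene_id, round(ratio, 3))
```

The cutoff of 0.8 is arbitrary; a real pipeline would more likely rely on alignment scores or precomputed ortholog groups than on a raw string ratio.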

Tasks

The ortholog problem is a well-studied problem. Before we come up with a potentially new way to find orthologs, we should leverage information that is already out there. One well-known database is OrthoMCL[1]. Unfortunately, OrthoMCL does not use the same gene identifiers that we do in the lab. We use 'Ensembl' gene names, which are specified in this General Feature Format (GFF) file.

[1]: OrthoMCL: Identification of Ortholog Groups for Eukaryotic Genomes
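As a rough sketch of how the Ensembl gene names could be pulled out of a GFF file, the snippet below reads GFF3-style lines (nine tab-separated columns, with `key=value` attributes in the last column). It assumes gene rows carry an `ID=gene:<name>` attribute; the sample lines and the ID in them are made up for illustration, and `gene_ids_from_gff` is a hypothetical helper, not part of lololog.

```python
def gene_ids_from_gff(lines):
    """Collect gene IDs from GFF3-style lines (assumes an ID=gene:<name> attribute)."""
    ids = []
    for line in lines:
        # Skip blank lines and comment/directive lines.
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        # Column 3 holds the feature type; keep only gene rows.
        if len(fields) != 9 or fields[2] != "gene":
            continue
        attributes = dict(
            pair.split("=", 1) for pair in fields[8].split(";") if "=" in pair
        )
        gene_id = attributes.get("ID")
        if gene_id:
            # Drop the "gene:" prefix if present.
            ids.append(gene_id.split(":", 1)[-1])
    return ids


# Made-up sample rows in GFF3 layout.
sample = [
    "##gff-version 3\n",
    "1\tensembl\tgene\t100\t900\t.\t+\t.\tID=gene:ENSECAG00000000001;biotype=protein_coding\n",
    "1\tensembl\tmRNA\t100\t900\t.\t+\t.\tID=transcript:ENSECAT00000000001;Parent=gene:ENSECAG00000000001\n",
]

print(gene_ids_from_gff(sample))
```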

jackstanek commented 8 years ago

OK. I've put a few lists of gene IDs in the res/ folder.

Am I on the right track with this? Or am I doing it wrong...

schae234 commented 8 years ago

Ensembl peptide sequences: ftp://ftp.ensembl.org/pub/release-84/fasta/equus_caballus/pep/

Cat the "all" and the "abinitio" files together to make one big FASTA.
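If you would rather do this from Python than with `cat`, here is a minimal sketch. The helper name `concat_fasta` is made up, and it assumes the input files are either plain text or gzip-compressed (detected by a `.gz` suffix). It also adds a newline between files if one is missing, so a header never ends up glued to the previous sequence.

```python
import gzip


def concat_fasta(inputs, output):
    """Concatenate FASTA files into one plain-text file; return the record count."""
    records = 0
    with open(output, "w") as out:
        for path in inputs:
            # Read .gz files through gzip, everything else as plain text.
            opener = gzip.open if str(path).endswith(".gz") else open
            with opener(path, "rt") as src:
                last_line = "\n"
                for line in src:
                    out.write(line)
                    records += line.startswith(">")  # header lines start with ">"
                    last_line = line
                # Make sure the next file's first header starts on its own line.
                if not last_line.endswith("\n"):
                    out.write("\n")
    return records
```

Call it with the paths of the two downloaded files and an output path, e.g. `concat_fasta([all_path, abinitio_path], "combined.fa")`, where `all_path` and `abinitio_path` are placeholders for wherever you saved the files.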