openvax / pyensembl

Python interface to access reference genome features (such as genes, transcripts, and exons) from Ensembl
Apache License 2.0

Gene coordinates discrepancies #98

Closed alec-djinn closed 9 years ago

alec-djinn commented 9 years ago

I noticed that the coordinates of some genes returned by pyensembl differ from those published on the NCBI website.

Examples (EnsemblRelease 79 vs GRCh38.p2). I used the following code to get the coordinates:

from pyensembl import EnsemblRelease

data = EnsemblRelease(79, auto_download=True)
_genes = data.genes(contig=2, strand='-')
for item in _genes:    
    _data     = str(item).replace(')','').split(',')
    _id       = _data[0].split('=')[-1]
    _name     = _data[1].split('=')[-1]
    _biotype  = _data[2].split('=')[-1]
    _location = _data[3].split('=')[-1]
    _chr, _ss = _location.split(':')
    _start, _stop = _ss.split('-')

Gene BOK-AS1: pyensembl places it on chr2 at 241544403-241558977, but NCBI lists 241544384..241559143.

Gene CICP10: pyensembl places it on chr2 at 242119856-242120053, but NCBI lists 242119877..242120142.

I could continue with many other examples. Most of the time the coordinates match perfectly, but not always. Why is that? Is it a bug, or am I missing something?
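To put numbers on the two discrepancies reported above, here is a minimal sketch that computes the start offset, end offset, and overlap for each gene. The coordinates are copied from this report; treating both sources as inclusive intervals is an assumption, and the `compare` helper is a hypothetical name, not part of pyensembl.

```python
# Coordinates copied from the report above, as (start, end) on chr2.
# Assumption: both sources report inclusive intervals.
reported = {
    "BOK-AS1": {"pyensembl": (241544403, 241558977), "ncbi": (241544384, 241559143)},
    "CICP10": {"pyensembl": (242119856, 242120053), "ncbi": (242119877, 242120142)},
}


def compare(a, b):
    """Return (start offset, end offset, overlap length) for two inclusive intervals.

    Offsets are a minus b; the overlap is 0 when the intervals are disjoint.
    This helper is hypothetical and only illustrates the arithmetic.
    """
    start_delta = a[0] - b[0]
    end_delta = a[1] - b[1]
    overlap = max(0, min(a[1], b[1]) - max(a[0], b[0]) + 1)
    return start_delta, end_delta, overlap


summary = {
    name: compare(coords["pyensembl"], coords["ncbi"])
    for name, coords in reported.items()
}

for name, (start_delta, end_delta, overlap) in summary.items():
    print(f"{name}: start {start_delta:+d}, end {end_delta:+d}, overlap {overlap} bp")
```

In both cases the two intervals overlap almost entirely and differ only at the ends, by tens to a few hundred bases.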

iskandr commented 9 years ago

These are general differences between RefSeq (the annotation shown on NCBI) and Ensembl. The two annotation sets contain slightly different transcripts, so the gene boundaries can differ.

CICP10 on Ensembl: Chromosome 2: 242,119,856-242,120,053 reverse strand.

CICP10 on RefSeq: 242119877..242120142, complement
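As a rough illustration of why different transcript sets move the gene boundaries, the sketch below assumes a gene's reported span is the smallest interval covering all of its annotated transcripts. The transcript coordinates are made up for illustration and do not correspond to CICP10 or any real gene; `gene_span` is a hypothetical helper, not a pyensembl function.

```python
from typing import List, Tuple

Interval = Tuple[int, int]  # (start, end), inclusive


def gene_span(transcripts: List[Interval]) -> Interval:
    """Smallest interval covering every transcript (hypothetical helper)."""
    starts = [start for start, _ in transcripts]
    ends = [end for _, end in transcripts]
    return min(starts), max(ends)


# Made-up transcript sets for the same gene in two annotation sources.
annotation_a = [(100, 500), (120, 480)]
annotation_b = [(110, 500), (120, 560)]

span_a = gene_span(annotation_a)
span_b = gene_span(annotation_b)

print("annotation A gene span:", span_a)
print("annotation B gene span:", span_b)
```

Under this assumption, one extra or differently delimited transcript in either source is enough to shift the gene's start or end, even when most of the annotation agrees.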

You can read more about the differences in "A comprehensive evaluation of ensembl, RefSeq, and UCSC annotations in the context of RNA-seq read mapping and gene quantification".