openvax / pyensembl

Python interface to access reference genome features (such as genes, transcripts, and exons) from Ensembl
Apache License 2.0

Discordant number of Exons #110

Closed alec-djinn closed 9 years ago

alec-djinn commented 9 years ago

I was checking the exons of the gene TTC28 on GRCh37. pyensembl (0.6.8) found 39 exons for that gene, but other sources (also based on GRCh37) state that there are only 23. Could you explain why there is such a difference?

The links to the other sources are:
http://www.ensembl.org/Homo_sapiens/Transcript/Summary?db=core;g=ENSG00000100154;r=22:27978014-28679865;t=ENST00000397906
http://www.nextprot.org/db/entry/NX_Q96AY4/exons
http://www.omim.org/entry/615098
http://www.rcsb.org/pdb/gene/TTC28?chromosome=chr22&range=27999260

The code I have used is the following:

 >>> from pyensembl import EnsemblRelease
 >>> data = EnsemblRelease(75, auto_download=True)
 >>> data
 EnsemblRelease(release=75, species=homo_sapiens, genome=GRCh37)
 >>> a = data.exon_ids_of_gene_name('TTC28')
 INFO:root:Cached file Homo_sapiens.GRCh37.75.gtf from URL ftp://ftp.ensembl.org/pub/release-75/gtf/homo_sapiens/Homo_sapiens.GRCh37.75.gtf.gz
 >>> a
['ENSE00001556591', 'ENSE00001551541', 'ENSE00003640178', 'ENSE00003492820', 'ENSE00001492788', 'ENSE00001492787', 'ENSE00001231767', 'ENSE00000879682', 'ENSE00001173060', 'ENSE00001173052', 'ENSE00001173076', 'ENSE00000651956', 'ENSE00001297696', 'ENSE00000879679', 'ENSE00000651949', 'ENSE00000651948', 'ENSE00000651947', 'ENSE00003685416', 'ENSE00000651945', 'ENSE00001321146', 'ENSE00000651944', 'ENSE00003566454', 'ENSE00001559219', 'ENSE00001726746', 'ENSE00001625734', 'ENSE00001852426', 'ENSE00003550588', 'ENSE00001840432', 'ENSE00001878331', 'ENSE00003602135', 'ENSE00001859545', 'ENSE00001757870', 'ENSE00001654926', 'ENSE00001862939', 'ENSE00003566593', 'ENSE00003667769', 'ENSE00001817484', 'ENSE00001938107', 'ENSE00001917942']
 >>> len(a)
 39

iskandr commented 9 years ago

Hey Alec,

You're seeing all the distinct exon IDs across all the transcripts of TTC28, not the exons of a single transcript. The primary transcript in Ensembl release 75 has 23 exons, and the others are either incomplete or are retained processed transcripts.

In [1]: for transcript in ensembl_grch37.genes_by_name("TTC28")[0].transcripts:
   ...:     print(transcript.name, len(transcript.exons))
   ...:
TTC28-001 23
TTC28-005 6
TTC28-006 3
TTC28-004 3
TTC28-003 3
TTC28-007 4
TTC28-002 2
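
The difference between the two counts can be reproduced without downloading any annotation data. The following is only a sketch: `exon_counts` is a hypothetical helper (not part of pyensembl), and the toy gene, transcript names, and exon IDs are made up for illustration. It assumes the pyensembl-style attributes `genes_by_name`, `transcripts`, `exons`, `name`, and `exon_id`, so it should also accept a real `EnsemblRelease` object in place of the stand-in.

```python
from types import SimpleNamespace


def exon_counts(genome, gene_name):
    """Return (exon count per transcript, number of distinct exon IDs) for a gene.

    `genome` is any object with a pyensembl-style `genes_by_name` method.
    This helper is hypothetical and not part of pyensembl itself.
    """
    per_transcript = {}
    distinct_ids = set()
    for gene in genome.genes_by_name(gene_name):
        for transcript in gene.transcripts:
            ids = [exon.exon_id for exon in transcript.exons]
            per_transcript[transcript.name] = len(ids)
            # Exons shared between transcripts are counted only once here.
            distinct_ids.update(ids)
    return per_transcript, len(distinct_ids)


# Stand-in objects with made-up names, so the sketch runs without a GTF download.
def _transcript(name, exon_ids):
    exons = [SimpleNamespace(exon_id=e) for e in exon_ids]
    return SimpleNamespace(name=name, exons=exons)


toy_gene = SimpleNamespace(transcripts=[
    _transcript("GENE-001", ["E1", "E2", "E3", "E4"]),
    _transcript("GENE-002", ["E1", "E2b"]),
    _transcript("GENE-003", ["E3", "E4", "E5"]),
])
toy_genome = SimpleNamespace(genes_by_name=lambda name: [toy_gene])

counts, n_distinct = exon_counts(toy_genome, "GENE")
print(counts)      # {'GENE-001': 4, 'GENE-002': 2, 'GENE-003': 3}
print(n_distinct)  # 6
```

In the toy data the longest transcript has 4 exons, yet the gene has 6 distinct exon IDs, because the shorter transcripts contribute exons of their own. The same effect explains 39 versus 23 for TTC28.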

alec-djinn commented 9 years ago

Clear! Thank you.