openvax / pyensembl

Python interface to access reference genome features (such as genes, transcripts, and exons) from Ensembl
Apache License 2.0

coding_sequence_position_ranges and coding_sequence don't agree #176

Open gshiba opened 7 years ago

gshiba commented 7 years ago
# python 2.7.3
>>> import pyensembl
>>> pyensembl.__version__
'1.0.3'
>>> ens = pyensembl.EnsemblRelease(85)
>>> t = ens.transcript_by_id('ENST00000311936')

>>> len(t.coding_sequence)
567

>>> sum([b-a+1 for a, b in t.coding_sequence_position_ranges])
564  # does not agree with above

>>> sorted(t.coding_sequence_position_ranges)
[(25209798, 25209911),  # table at link below says (25209795, 25209911)
 (25225614, 25225773),                                     ^
 (25227234, 25227412),
 (25245274, 25245384)]
# link: https://www.ncbi.nlm.nih.gov/CCDS/CcdsBrowse.cgi?REQUEST=CCDS&DATA=CCDS8702

Not sure if this is by design or down to my lack of understanding of biology (I'm not a biologist). I've tried a few other transcripts (e.g., ENST00000371085, ENST00000275493), and they all showed the same 3-base difference in length.

PS: My end goal is to implement a function/method that does the opposite of pyensembl.transcript.Transcript.spliced_offset (accepts an offset into the coding sequence; returns the absolute position on genome).
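A minimal sketch of such a reverse mapping, written in plain Python so it does not depend on pyensembl internals. The helper name `cds_offset_to_genomic` is hypothetical (it is not part of pyensembl), and it assumes the ranges are inclusive `(start, end)` tuples as printed above, that the offset is 0-based, and that the strand is passed in explicitly (`"-"` for this transcript, since KRAS is on the reverse strand):

```python
def cds_offset_to_genomic(ranges, strand, offset):
    """Map a 0-based offset into the coding sequence to a genomic position.

    ranges: inclusive (start, end) genomic intervals, in any order
    strand: "+" or "-"
    """
    if strand not in ("+", "-"):
        raise ValueError("strand must be '+' or '-'")
    if offset < 0:
        raise ValueError("offset must be non-negative")

    # Visit the intervals in transcription order: ascending on the
    # forward strand, descending on the reverse strand.
    ordered = sorted(ranges, reverse=(strand == "-"))

    remaining = offset
    for start, end in ordered:
        length = end - start + 1
        if remaining < length:
            # On the reverse strand the first coding base of an interval
            # is its highest coordinate.
            return start + remaining if strand == "+" else end - remaining
        remaining -= length

    raise ValueError("offset %d is past the end of the given ranges" % offset)


# Ranges copied from the session above (stop codon not included).
kras_ranges = [
    (25209798, 25209911),
    (25225614, 25225773),
    (25227234, 25227412),
    (25245274, 25245384),
]

first_base = cds_offset_to_genomic(kras_ranges, "-", 0)
last_base = cds_offset_to_genomic(kras_ranges, "-", 563)
```

Because the ranges as reported cover only 564 bases, offsets 564 to 566 (the last three bases of `t.coding_sequence`) raise a `ValueError` in this sketch.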

arogozhnikov commented 3 years ago

Really old issue, but this should be the stop codon (3 base pairs), which sits exactly where you see the shift.
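The arithmetic can be checked without pyensembl. The sketch below uses hypothetical helpers (`total_length`, `with_stop_codon`) and assumes the stop codon is contiguous with the 3'-most coding interval, which is what the CCDS coordinates quoted in the original report suggest for this transcript; a stop codon split across an exon boundary would need different handling:

```python
def total_length(ranges):
    """Sum the lengths of inclusive (start, end) intervals."""
    return sum(end - start + 1 for start, end in ranges)


def with_stop_codon(ranges, strand):
    """Return sorted ranges with the 3'-most interval extended by 3 bases.

    Assumption: the stop codon directly follows the last coding interval
    in transcription order and is not split by an intron.
    """
    ordered = sorted(ranges)
    if strand == "-":
        # Reverse strand: the 3' end is the lowest genomic coordinate.
        start, end = ordered[0]
        ordered[0] = (start - 3, end)
    else:
        start, end = ordered[-1]
        ordered[-1] = (start, end + 3)
    return ordered


# Ranges as reported in the issue for ENST00000311936 (reverse strand).
reported = [
    (25209798, 25209911),
    (25225614, 25225773),
    (25227234, 25227412),
    (25245274, 25245384),
]
extended = with_stop_codon(reported, "-")

print(total_length(reported))  # 564
print(total_length(extended))  # 567
print(extended[0])             # (25209795, 25209911)
```

With the three stop-codon bases added, the total matches `len(t.coding_sequence)` and the first interval matches the CCDS table.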

iskandr commented 3 years ago

Paging back into this -- is the problem that the stop codon gets excluded from the genomic coordinate intervals?

arogozhnikov commented 3 years ago

@iskandr I think so. Not sure how it should behave.