openvax / pyensembl

Python interface to access reference genome features (such as genes, transcripts, and exons) from Ensembl
Apache License 2.0

Difference between transcript.exon_intervals and transcript.exons #221

Open ysbioinfo opened 5 years ago

ysbioinfo commented 5 years ago

Hi, thanks for developing such an excellent tool! I want to use pyensembl to get all exon coordinates of a transcript. transcript.exon_intervals seems to be the most straightforward way, but the coordinates I get from it are not the same as the start and end of each exon in transcript.exons. For example, take the transcript ENST00000321265:

Transcript(transcript_id='ENST00000321265', transcript_name='NUDC-001', gene_id='ENSG00000090273', biotype='protein_coding', contig='1', start=27248217, end=27273353, strand='+', genome='User-defined')

When I use transcript.exons, it shows there are 9 exons:

[Exon(exon_id='ENSE00001390272', gene_id='ENSG00000090273', gene_name='NUDC', contig='1', start=27248217, end=27248420, strand='+'),
 Exon(exon_id='ENSE00003537130', gene_id='ENSG00000090273', gene_name='NUDC', contig='1', start=27250580, end=27250657, strand='+'),
 Exon(exon_id='ENSE00000872663', gene_id='ENSG00000090273', gene_name='NUDC', contig='1', start=27267948, end=27268151, strand='+'),
 Exon(exon_id='ENSE00001222819', gene_id='ENSG00000090273', gene_name='NUDC', contig='1', start=27268244, end=27268309, strand='+'),
 Exon(exon_id='ENSE00003679096', gene_id='ENSG00000090273', gene_name='NUDC', contig='1', start=27269151, end=27269267, strand='+'),
 Exon(exon_id='ENSE00001222807', gene_id='ENSG00000090273', gene_name='NUDC', contig='1', start=27269362, end=27269556, strand='+'),
 Exon(exon_id='ENSE00001222803', gene_id='ENSG00000090273', gene_name='NUDC', contig='1', start=27271881, end=27271964, strand='+'),
 Exon(exon_id='ENSE00001222797', gene_id='ENSG00000090273', gene_name='NUDC', contig='1', start=27272059, end=27272177, strand='+'),
 Exon(exon_id='ENSE00001423146', gene_id='ENSG00000090273', gene_name='NUDC', contig='1', start=27272621, end=27273353, strand='+')]

But when I use transcript.exon_intervals, there are 2049 intervals:

[(27248217, 27248420),
 (27250580, 27250657),
 (27267948, 27268151),
 (27268244, 27268309),
 (27269151, 27269267),
 (27269362, 27269556),
 (27271881, 27271964),
 (27272059, 27272177),
 (27272621, 27273353),
 (27248421, 27250579),
 (27250658, 27267947),
 (27268152, 27268243),
 (27268310, 27269150),
 (27269268, 27269361),
 (27269557, 27271880),
 (27271965, 27272058),
 (27272178, 27272620),
 (27248421, 27250579),
...

The first 9 intervals are the same as those in transcript.exons. Why is there such a difference, and which one should I trust (exons or exon_intervals)? Thanks! (I used Ensembl release 75, human.)
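To make the discrepancy concrete, here is a minimal sketch that classifies the reported intervals using only the coordinates pasted above. It does not call pyensembl: the `Exon` namedtuple is a stand-in for pyensembl's Exon objects, and `gaps_between` and `classify_intervals` are hypothetical helper names introduced for illustration. On this data, every interval after the first 9 matches a gap between two consecutive exons, which suggests (but does not prove) that non-exon features are being returned alongside the exons.

```python
from collections import namedtuple

# Stand-in for pyensembl's Exon objects; only .start and .end are used here.
Exon = namedtuple("Exon", ["exon_id", "start", "end"])

# Coordinates copied from the transcript.exons output above.
exons = [
    Exon("ENSE00001390272", 27248217, 27248420),
    Exon("ENSE00003537130", 27250580, 27250657),
    Exon("ENSE00000872663", 27267948, 27268151),
    Exon("ENSE00001222819", 27268244, 27268309),
    Exon("ENSE00003679096", 27269151, 27269267),
    Exon("ENSE00001222807", 27269362, 27269556),
    Exon("ENSE00001222803", 27271881, 27271964),
    Exon("ENSE00001222797", 27272059, 27272177),
    Exon("ENSE00001423146", 27272621, 27273353),
]

# The visible part of the transcript.exon_intervals output above (truncated there).
reported_intervals = [
    (27248217, 27248420), (27250580, 27250657), (27267948, 27268151),
    (27268244, 27268309), (27269151, 27269267), (27269362, 27269556),
    (27271881, 27271964), (27272059, 27272177), (27272621, 27273353),
    (27248421, 27250579), (27250658, 27267947), (27268152, 27268243),
    (27268310, 27269150), (27269268, 27269361), (27269557, 27271880),
    (27271965, 27272058), (27272178, 27272620), (27248421, 27250579),
]


def gaps_between(coords):
    """Return the inclusive (start, end) gaps between consecutive intervals."""
    ordered = sorted(coords)
    return [
        (prev_end + 1, next_start - 1)
        for (_, prev_end), (next_start, _) in zip(ordered, ordered[1:])
    ]


def classify_intervals(intervals, exons):
    """Split de-duplicated intervals into exonic, inter-exon gap, and other."""
    exon_set = {(e.start, e.end) for e in exons}
    gap_set = set(gaps_between(exon_set))
    unique = list(dict.fromkeys(intervals))  # drop duplicates, keep order
    return {
        "exonic": [iv for iv in unique if iv in exon_set],
        "gaps": [iv for iv in unique if iv in gap_set],
        "other": [iv for iv in unique if iv not in exon_set | gap_set],
    }


result = classify_intervals(reported_intervals, exons)
print(len(result["exonic"]), len(result["gaps"]), len(result["other"]))
```

With a live transcript object, the same check could be run by passing transcript.exons and transcript.exon_intervals in place of the hard-coded lists, since only the start and end attributes are read.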

Yang