openvax / pyensembl

Python interface to access reference genome features (such as genes, transcripts, and exons) from Ensembl
Apache License 2.0

Cannot Access the coding sequence in specific transcript #267

Open HealHer opened 1 year ago

HealHer commented 1 year ago

Hi,

Thank you very much for this package. I use it daily to get sequences and related information about transcripts.

I am confused by one particular example, the transcript ENST00000429617. This is how I load it:

pyensembl install --release 75 --species homo_sapiens
python3
>>> import pyensembl
>>> ensembl = pyensembl.EnsemblRelease(release=75)
>>> tx = ensembl.transcript_by_id("ENST00000429617")

To recover the coding sequence I tried tx.coding_sequence, but it fails with an error stating that the transcript has no start codon.

When I look inside the GTF, there are CDS records associated with this specific transcript (starting from exon 2):

grep "ENST00000429617" Homo_sapiens.GRCh37.75.gtf

10      protein_coding  transcript      115438942       115486178       .       +       .       gene_id "ENSG00000165806"; transcript_id "ENST00000429617"; gene_name "CASP7"; gene_source "ensembl_havana"; gene_biotype "protein_coding"; transcript_name "CASP7-004"; transcript_source "havana"; tag "cds_end_NF"; tag "mRNA_end_NF";
10      protein_coding  exon    115438942       115439108       .       +       .       gene_id "ENSG00000165806"; transcript_id "ENST00000429617"; exon_number "1"; gene_name "CASP7"; gene_source "ensembl_havana"; gene_biotype "protein_coding"; transcript_name "CASP7-004"; transcript_source "havana"; exon_id "ENSE00001604369"; tag "cds_end_NF"; tag "mRNA_end_NF";
10      protein_coding  exon    115457253       115457362       .       +       .       gene_id "ENSG00000165806"; transcript_id "ENST00000429617"; exon_number "2"; gene_name "CASP7"; gene_source "ensembl_havana"; gene_biotype "protein_coding"; transcript_name "CASP7-004"; transcript_source "havana"; exon_id "ENSE00003512533"; tag "cds_end_NF"; tag "mRNA_end_NF";
10      protein_coding  CDS     115457253       115457362       .       +       0       gene_id "ENSG00000165806"; transcript_id "ENST00000429617"; exon_number "2"; gene_name "CASP7"; gene_source "ensembl_havana"; gene_biotype "protein_coding"; transcript_name "CASP7-004"; transcript_source "havana"; protein_id "ENSP00000400094"; tag "cds_end_NF"; tag "mRNA_end_NF";
10      protein_coding  exon    115480791       115480927       .       +       .       gene_id "ENSG00000165806"; transcript_id "ENST00000429617"; exon_number "3"; gene_name "CASP7"; gene_source "ensembl_havana"; gene_biotype "protein_coding"; transcript_name "CASP7-004"; transcript_source "havana"; exon_id "ENSE00003505017"; tag "cds_end_NF"; tag "mRNA_end_NF";
10      protein_coding  CDS     115480791       115480927       .       +       1       gene_id "ENSG00000165806"; transcript_id "ENST00000429617"; exon_number "3"; gene_name "CASP7"; gene_source "ensembl_havana"; gene_biotype "protein_coding"; transcript_name "CASP7-004"; transcript_source "havana"; protein_id "ENSP00000400094"; tag "cds_end_NF"; tag "mRNA_end_NF";
10      protein_coding  exon    115481410       115481538       .       +       .       gene_id "ENSG00000165806"; transcript_id "ENST00000429617"; exon_number "4"; gene_name "CASP7"; gene_source "ensembl_havana"; gene_biotype "protein_coding"; transcript_name "CASP7-004"; transcript_source "havana"; exon_id "ENSE00003555462"; tag "cds_end_NF"; tag "mRNA_end_NF";
10      protein_coding  CDS     115481410       115481538       .       +       2       gene_id "ENSG00000165806"; transcript_id "ENST00000429617"; exon_number "4"; gene_name "CASP7"; gene_source "ensembl_havana"; gene_biotype "protein_coding"; transcript_name "CASP7-004"; transcript_source "havana"; protein_id "ENSP00000400094"; tag "cds_end_NF"; tag "mRNA_end_NF";
10      protein_coding  exon    115485121       115485296       .       +       .       gene_id "ENSG00000165806"; transcript_id "ENST00000429617"; exon_number "5"; gene_name "CASP7"; gene_source "ensembl_havana"; gene_biotype "protein_coding"; transcript_name "CASP7-004"; transcript_source "havana"; exon_id "ENSE00003542525"; tag "cds_end_NF"; tag "mRNA_end_NF";
10      protein_coding  CDS     115485121       115485296       .       +       2       gene_id "ENSG00000165806"; transcript_id "ENST00000429617"; exon_number "5"; gene_name "CASP7"; gene_source "ensembl_havana"; gene_biotype "protein_coding"; transcript_name "CASP7-004"; transcript_source "havana"; protein_id "ENSP00000400094"; tag "cds_end_NF"; tag "mRNA_end_NF";
10      protein_coding  exon    115486064       115486178       .       +       .       gene_id "ENSG00000165806"; transcript_id "ENST00000429617"; exon_number "6"; gene_name "CASP7"; gene_source "ensembl_havana"; gene_biotype "protein_coding"; transcript_name "CASP7-004"; transcript_source "havana"; exon_id "ENSE00001632192"; tag "cds_end_NF"; tag "mRNA_end_NF";
10      protein_coding  CDS     115486064       115486178       .       +       0       gene_id "ENSG00000165806"; transcript_id "ENST00000429617"; exon_number "6"; gene_name "CASP7"; gene_source "ensembl_havana"; gene_biotype "protein_coding"; transcript_name "CASP7-004"; transcript_source "havana"; protein_id "ENSP00000400094"; tag "cds_end_NF"; tag "mRNA_end_NF";
10      protein_coding  UTR     115438942       115439108       .       +       .       gene_id "ENSG00000165806"; transcript_id "ENST00000429617"; gene_name "CASP7"; gene_source "ensembl_havana"; gene_biotype "protein_coding"; transcript_name "CASP7-004"; transcript_source "havana"; tag "cds_end_NF"; tag "mRNA_end_NF";

And this is also confirmed by the UCSC genome browser (link to specific region)

Can you please point me to where I can access this coding sequence?
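
The grep check above can also be done programmatically. The following is a minimal sketch in plain Python, assuming tab-separated GTF lines as in the Ensembl file; the helper name `summarize_transcript_features` is hypothetical and is not part of pyensembl. It counts the feature types recorded for one transcript and collects its CDS intervals, which makes it easy to see that the listing contains CDS records but no `start_codon` record.

```python
from collections import Counter


def summarize_transcript_features(gtf_lines, transcript_id):
    """Count feature types and collect CDS intervals for one transcript.

    Hypothetical helper for illustration; assumes standard 9-column,
    tab-separated GTF lines.
    """
    feature_counts = Counter()
    cds_intervals = []
    needle = 'transcript_id "%s"' % transcript_id
    for line in gtf_lines:
        # Skip header comments and records for other transcripts.
        if line.startswith("#") or needle not in line:
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9:
            continue
        feature, start, end = fields[2], int(fields[3]), int(fields[4])
        feature_counts[feature] += 1
        if feature == "CDS":
            cds_intervals.append((start, end))
    return feature_counts, sorted(cds_intervals)


# Three records abbreviated from the grep output above.
example = [
    '10\tprotein_coding\texon\t115438942\t115439108\t.\t+\t.\t'
    'transcript_id "ENST00000429617"; exon_number "1";',
    '10\tprotein_coding\tCDS\t115457253\t115457362\t.\t+\t0\t'
    'transcript_id "ENST00000429617"; exon_number "2";',
    '10\tprotein_coding\tCDS\t115480791\t115480927\t.\t+\t1\t'
    'transcript_id "ENST00000429617"; exon_number "3";',
]
counts, cds = summarize_transcript_features(example, "ENST00000429617")
print(dict(counts))
print(cds)
```

To run it on the full file, pass an open file handle for Homo_sapiens.GRCh37.75.gtf instead of the `example` list. A count of zero for `start_codon` would match the error message reported above.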