openvax / pyensembl

Python interface to access reference genome features (such as genes, transcripts, and exons) from Ensembl
Apache License 2.0

Can't retrieve transcript's sequence from a FASTA file #193

Open vnvkotova opened 6 years ago

vnvkotova commented 6 years ago

Dear pyensembl developers,

I want to get the sequence of a given transcript from a FASTA file. I used the following code to define and initialize my genome:

    data = pyensembl.Genome(
        reference_name='hg38',
        annotation_name='hg38_chr22',
        gtf_path_or_url='...GRCh38.83.gtf',
        transcript_fasta_path_or_url='.../hg38.fa')
    data.index()

After running it, I can get basic information about the genes and transcripts in the files. For example, print(data.transcript_ids(22, '+')) gives me a list of IDs. But I can't get the sequence of a given transcript: print(data.transcript_by_id('ENST00000624155').sequence) prints "None".

I checked several different combinations of GTF and FASTA files. The result was the same for all of them, so I'm certain that the problem is not caused by the files.

I would really appreciate a reply!

Best regards, Nika

joaoe commented 6 years ago

I think it would help with debugging if you pasted the FASTA header for that transcript here. It could be that the code that parses the FASTA file does not recognize the header syntax.
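A minimal sketch of that check, using only the Python standard library. The helper names (`fasta_headers`, `has_transcript_id`) are hypothetical and are not part of pyensembl. The matching rule (first whitespace-separated token of the header, with any `.version` suffix dropped) is an assumption about how transcript IDs are typically keyed, not a statement of what pyensembl's parser does.

```python
import gzip
from itertools import islice


def _open_text(path):
    """Open a plain or gzip-compressed FASTA file in text mode."""
    return gzip.open(path, "rt") if str(path).endswith(".gz") else open(path, "rt")


def fasta_headers(path, limit=5):
    """Return the first `limit` header lines, without the leading '>'."""
    with _open_text(path) as handle:
        headers = (line[1:].rstrip("\n") for line in handle if line.startswith(">"))
        return list(islice(headers, limit))


def has_transcript_id(path, transcript_id):
    """Return True if any header's first token matches `transcript_id`.

    Assumption: the ID is the first whitespace-separated token of the
    header, optionally followed by a '.version' suffix.
    """
    with _open_text(path) as handle:
        for line in handle:
            if not line.startswith(">"):
                continue
            tokens = line[1:].split()
            if tokens and tokens[0].split(".")[0] == transcript_id:
                return True
    return False


# Example usage (the path is a placeholder, substitute your own file):
# print(fasta_headers("transcripts.fa"))
# print(has_transcript_id("transcripts.fa", "ENST00000624155"))
```

If the headers turn out to be chromosome names (for example `>chr22`, as in a whole-genome FASTA) rather than transcript IDs, no record could be looked up by an ID such as ENST00000624155, which might explain a `None` sequence.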

trappedinspacetime commented 5 years ago

Hi, I just tested the command you posted, and it works in my case:

    pyensembl install --release 93 --species human
    python3
    >>> from pyensembl import EnsemblRelease
    >>> data = EnsemblRelease(93)
    >>> print(data.transcript_by_id('ENST00000624155').sequence)
    INFO:pyensembl.sequence_data:Loaded sequence dictionary from /home/*****/.cache/pyensembl/GRCh38/ensembl93/Homo_sapiens.GRCh38.cdna.all.fa.gz.pickle
    INFO:pyensembl.sequence_data:Loaded sequence dictionary from /home/*****/.cache/pyensembl/GRCh38/ensembl93/Homo_sapiens.GRCh38.ncrna.fa.gz.pickle
    ATGGTGGTGGCAACAGAGATGGCAGCGCGGCTGGAGTGTTAGGAGGGTGGCCTGAGCAGTAGGATTGGGGCTGGAGCAGTAAGATGGCAGCCGGAGCGGTTTTTCTGGCATTGTCTGCCCAGCTGCTCCAAGCCAGACTGATGAAGGAGGAGTCCCCAGTGGTGAGCTGGAGGTTGGAGCCTGAAGATGGCACAGCTCTGTGATTCATCTTCTGCGGTTGTGGCAGCCACGGTGATGGAGACGGCAGCTCAACAGGAGCAATAGGAGGGTACCCATGGAGGCCAAGTG