openvax / pyensembl

Python interface to access reference genome features (such as genes, transcripts, and exons) from Ensembl
Apache License 2.0
365 stars 68 forks source link

gene class should have a method to get its sequence #270

Open antonkulaga opened 1 year ago

antonkulaga commented 1 year ago

I see that transcripts have the method to get sequences while the gene class does not. Would be nice to be able to get genomic DNA of gene as well.

iskandr commented 1 year ago

PyEnsembl doesn't currently use a full genomic FASTA -- it's possible to point it at one but the sequences would differ from the transcript sequences (even at the same coordinates), which I think would be pretty confusing.

rraadd88 commented 5 months ago

In case there would be a plan to extend pyensembl's compatibility with full genome, I have a few functions in one of repositories that are based on pyensembl's way of caching the files. The wrapper function is called download_genome with following parameters

download_genome(
    species: str,
    ensembl_release: int,
    force: bool = False,
    verbose: bool = False
) → str

So I would be happy to contribute, if there would be a plan to extend compatibility in the future.

Following couple of things that could be challenging though:

  1. additional dependencies in case of a large genome (e.g. human) for fetching the sequences e.g. using UCSC 2bit format, as implemented in my repository.
  2. I don't have a specific example, but for some species, the assembly number for the genome can be different from the one for transcripts and proteins.
iskandr commented 4 months ago

I have also been thinking about adding this functionality now that I'm working on making the OpenVax tools work more cleanly with large/structural variants (and thus have to contend with more events outside of annotated exons). I would love a contribution but maybe we can talk a bit more about the design?

A few thoughts:

rraadd88 commented 4 months ago

About the last point, I haven't encountered any tool ideally suited to piggy-back on for fetching genome sequences. That's why in my python package, I ended up coupling (1) pyensembl's way to download and cacheing (or use already-downloaded genome) with (2) optional 2bit indexing for large genomes. It works quite well, but it is still to be tested extensively, e.g. with different species etc.

There is also a web API that I know. It is quite fast, but (1) it does not cover all the genomes and (2) a limit on number/rate of queries is expected.

In my experience, the challenge lies with the large genome e.g. human. For the smaller ones, the indexed genomes can be downloaded from ensembl (e.g.) and then basic tools such as pyfaidx can fetch sequences at decent rate.

For large genomes though, using 2bit is the fastest approach, to my knowledge. The required tools can be downloaded from conda/mamba using this command:

conda install install bioconda::ucsc-fatotwobit bioconda::ucsc-twobittofa bioconda::ucsc-twobitinfo # options: conda/mamba

Overall, I would vote for a 2bit based approach which I have already implemented in my repo.

One thing to note however, in my implementation, an intermediate bed file is created for fetching the sequences. This is ideally suited for fetching sequences in batches, not one at a time.