Open antonkulaga opened 1 year ago
PyEnsembl doesn't currently use a full genomic FASTA -- it's possible to point it at one but the sequences would differ from the transcript sequences (even at the same coordinates), which I think would be pretty confusing.
In case there would be a plan to extend pyensembl's compatibility with full genome, I have a few functions in one of repositories that are based on pyensembl's way of caching the files.
The wrapper function is called download_genome
with following parameters
download_genome(
species: str,
ensembl_release: int,
force: bool = False,
verbose: bool = False
) → str
So I would be happy to contribute, if there would be a plan to extend compatibility in the future.
Following couple of things that could be challenging though:
I have also been thinking about adding this functionality now that I'm working on making the OpenVax tools work more cleanly with large/structural variants (and thus have to contend with more events outside of annotated exons). I would love a contribution but maybe we can talk a bit more about the design?
A few thoughts:
pyensembl install
shouldn't do it by default)About the last point, I haven't encountered any tool ideally suited to piggy-back on for fetching genome sequences. That's why in my python package, I ended up coupling (1) pyensembl's way to download and cacheing (or use already-downloaded genome) with (2) optional 2bit indexing for large genomes. It works quite well, but it is still to be tested extensively, e.g. with different species etc.
There is also a web API that I know. It is quite fast, but (1) it does not cover all the genomes and (2) a limit on number/rate of queries is expected.
In my experience, the challenge lies with the large genome e.g. human. For the smaller ones, the indexed genomes can be downloaded from ensembl (e.g.) and then basic tools such as pyfaidx can fetch sequences at decent rate.
For large genomes though, using 2bit is the fastest approach, to my knowledge. The required tools can be downloaded from conda/mamba using this command:
conda install install bioconda::ucsc-fatotwobit bioconda::ucsc-twobittofa bioconda::ucsc-twobitinfo # options: conda/mamba
Overall, I would vote for a 2bit based approach which I have already implemented in my repo.
One thing to note however, in my implementation, an intermediate bed
file is created for fetching the sequences. This is ideally suited for fetching sequences in batches, not one at a time.
I see that transcripts have the method to get sequences while the gene class does not. Would be nice to be able to get genomic DNA of gene as well.