Closed a-jacobo closed 9 months ago
Hey Adrian,
The simplest fix for you is to download the GTF & FASTA files from Ensembl and create a Genome
object with the local FASTA & GTF paths. I can also add Zebrafish to the list of supported species, but I'm going to first finish up fixing the Travis unit tests (which currently aren't working).
Thanks!
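A minimal sketch of that workaround for zebrafish. It assumes the keyword names used by recent pyensembl releases (gtf_path_or_url, transcript_fasta_paths_or_urls; older releases may spell them differently) and the usual Ensembl FTP file naming. The helper names ensembl_urls and local_genome are made up for illustration, and GRCz11 as the release-92 assembly name is an assumption to check against the FTP listing.

```python
def ensembl_urls(species, assembly, release):
    """Build the Ensembl FTP URLs for the GTF and cDNA FASTA of one release."""
    base = "ftp://ftp.ensembl.org/pub/release-{}".format(release)
    prefix = species.capitalize()  # danio_rerio -> Danio_rerio
    gtf = "{}/gtf/{}/{}.{}.{}.gtf.gz".format(
        base, species, prefix, assembly, release)
    cdna = "{}/fasta/{}/cdna/{}.{}.cdna.all.fa.gz".format(
        base, species, prefix, assembly)
    return gtf, cdna


def local_genome(gtf_path, cdna_path, reference_name):
    """Create and index a Genome from files that were downloaded by hand."""
    # Imported lazily so the URL helper above works without pyensembl installed.
    from pyensembl import Genome

    genome = Genome(
        reference_name=reference_name,
        annotation_name="ensembl_local",  # free-form label, chosen here
        gtf_path_or_url=gtf_path,
        transcript_fasta_paths_or_urls=[cdna_path],
    )
    genome.index()  # parse the GTF/FASTA and build the local database
    return genome
```

Once the two files from ensembl_urls("danio_rerio", "GRCz11", 92) have been downloaded, passing their local paths to local_genome should give an object that can be queried like an EnsemblRelease.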
Hi @a-jacobo ,
Here's how I install the genome of yeast (my favourite organism) in pyensembl.
Open this file in a text editor (assuming that you are using anaconda on a Linux system):
/home/{user}/anaconda/envs/beditor/lib/python3.6/site-packages/pyensembl/species.py
and append the genome info in this format.
Note: the parts enclosed in {} are the ones you would need to fill in for your favourite organism.
{speciesname} = Species.register(
    latin_name="{}",
    synonyms=["{}"],
    reference_assemblies={
        "{}": (76, MAX_ENSEMBL_RELEASE),
    })
e.g.
yeast = Species.register(
    latin_name="saccharomyces_cerevisiae",
    synonyms=["yeast"],
    reference_assemblies={
        "R64-1-1": (76, MAX_ENSEMBL_RELEASE),
    })
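The same registration can probably be done at runtime instead of editing the installed species.py. The sketch below assumes that Species and MAX_ENSEMBL_RELEASE are both importable from pyensembl.species (they are used together in that file above) and that Species.register accepts the keyword arguments shown in the template. The helpers species_entry and register_species are hypothetical names, and a runtime registration only lasts for the current Python session, so it would not affect the pyensembl command line tool.

```python
def species_entry(latin_name, synonyms, assembly, first_release=76,
                  last_release=None):
    """Collect the arguments of Species.register in a plain dict.

    last_release=None means "up to MAX_ENSEMBL_RELEASE" and is resolved
    in register_species below.
    """
    return {
        "latin_name": latin_name,
        "synonyms": list(synonyms),
        "reference_assemblies": {assembly: (first_release, last_release)},
    }


def register_species(entry):
    """Register one entry with pyensembl for the current session only."""
    # Assumption: both names live in pyensembl.species, as in the file above.
    from pyensembl.species import Species, MAX_ENSEMBL_RELEASE

    assemblies = {
        name: (first, MAX_ENSEMBL_RELEASE if last is None else last)
        for name, (first, last) in entry["reference_assemblies"].items()
    }
    return Species.register(
        latin_name=entry["latin_name"],
        synonyms=entry["synonyms"],
        reference_assemblies=assemblies,
    )
```

If the species or assembly is already registered in the installed pyensembl version, the call may raise an error instead of returning a new Species.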
Hi @iskandr , looking at species.py, it seems to me that adding a new species could be triggered from a bash command. I will send a PR if I manage to do that.
@rraadd88 PRs for additional species would be extremely welcome!
@iskandr I think I just made a solution for adding species. Before sending a PR I just want to know how I would get the path to the local cache directory from within pyensembl.
If I do import pyensembl, I was expecting to see something like pyensembl.local_cache_path which would give me the cache directory.
You can access it for a particular Genome object. The dir is currently a little hidden away since the path is determined indirectly through the datacache package (which creates a subdirectory of appdirs.user_cache_dir) but can also be overridden by the environment key PYENSEMBL_CACHE_DIR. I can try to simplify this scheme or at least simplify access to the default cache dir.
In case it's useful: for any particular Genome/EnsemblRelease you can access genome.download_cache.cache_directory_path.
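A small sketch of both routes mentioned above, assuming the environment variable is read when a Genome or EnsemblRelease is created (so it has to be set beforehand). The function names set_cache_root and genome_cache_dir are invented for this example; only PYENSEMBL_CACHE_DIR and download_cache.cache_directory_path come from the discussion.

```python
import os


def set_cache_root(path):
    """Override the default cache location for this process.

    Set this before any Genome/EnsemblRelease object is constructed.
    """
    os.environ["PYENSEMBL_CACHE_DIR"] = path
    return os.environ["PYENSEMBL_CACHE_DIR"]


def genome_cache_dir(genome):
    """Return the directory where this genome's files are cached."""
    return genome.download_cache.cache_directory_path
```

With pyensembl installed, genome_cache_dir(EnsemblRelease(92)) would be one way to print the directory in use.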
Many thanks @iskandr , I have coded a way to add new genomes right from bash. However, while doing that I ended up heavily modifying the original code.
Because of the heavy editing, I don't know how the PR would work in this case, but you may test my fork here: https://github.com/rraadd88/pyensembl and see for yourself.
The original interface remains the same. To install a new genome, for example that of yeast:
pyensembl install --reference-name R64-1-1 --release 91 --species saccharomyces_cerevisiae
The following are the key modifications.
- [...] human here and there. That way the module can be used for other species, without human genome info popping up now and then.
- [...] (Species.dspecies) which has all the information the program (and a user) needs to know about which genomes are currently installed. pyensembl list is now just a print(Species.dspecies).
- [...] #FIXME.
For the future, I think the species information contained in the pandas dataframe (Species.dspecies) can be used to import/export a collection of genomes and thereby share it across users (just like an anaconda yml file for virtual environments). Just an idea.
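To make the import/export idea concrete, here is a rough sketch of a portable genome list. It deliberately uses the standard-library csv module on plain dicts rather than the fork's pandas dataframe, and the column names (species, reference_name, release) and function names are assumptions made for this example. The generated commands follow the upstream form pyensembl install --release ... --species ....

```python
import csv
import io

FIELDS = ["species", "reference_name", "release"]


def export_registry(records):
    """Serialise a list of genome records to CSV text that can be shared."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=FIELDS, lineterminator="\n")
    writer.writeheader()
    writer.writerows(records)
    return buf.getvalue()


def import_registry(text):
    """Parse CSV text produced by export_registry back into records."""
    return [
        {
            "species": row["species"],
            "reference_name": row["reference_name"],
            "release": int(row["release"]),
        }
        for row in csv.DictReader(io.StringIO(text))
    ]


def install_commands(records):
    """Turn records into shell commands that would re-install each genome."""
    template = "pyensembl install --release {release} --species {species}"
    return [template.format(**record) for record in records]
```

A user receiving the CSV could run import_registry on it and then execute the lines returned by install_commands to reproduce the same set of genomes.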
Hey @rraadd88,
I like some of the ideas in that branch but I think it would be a very big change for a single PR. Let me think about how to make a smaller sequence of PRs and whether I actually want to get rid of the convenience of not having to specify all three of (1) species name (2) reference name (3) Ensembl release.
Also, can you explain more about:
without human genome info popping up now and then.
When does human genome info pop up?
Thanks, Alex
Hi @iskandr ,
Yes. This would be a tricky PR (or series of PRs), if indeed you want to go down this route.
Because I am a regular user of pyensembl and I like how it works, in favour of the PRs, I would argue that
(1) Firstly, the modifications only pertain to the installation process, so the main functionalities of pyensembl would remain the same. I have tested this.
(2) The modifications, such as the usage of a dataframe (Species.dspecies) as a portable registry of genomes, open new ways to import/export and share collections of genomes across users.
and finally
(3) I think explicitly mentioning (1) species name, (2) reference name (assembly) and (3) Ensembl release would make installation of genomes more convenient for the user (if it is a big effort at all in the first place), because then they don't have to look up which assembly and release were automatically downloaded by pyensembl.
Therefore, I would be happy to help with the PRs.
Regarding,
without human genome info popping up now and then.
sorry, I really did not make it clear there. I wanted to say that if the defaults of inputs are set to human (e.g. 1, 2 and 3) and somebody working with another species uses those pieces of code without overriding the defaults, they may find human-related information popping up unexpectedly. I hope I am making it clear. It may not be that important, because I suppose this potential issue would have been considered before assigning defaults to human at those places. However, in my opinion, removing them would make things more strict.
Best, Rohan.
Sorry, it's almost 4 years now, any progress?
The CLI addition of species never happened and pyensembl has been only lightly maintained since I moved to UNC. I'm starting to work on the OpenVax stack again and just merged a PR increasing the number of species supported.
Which one did you need again?
Thanks for that.
These are the species we are looking for:
GENOMES:
- 'Homo sapiens [GRCh38]'
- 'Mus musculus [GRCm39]'
- 'Rattus norvegicus [mRatBN7]'
- 'Danio rerio [GRCz11]'
- 'Drosophila melanogaster [BDGP6]'
Hi,
I'm trying to download the zebrafish ensembl data, but I get an error. When I type:
pyensembl install --release 92 --species danio_rerio
I get:
ValueError: Species not found: danio_rerio
If I check the Ensembl FTP server, the data is of course there: ftp://ftp.ensembl.org/pub/release-92/gtf/danio_rerio/
If there is no easy fix for this, is there a way to download the data manually and put it into the database?
Thanks! Adrian.