openvax / pyensembl

Python interface to access reference genome features (such as genes, transcripts, and exons) from Ensembl
Apache License 2.0

Species not found error #207

Closed a-jacobo closed 9 months ago

a-jacobo commented 6 years ago

Hi,

I'm trying to download the zebrafish ensembl data, but I get an error. When I type:

pyensembl install --release 92 --species danio_rerio

I get:

ValueError: Species not found: danio_rerio

If I check the Ensembl FTP server, the data is of course there: ftp://ftp.ensembl.org/pub/release-92/gtf/danio_rerio/

If there is no easy fix for this, is there a way to download the data manually and put it into the database?

Thanks! Adrian.

iskandr commented 6 years ago

Hey Adrian,

The simplest fix for you is to download the GTF & FASTA files from Ensembl and create a Genome object with the local FASTA & GTF paths. I can also add Zebrafish to the list of supported species, but I'm going to first finish up fixing the Travis unit tests (which currently aren't working).
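A minimal sketch of the local-file route described above, assuming the GTF (and optionally a cDNA FASTA) has already been downloaded from the Ensembl FTP server. The file names and the annotation name are placeholders. The keyword names follow the `Genome` constructor in recent pyensembl releases; the FASTA keyword in particular is an assumption and should be checked against the installed version.

```python
def local_genome_kwargs(gtf_path, reference_name, annotation_name="ensembl_local",
                        transcript_fasta_path=None):
    """Collect keyword arguments for pyensembl.Genome built from local files."""
    kwargs = {
        "reference_name": reference_name,
        "annotation_name": annotation_name,  # free-form label, placeholder here
        "gtf_path_or_url": gtf_path,
    }
    if transcript_fasta_path is not None:
        # Assumption: keyword name as used in recent pyensembl releases.
        kwargs["transcript_fasta_paths_or_urls"] = [transcript_fasta_path]
    return kwargs


def build_local_genome(**kwargs):
    """Create and index a Genome; pyensembl is imported only when called."""
    from pyensembl import Genome
    genome = Genome(**kwargs)
    genome.index()  # parse the GTF and build the local feature database
    return genome


kwargs = local_genome_kwargs(
    gtf_path="Danio_rerio.GRCz11.92.gtf.gz",  # placeholder local path
    reference_name="GRCz11",
)
# With pyensembl installed and the file present:
#   genome = build_local_genome(**kwargs)
```

Keeping the arguments in a plain dict makes it easy to adjust a keyword name if the installed version differs.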

a-jacobo commented 6 years ago

Thanks!

rraadd88 commented 6 years ago

Hi @a-jacobo, here's how I install the genome of yeast (my favourite organism) in pyensembl.

Open this file in a text editor (assuming that you are using Anaconda on a Linux system):

/home/{user}/anaconda/envs/beditor/lib/python3.6/site-packages/pyensembl/species.py

and append genome info in this format.
Note: the parts enclosed in {} are the ones you need to fill in for your own organism.

{speciesname} = Species.register(
    latin_name="{}",
    synonyms=["{}"],
    reference_assemblies={
        "{}": (76, MAX_ENSEMBL_RELEASE),
    })

e.g.

yeast = Species.register(
    latin_name="saccharomyces_cerevisiae",
    synonyms=["yeast"],
    reference_assemblies={
        "R64-1-1": (76, MAX_ENSEMBL_RELEASE),
    })

Hi @iskandr, looking at species.py, it seems to me that adding a new species could be triggered from a bash command. I will send a PR if I manage to do that.

iskandr commented 6 years ago

@rraadd88 PR's for additional species would be extremely welcome!

rraadd88 commented 6 years ago

@iskandr I think I just made a solution for adding species. Before sending a PR, I just want to know how I would get the path to the local cache directory from within pyensembl.

If I do import pyensembl, I was expecting to see something like pyensembl.local_cache_path which would give me the cache directory.

iskandr commented 6 years ago

You can access it for a particular Genome object. The dir is currently a little hidden away since the path is determined indirectly through the datacache package (which creates a subdirectory of appdirs.user_cache_dir), but it can also be overridden by the environment variable PYENSEMBL_CACHE_DIR. I can try to simplify this scheme or at least simplify access to the default cache dir.

iskandr commented 6 years ago

In case it's useful: For any particular Genome/EnsemblRelease you can access genome.download_cache.cache_directory_path.
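As a rough sketch of the two mechanisms mentioned here: the environment variable override and the `download_cache.cache_directory_path` attribute. The helper name `cache_directory_of` and the example directory are made up for illustration, and the stand-in object only mimics the attribute layout so the snippet runs without pyensembl installed.

```python
import os
from types import SimpleNamespace

# Override the default cache location; set this before creating a Genome.
os.environ["PYENSEMBL_CACHE_DIR"] = "/tmp/pyensembl-cache"  # example location


def cache_directory_of(genome):
    """Return the directory where this genome's downloaded files are cached."""
    return genome.download_cache.cache_directory_path


# With pyensembl installed:
#   from pyensembl import EnsemblRelease
#   print(cache_directory_of(EnsemblRelease(92)))

# Stand-in with the same attribute layout, for a dry run without pyensembl:
stand_in = SimpleNamespace(
    download_cache=SimpleNamespace(cache_directory_path="/tmp/pyensembl-cache/example")
)
print(cache_directory_of(stand_in))
```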

rraadd88 commented 6 years ago

Many thanks @iskandr, I have coded a way to add new genomes right from bash. However, while doing that I ended up heavily modifying the original code.

Because of the heavy editing, I don't know how the PR would work in this case, but you can test my fork here: https://github.com/rraadd88/pyensembl and see for yourself. The original interface remains the same. To install a new genome, for example yeast:

pyensembl install --reference-name R64-1-1 --release 91 --species saccharomyces_cerevisiae

The following are the key modifications.

  1. Species synonyms no longer have any effect. I feel that using redundant labels is always problematic. The user now has to provide the exact name of the species; if they don't, the download scripts simply won't work.
  2. I have removed the defaults to human here and there. That way the module can be used for other species without human genome info popping up now and then.
  3. The Species object now has a pandas dataframe attached (Species.dspecies), which holds all the information the program (and a user) needs to know about which genomes are currently installed. pyensembl list is now just a print(Species.dspecies).
  4. I have used "fprint" here and there. I don't know whether it is compatible with Python 2.7.
    • The remaining notable changes are labeled #FIXME.

For the future, I think the species information contained in the pandas dataframe (Species.dspecies) could be used to import/export collections of genomes and thereby share them across users (just like an Anaconda yml file for virtual environments). Just an idea.

iskandr commented 6 years ago

Hey @rraadd88,

I like some of the ideas in that branch but I think it would be a very big change for a single PR. Let me think about how to make a smaller sequence of PRs and whether I actually want to get rid of the convenience of not having to specify all three of (1) species name (2) reference name (3) Ensembl release.

Also, can you explain more about:

without human genome info popping up now and then.

When does human genome info pop up?

Thanks, Alex

rraadd88 commented 6 years ago

Hi @iskandr, yes, this would be a tricky PR (or series of PRs) if you do want to go down this route. As a regular user of pyensembl who likes how it works, I would argue in favour of the PRs that
(1) the modifications only pertain to the installation process, so the main functionality of pyensembl remains the same. I have tested this.
(2) the modifications, such as the use of a dataframe (Species.dspecies) as a portable registry of genomes, open new ways to import/export and share collections of genomes across users.
and finally (3) I think explicitly specifying (1) species name, (2) reference name (assembly) and (3) Ensembl release would make installing genomes more convenient for the user (if it is a big effort in the first place), because then they don't have to look up which assembly and release pyensembl downloaded automatically.

Therefore, I would be happy to help in PRs.

Regarding,

without human genome info popping up now and then.

Sorry, I really did not make that clear. I meant that if the default inputs are set to human (e.g. 1, 2 and 3) and somebody working with another species uses those pieces of code without overriding the defaults, they may find human-related information popping up unexpectedly. I hope that makes it clear. It may not be that important, because I suppose this potential issue was considered before the defaults were set to human in those places. However, in my opinion, removing them would make things stricter.

Best, Rohan.

alanwilter commented 2 years ago

Sorry, it's almost 4 years now, any progress?

iskandr commented 2 years ago

The CLI addition of species never happened and pyensembl has been only lightly maintained since I moved to UNC. I'm starting to work on the OpenVax stack again and just merged a PR increasing the number of species supported.

Which one did you need again?

alanwilter commented 2 years ago

Thanks for that.

These are the species we are looking for:

GENOMES:
  - 'Homo sapiens [GRCh38]'
  - 'Mus musculus [GRCm39]'
  - 'Rattus norvegicus [mRatBN7]'
  - 'Danio rerio [GRCz11]'
  - 'Drosophila melanogaster [BDGP6]'
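For illustration only, labels in this form can be split into a species name and an assembly name. The sketch below assumes the lowercase, underscore-separated naming used earlier in the thread (`danio_rerio`); the function `parse_genome_label` is a hypothetical helper, not part of pyensembl.

```python
import re

GENOMES = [
    "Homo sapiens [GRCh38]",
    "Mus musculus [GRCm39]",
    "Rattus norvegicus [mRatBN7]",
    "Danio rerio [GRCz11]",
    "Drosophila melanogaster [BDGP6]",
]


def parse_genome_label(label):
    """Split 'Genus species [Assembly]' into (latin_name, assembly)."""
    match = re.fullmatch(r"\s*(.+?)\s*\[(.+?)\]\s*", label)
    if match is None:
        raise ValueError("Unrecognized genome label: %r" % label)
    # Ensembl-style species name: lowercase with underscores.
    latin_name = match.group(1).lower().replace(" ", "_")
    return latin_name, match.group(2)


parsed = [parse_genome_label(label) for label in GENOMES]
for latin_name, assembly in parsed:
    print(latin_name, assembly)
```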