Missing ensembl versions

emdann commented 1 year ago

Report

The default Ensembl version in ensembldb is v86, but this doesn't seem to exist in the sql database.

import pooch

PKG_CACHE_DIR = "genomic-annotations"
url_template = "https://bioconductorhubs.blob.core.windows.net/annotationhub/AHEnsDbs/v{version}/EnsDb.{species}.v{version}.sqlite"

local_path = pooch.retrieve(
    url=url_template.format(version="86", species="Hsapiens"),
    known_hash=None,
    path=pooch.os_cache(PKG_CACHE_DIR),
    progressbar=True,
)

Downloading data from 'https://bioconductorhubs.blob.core.windows.net/annotationhub/AHEnsDbs/v86/EnsDb.Hsapiens.v86.sqlite' to file '/home/jovyan/.cache/genomic-annotations/1c849fa5b54f90367dbde8fd3f560e9a-EnsDb.Hsapiens.v86.sqlite'.
---------------------------------------------------------------------------
HTTPError                                 Traceback (most recent call last)
Input In [120], in <cell line: 4>()
      1 import pooch
      2 url_template = "https://bioconductorhubs.blob.core.windows.net/annotationhub/AHEnsDbs/v{version}/EnsDb.{species}.v{version}.sqlite"
----> 4 local_path = pooch.retrieve(
      5         url = url_template.format(version="86", species="Hsapiens"),
      6         known_hash=None,
      7         path=pooch.os_cache(PKG_CACHE_DIR),
      8         progressbar=True,
      9     )

File ~/my-conda-envs/genomic-features/lib/python3.9/site-packages/pooch/core.py:239, in retrieve(url, known_hash, fname, path, processor, downloader, progressbar)
    236 if downloader is None:
    237     downloader = choose_downloader(url, progressbar=progressbar)
--> 239 stream_download(url, full_path, known_hash, downloader, pooch=None)
    241 if known_hash is None:
    242     get_logger().info(
    243         "SHA256 hash of downloaded file: %s\n"
    244         "Use this value as the 'known_hash' argument of 'pooch.retrieve'"
   (...)
    247         file_hash(str(full_path)),
    248     )

File ~/my-conda-envs/genomic-features/lib/python3.9/site-packages/pooch/core.py:803, in stream_download(url, fname, known_hash, downloader, pooch, retry_if_failed)
    799 try:
    800     # Stream the file to a temporary so that we can safely check its
    801     # hash before overwriting the original.
    802     with temporary_file(path=str(fname.parent)) as tmp:
--> 803         downloader(url, tmp, pooch)
    804         hash_matches(tmp, known_hash, strict=True, source=str(fname.name))
    805         shutil.move(tmp, str(fname))

File ~/my-conda-envs/genomic-features/lib/python3.9/site-packages/pooch/downloaders.py:207, in HTTPDownloader.__call__(self, url, output_file, pooch, check_only)
    205 try:
    206     response = requests.get(url, **kwargs)
--> 207     response.raise_for_status()
    208     content = response.iter_content(chunk_size=self.chunk_size)
    209     total = int(response.headers.get("content-length", 0))

File ~/my-conda-envs/genomic-features/lib/python3.9/site-packages/requests/models.py:1021, in Response.raise_for_status(self)
   1016     http_error_msg = (
   1017         f"{self.status_code} Server Error: {reason} for url: {self.url}"
   1018     )
   1020 if http_error_msg:
-> 1021     raise HTTPError(http_error_msg, response=self)

HTTPError: 404 Client Error: The specified blob does not exist. for url: https://bioconductorhubs.blob.core.windows.net/annotationhub/AHEnsDbs/v86/EnsDb.Hsapiens.v86.sqlite

The EnsemblDB class should have a method to check (and return?) available versions.
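
A minimal sketch of what such a check could look like, assuming the blob storage answers HEAD requests with a 404 for missing files (as the GET above did). The names `ensdb_exists` and `available_versions` are hypothetical and not part of genomic-features; the `urlopen` parameter only exists so the check can be exercised without network access.

```python
import urllib.error
import urllib.request

# URL pattern taken from the report above.
URL_TEMPLATE = (
    "https://bioconductorhubs.blob.core.windows.net/annotationhub/AHEnsDbs/"
    "v{version}/EnsDb.{species}.v{version}.sqlite"
)


def ensdb_exists(species, version, urlopen=urllib.request.urlopen):
    """Return True if the EnsDb file for this species/version can be found.

    Hypothetical helper. It sends a HEAD request, so the database itself
    is not downloaded.
    """
    url = URL_TEMPLATE.format(species=species, version=version)
    request = urllib.request.Request(url, method="HEAD")
    try:
        with urlopen(request):
            return True
    except urllib.error.HTTPError as err:
        if err.code == 404:
            return False
        raise  # other HTTP errors are not evidence of a missing file


def available_versions(species, candidates, urlopen=urllib.request.urlopen):
    """Filter candidate release numbers down to those that can be found."""
    return [v for v in candidates if ensdb_exists(species, v, urlopen)]
```

Probing candidates one by one (e.g. `available_versions("Hsapiens", range(80, 110))`) costs one request per release, so a real implementation would probably want to cache the result or query the AnnotationHub metadata instead.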

Version information

bioframe 0.4.1 genomic_annotations 0.0.1 ibis 5.1.0 pandas 2.0.1 pooch v1.7.0 session_info 1.0.0

PIL 9.5.0 asttokens NA backcall 0.2.0 bidict 0.22.1 certifi 2022.12.07 charset_normalizer 3.1.0 cycler 0.10.0 cython_runtime NA dateutil 2.8.2 debugpy 1.6.7 decorator 5.1.1 executing 1.2.0 greenlet 2.0.2 idna 3.4 importlib_metadata NA importlib_resources NA ipykernel 6.14.0 jedi 0.18.2 kiwisolver 1.4.4 matplotlib 3.7.1 mpl_toolkits NA multipledispatch 0.6.0 numpy 1.24.3 packaging 23.1 parso 0.8.3 parsy 2.1 pexpect 4.8.0 pickleshare 0.7.5 pkg_resources NA platformdirs 3.3.0 prompt_toolkit 3.0.38 psutil 5.9.5 ptyprocess 0.7.0 public 3.1.1 pure_eval 0.2.2 pyarrow 11.0.0 pydev_ipython NA pydevconsole NA pydevd 2.9.5 pydevd_file_utils NA pydevd_plugins NA pydevd_tracing NA pygments 2.15.1 pyparsing 3.0.9 pytz 2023.3 regex 2.5.125 requests 2.28.2 rich NA six 1.16.0 sqlalchemy 2.0.10 sqlglot 11.5.7 stack_data 0.6.2 toolz 0.12.0 tornado 6.3 tqdm 4.65.0 traitlets 5.9.0 typing_extensions NA urllib3 1.26.15 wcwidth 0.2.6 xxhash NA zipp NA zmq 25.0.2 zoneinfo NA

IPython 8.4.0 jupyter_client 8.2.0 jupyter_core 5.3.0

Python 3.9.16 | packaged by conda-forge | (main, Feb 1 2023, 21:39:03) [GCC 11.3.0]

Linux-4.15.0-112-generic-x86_64-with-glibc2.31

Session information updated at 2023-04-26 16:00

ivirshup commented 1 year ago

This specific version was not uploaded to AnnotationHub, and the same may be true of other versions. Some possible solutions:

- Keep a list of the versions that are on AnnotationHub and throw an informative error if the requested one isn't among them.
- The sqlite db does exist in the Bioconductor package EnsDb.Hsapiens.v86; we could download that and extract it as a fallback.
- We could try to get the sqlite db uploaded to AnnotationHub.

jorainer commented 1 year ago

AnnotationHub provides EnsDb sqlite databases from Ensembl release 87 on. It would also be possible to create/add older versions, but (to not increase storage demand of the AnnotationHub too much) I would only do that for selected versions, and only if there is a need.
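
With that lower bound, the "informative error" idea from earlier in the thread could be sketched roughly as follows. `check_ensembl_version` and `MIN_ANNOTATIONHUB_RELEASE` are hypothetical names, and the check is only a lower bound: it does not guarantee that every newer release exists for every species.

```python
MIN_ANNOTATIONHUB_RELEASE = 87  # per the comment above; older releases are not hosted


def check_ensembl_version(version):
    """Raise an informative error for releases AnnotationHub does not host.

    Hypothetical helper. Accepts an int or a numeric string and returns
    the release number as an int.
    """
    version = int(version)
    if version < MIN_ANNOTATIONHUB_RELEASE:
        raise ValueError(
            f"Ensembl release {version} is not available on AnnotationHub; "
            f"EnsDb databases are provided from release "
            f"{MIN_ANNOTATIONHUB_RELEASE} on."
        )
    return version
```

Calling this before building the download URL would turn the 404 traceback from the report into a short `ValueError` that names the oldest supported release.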

ivirshup commented 1 year ago

I guess I don't have a specific use-case ATM for accessing other older versions (86 came up since it's used in the ensembldb vignette). Do you have some idea of how often the older versions are used?

It would be nice to have parity in access from python and R.

> but (to not increase storage demand of the AnnotationHub too much)

This might intersect with a conversation I was just having with @lshep on the bioc slack about compression of these files. Is this something you've considered for the sqlite databases?

jorainer commented 1 year ago

Honestly, I don't know which releases are predominantly used. Maybe there is a usage/download log for AnnotationHub (pinging @lshep).

As for compressing the sqlite files: AFAIK R cannot read from gzipped SQLite files, so the files (if compressed) would need to be unzipped locally first (which could be something AnnotationHub does on the fly?).
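
On the Python side, the same "unzip locally first" step could look roughly like this. It is a sketch using only the standard library and assumes a gzip-compressed file whose name ends in `.gz`; `open_gzipped_sqlite` is a hypothetical name.

```python
import gzip
import shutil
import sqlite3
from pathlib import Path


def open_gzipped_sqlite(gz_path, out_dir):
    """Decompress gz_path into out_dir (only once) and open it with sqlite3."""
    gz_path = Path(gz_path)
    out_path = Path(out_dir) / gz_path.with_suffix("").name  # drop ".gz"
    if not out_path.exists():
        # Stream the decompressed bytes to disk instead of loading them in memory.
        with gzip.open(gz_path, "rb") as src, open(out_path, "wb") as dst:
            shutil.copyfileobj(src, dst)
    return sqlite3.connect(out_path)
```

The decompressed copy is kept next to the cache so the cost is paid once per file. pooch also ships a `Decompress` processor that can be passed to `pooch.retrieve`, which may serve the same purpose at download time.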

scverse / genomic-features