related-sciences / ensembl-genes

Extract the Ensembl genes catalog to simple tables

Automate detection & export of new ensembl releases #1

Closed dhimmel closed 3 years ago

dhimmel commented 3 years ago

@cthoyt tweeted:

Why not automate even further? Have it check on a daily basis if Ensembl has been updated since the last release of your artifacts so even if you don’t personally manage this anymore, it can continue on. I was thinking about this a lot lately and have been accumulating scripts for checking database versions in https://github.com/biopragmatics/bioversions. I just added one for ensembl, feel free to rely on that package or deconstruct the parts that are important and include directly in your source

This is a great idea and would reduce future maintenance. Happy to use bioversions for this.

We will need to detect if an output already exists. Should be able to do this by looking at the git branches.
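
A rough sketch of that branch check, assuming the outputs live on branches of the remote: list the remote's branch heads with `git ls-remote --heads` and look for the release's branch. The helper names and the example branch name are hypothetical, not something this repo defines.

```python
import subprocess

HEADS_PREFIX = "refs/heads/"


def parse_remote_branches(ls_remote_output: str) -> set:
    """Extract branch names from the text printed by `git ls-remote --heads`."""
    branches = set()
    for line in ls_remote_output.splitlines():
        # Each line looks like "<sha>\trefs/heads/<branch>"
        _sha, _, ref = line.partition("\t")
        if ref.startswith(HEADS_PREFIX):
            branches.add(ref[len(HEADS_PREFIX):])
    return branches


def output_branch_exists(branch: str, remote: str = "origin") -> bool:
    """Return True if `branch` exists on the remote (needs git and network access)."""
    result = subprocess.run(
        ["git", "ls-remote", "--heads", remote],
        check=True,
        capture_output=True,
        text=True,
    )
    return branch in parse_remote_branches(result.stdout)
```

For example, `output_branch_exists("output/release-104")` would tell the job whether to skip, where `output/release-104` is only a placeholder for whatever naming scheme the export uses.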

Sometimes exports will fail, for example if a release changes the schema. These changes take a non-trivial amount of effort to fix. For this reason I lean towards weekly scheduled jobs, so when this is failing it becomes a weekly and not daily annoyance.

cthoyt commented 3 years ago

@dhimmel thanks for making an issue. I would have done so myself but I was on the run when I tweeted at you. Here's a little more context:

The code you'd need after doing pip install bioversions is:

import bioversions

ensembl_version = bioversions.get_version("ensembl")

This code executes a live request to the Ensembl website and does some HTML parsing/traversal to pick out the version number. This actually runs on a nightly build (along with all of the other version getter functions in Bioversions) that writes to a YAML file on the Bioversions GitHub repository, so you can use this alternative code that doesn't actually rely on Bioversions as a Python dependency:

import requests
import yaml

url = "https://raw.githubusercontent.com/biopragmatics/bioversions/main/docs/_data/versions.yml"
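res = requests.get(url)
res.raise_for_status()  # stop on an HTTP error instead of parsing an error page
res_yaml = yaml.safe_load(res.text)
# Map each database prefix to its most recent release version
versions = {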
    entry["prefix"]: entry["releases"][-1]["version"]
    for entry in res_yaml["database"]
    if "prefix" in entry
}
ensembl_version = versions["ensembl"]

cthoyt commented 3 years ago

Note: I forgot that the single source of truth for the daily updated data is natively stored in JSON at https://raw.githubusercontent.com/biopragmatics/bioversions/main/src/bioversions/resources/versions.json. A better way, one that doesn't rely on a YAML parser, would be:

import requests

url = "https://raw.githubusercontent.com/biopragmatics/bioversions/main/src/bioversions/resources/versions.json"
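res = requests.get(url)
res.raise_for_status()  # stop on an HTTP error instead of parsing an error page
res_json = res.json()
# Map each database prefix to its most recent release version
versions = {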
    entry["prefix"]: entry["releases"][-1]["version"]
    for entry in res_json["database"]
    if "prefix" in entry
}
ensembl_version = versions["ensembl"]

dhimmel commented 3 years ago

https://github.com/related-sciences/ensembl-genes/pull/3 added the JSON request approach to get the latest version. Still haven't created the scheduled CI builds. Slightly dependent on #2.

dhimmel commented 3 years ago

Okay I added scheduled export builds in https://github.com/related-sciences/ensembl-genes/commit/b75c8939252c353c0ada5eeec087a955aafb2991 along with an overwrite option for whether to re-export if an output branch exists.

Both scheduled and dispatch jobs now default to overwrite=false. You must set overwrite=true on a dispatch to overwrite.
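
The skip decision boils down to something like the sketch below. The function name is hypothetical and this is not the code from the commit, just the rule as described: export when no output branch exists yet, or when overwrite is explicitly requested.

```python
def should_export(branch_exists: bool, overwrite: bool = False) -> bool:
    """Decide whether an export job should run.

    Runs when the output branch is missing, or when the caller
    explicitly asks to overwrite an existing one.
    """
    return overwrite or not branch_exists
```

With the default `overwrite=False`, a scheduled job for an already-exported release is skipped, which is why only a manual dispatch with overwrite=true re-exports.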

dhimmel commented 3 years ago

Here are two export CI logs:

  1. One that skips the export: https://github.com/related-sciences/ensembl-genes/runs/3869626535?check_suite_focus=true
  2. One that exports: https://github.com/related-sciences/ensembl-genes/runs/3865777576?check_suite_focus=true