As a user, I want to be able to run the ETL pipeline easily, because it currently involves writing configs, building jars and copying files to GCS before hitting run.
Background
The current process is:
checkout a branch for the release
create a config for the ETL
copy the config to GCS
build the ETL jar
copy the ETL jar to GCS
create the config for the workflow jar
build the workflow jar
run the workflow
All of these steps are done manually for each run. The main input variables we need to be able to control are:
platform/data release version e.g. 23.12
chembl version e.g. 33
ensembl version e.g. 110
is public - boolean
datasources to exclude
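As a rough illustration of how these inputs could be captured in one place, here is a minimal sketch of a profile parser. Everything in it is an assumption made for illustration: the `ReleaseProfile` class, the `parse_profile` function, the `KEY=VALUE` file format and the key names are hypothetical and are not taken from the existing PIS/POS profiles or from the ETL config.

```python
from dataclasses import dataclass, field
from typing import Dict, List

# Hypothetical key names; the real profile may use different ones.
REQUIRED_KEYS = ("RELEASE", "CHEMBL_VERSION", "ENSEMBL_VERSION")


@dataclass(frozen=True)
class ReleaseProfile:
    """Hypothetical container for the per-run inputs listed above."""
    release: str            # platform/data release version, e.g. "23.12"
    chembl_version: str     # e.g. "33"
    ensembl_version: str    # e.g. "110"
    is_public: bool = False
    excluded_datasources: List[str] = field(default_factory=list)

    def as_config(self) -> Dict[str, object]:
        """Return the values as a plain dict, ready to be templated into a config."""
        return {
            "release": self.release,
            "chembl_version": self.chembl_version,
            "ensembl_version": self.ensembl_version,
            "is_public": self.is_public,
            "excluded_datasources": list(self.excluded_datasources),
        }


def parse_profile(text: str) -> ReleaseProfile:
    """Parse KEY=VALUE lines (an assumed format) into a ReleaseProfile."""
    values: Dict[str, str] = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        key, sep, value = line.partition("=")
        if not sep:
            raise ValueError(f"expected KEY=VALUE, got: {raw!r}")
        values[key.strip()] = value.strip()

    missing = [k for k in REQUIRED_KEYS if k not in values]
    if missing:
        raise ValueError(f"missing required keys: {', '.join(missing)}")

    # Comma-separated list; empty entries are dropped.
    excluded = [
        name.strip()
        for name in values.get("EXCLUDED_DATASOURCES", "").split(",")
        if name.strip()
    ]
    return ReleaseProfile(
        release=values["RELEASE"],
        chembl_version=values["CHEMBL_VERSION"],
        ensembl_version=values["ENSEMBL_VERSION"],
        is_public=values.get("IS_PUBLIC", "false").lower() == "true",
        excluded_datasources=excluded,
    )
```

Versions are kept as strings so that a value like `23.12` is never reinterpreted as a float, and `is_public` defaults to false in this sketch so that a profile has to opt in to a public run explicitly.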
Tasks
[x] Makefile to run the workflow.
[x] Profile, like the ones in PIS/POS, for capturing the input variables.
Acceptance tests
How do we know the task is complete?