projectglow / glow

An open-source toolkit for large-scale genomic analysis
https://projectglow.io
Apache License 2.0

Glow integration test with multitask jobs and git repos integration #491

Closed williambrandler closed 2 years ago

williambrandler commented 2 years ago

Signed-off-by: William Brandler <william.brandler@databricks.com>

What changes are proposed in this pull request?

The Glow continuous integration notebook tests currently time out after five hours because each notebook runs sequentially on a new cluster.

Workflows with multiple tasks

By orchestrating the workflow with multiple tasks and cluster reuse, the integration test finishes in less than one hour. The multitask jobs were defined manually in the Databricks UI and exported as JSON to docs/dev/multitask-integration-test-config.json.

Important: when you export a JSON from multitask jobs, remove settings{ } from the JSON to avoid this error: "error_code":"INVALID_PARAMETER_VALUE","message":"Job settings must be specified."
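As an illustration of the note above, here is a minimal Python sketch of stripping the wrapper. It assumes the note means that the exported file nests the job definition under a top-level "settings" key and that the contents of that key should be lifted to the top level. The function name `unwrap_job_settings` and the sample job below are hypothetical and are not taken from the Glow repository.

```python
import json


def unwrap_job_settings(exported: dict) -> dict:
    """Return the job definition with the outer "settings" wrapper removed.

    Assumption: the exported JSON looks like {"job_id": ..., "settings": {...}}.
    """
    # If there is no "settings" key, assume the file is already unwrapped.
    if "settings" not in exported:
        return dict(exported)
    settings = exported["settings"]
    if not isinstance(settings, dict):
        raise ValueError('"settings" must be a JSON object')
    # Keep only the contents of "settings"; sibling keys such as "job_id" are dropped.
    return dict(settings)


# Hypothetical exported job, for illustration only.
exported_job = {
    "job_id": 123,
    "settings": {
        "name": "example-multitask-job",
        "tasks": [{"task_key": "example_task"}],
    },
}

print(json.dumps(unwrap_job_settings(exported_job), indent=2))
```

To apply this to a file on disk, load it with `json.load`, pass the result through the function, and write it back with `json.dump`.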

GitHub CI/CD integration with Repos

Notebooks are now synced directly from the Glow GitHub repository using Repos, rather than being uploaded with the Databricks CLI into a temporary directory. This will make it easier in the future to integrate with Terraform and deploy entire Databricks workspaces with Glow pipelines predefined and ready to run.

With this new setup, notebook tests will run on your branch of your fork.
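The PR does not show how the CI job points a Databricks Repo at a given branch. One possible approach, sketched here, is the Databricks Repos REST API, which accepts a PATCH to `/api/2.0/repos/{repo_id}` with a `branch` field. The host, token, repo ID, and branch values are placeholders, and the helper name `build_repo_update_request` is hypothetical. The sketch only builds the request and does not send it.

```python
import json
import urllib.request


def build_repo_update_request(host: str, token: str, repo_id: int, branch: str) -> urllib.request.Request:
    """Build (but do not send) a request that checks out `branch` in a Databricks Repo."""
    url = f"{host.rstrip('/')}/api/2.0/repos/{repo_id}"
    body = json.dumps({"branch": branch}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        method="PATCH",
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
    )


# Placeholder values for illustration; none of these come from the PR.
request = build_repo_update_request(
    host="https://example.cloud.databricks.com/",
    token="PLACEHOLDER_TOKEN",
    repo_id=42,
    branch="my-feature-branch",
)
print(request.get_method(), request.full_url)
```

Sending the request would be a matter of passing it to `urllib.request.urlopen` with a real workspace URL and access token.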

Note: this new integration test does not yet include all notebooks in the Glow repository (only 20 of 36). It uses four different cluster configurations (see screenshots below).

Future work: Some of the notebooks in the repository are now redundant and will be removed in the future. Other notebooks, such as VEP and liftOver, will be added to the integration test.

The next step will be to optimize the cluster configurations and workflow for UK Biobank scale test data, first with one phenotype and then with multiple phenotypes.

[Screenshot: Screen Shot 2022-02-25 at 4 42 15 PM]
[Screenshot: Screen Shot 2022-02-25 at 4 42 43 PM]

How is this patch tested?

codecov[bot] commented 2 years ago

Codecov Report

Merging #491 (ea23c3b) into master (5552640) will not change coverage. The diff coverage is n/a.

@@           Coverage Diff           @@
##           master     #491   +/-   ##
=======================================
  Coverage   93.66%   93.66%           
=======================================
  Files          95       95           
  Lines        4875     4875           
  Branches      457      457           
=======================================
  Hits         4566     4566           
  Misses        309      309           
