opentargets / orchestration

Open Targets pipeline orchestration layer
Apache License 2.0

refactor: genetic dags #26

Closed project-defiant closed 2 months ago

project-defiant commented 2 months ago

Context

This PR closes https://github.com/opentargets/orchestration/issues/27

The aim of this PR is to unify the existing DAGs (except genetics_etl) so that they reuse the generate_dag logic already implemented for genetics_etl, which creates the topology of the DAG from a configuration file.

This process streamlines the dependency management and allows for better understanding of the dependencies between the DAG steps.
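As a rough illustration of building a topology from configuration, the sketch below derives an execution order from a list of nodes and their prerequisites. It is not the generate_dag implementation from this repository: the config layout (`nodes`, `id`, `prerequisites`), the step names and the `execution_order` helper are all assumptions made for the example, and the ordering comes from Python's standard `graphlib` module rather than from Airflow.

```python
from graphlib import TopologicalSorter

# Hypothetical dag config, shaped as it might look after loading a YAML file.
# The keys "nodes", "id" and "prerequisites" are assumptions for this sketch.
dag_config = {
    "nodes": [
        {"id": "ingest", "prerequisites": []},
        {"id": "harmonise", "prerequisites": ["ingest"]},
        {"id": "qc", "prerequisites": ["harmonise"]},
        {"id": "report", "prerequisites": ["harmonise", "qc"]},
    ]
}


def execution_order(config):
    """Return node ids ordered so that every node follows its prerequisites."""
    # Map each node id to the set of node ids it depends on.
    graph = {
        node["id"]: set(node.get("prerequisites", []))
        for node in config["nodes"]
    }
    # static_order() raises graphlib.CycleError if the config contains a cycle.
    return list(TopologicalSorter(graph).static_order())


order = execution_order(dag_config)
print(order)
```

In an Airflow DAG the same prerequisite information would be used to wire task dependencies rather than to produce a flat list, but the principle is the same: the dependency graph is read from the config, not hand-coded in the DAG module.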

The previous implementation had its configuration distributed across multiple files in the config directory. The configuration was therefore not isolated per DAG, and understanding the overall process required heavy lookups into the nested structures of both the configs and the DAG code.

Merging the configuration of multiple gentropy steps and extracting it into a single entity, the dag config, stored under src/ot_orchestration/dags/config/*.yaml, should make each process more readable and explicit. Declaring the nodes and their prerequisites there means that, in most cases, one can skip reading the logic of the DAG itself and focus on the process definition maintained in the dag config.
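Because the process definition lives in one place, it can also be checked on its own. The sketch below is one possible sanity check, assuming a hypothetical config layout with `nodes`, `id` and `prerequisites` keys (not confirmed by this PR): it reports prerequisites that do not match any declared node, which is an easy mistake to make when editing a config by hand.

```python
def unknown_prerequisites(config):
    """Map each node id to the prerequisites that name no declared node."""
    node_ids = {node["id"] for node in config["nodes"]}
    problems = {}
    for node in config["nodes"]:
        # Prerequisites that are not among the declared node ids.
        missing = sorted(set(node.get("prerequisites", [])) - node_ids)
        if missing:
            problems[node["id"]] = missing
    return problems


# Hypothetical config with a typo: "harmonize" instead of "harmonise".
dag_config = {
    "nodes": [
        {"id": "ingest", "prerequisites": []},
        {"id": "harmonise", "prerequisites": ["ingest"]},
        {"id": "qc", "prerequisites": ["harmonize"]},
    ]
}

problems = unknown_prerequisites(dag_config)
print(problems)
```

An empty result means every prerequisite refers to a declared node.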

Things implemented:

  1. Refactoring of ukb_ppp_eur_harmonisation DAG
  2. Refactoring of gwas_curation_update DAG
  3. Refactoring of gwas_catalog_preprocess DAG
  4. Refactoring of gnomad_ingestion DAG
  5. Deprecation of gwas_catalog_harmonisation DAG -> its content is under development in the gwas_catalog_pipeline DAG
  6. Refactoring of finngen_ukb_meta_harmonisation DAG
  7. Refactoring of finngen_ingestion DAG + addition of an extra parameter, sample_size
  8. Refactoring of eqtl_ingestion DAG
  9. New set of tests for utils
  10. Refactoring of dataproc-related functions
  11. Fixes to development process
    • Allow the shell to be inferred from an env variable, so that running make dev no longer populates the bashrc file with junk lines,
    • Remove sourcing of poetry shell as the default from the setup script,
    • Fix the duplicate pre-commit call that caused pre-commit to run twice

project-defiant commented 2 months ago

@javfg ready to review!