This PR unifies the existing DAGs (except genetics_etl) so that they reuse the generate_dag logic already implemented for genetics_etl, which builds the topology of a DAG from a configuration file.
This streamlines dependency management and makes the dependencies between DAG steps easier to understand.
The previous implementation spread configuration across multiple files in the config directory. Configuration was therefore not isolated per DAG, and understanding a process end to end required digging through nested config structures and DAG code.
Merging the configuration of multiple gentropy steps into a single entity, the dag config, stored under src/ot_orchestration/dags/config/*.yaml, should make each process more readable and explicit.
With nodes and their prerequisites declared in the dag config, in most cases one can skip reading the logic of the DAG itself and focus on the process definition maintained in the dag config.
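To illustrate the idea of deriving a DAG topology from nodes and prerequisites, here is a minimal sketch. It is not the generate_dag implementation from this repository: the config schema (`nodes`, `id`, `prerequisites`), the node names, and the helper functions `build_topology` and `execution_order` are all hypothetical, and the config is written as a Python dict standing in for what a YAML loader would return.

```python
from graphlib import TopologicalSorter

# Hypothetical dag config (assumed schema, not the repository's actual one),
# shown as the dict that loading a YAML file would produce.
dag_config = {
    "nodes": [
        {"id": "ingest", "prerequisites": []},
        {"id": "harmonise", "prerequisites": ["ingest"]},
        {"id": "qc", "prerequisites": ["harmonise"]},
        {"id": "summary", "prerequisites": ["ingest", "qc"]},
    ]
}


def build_topology(config: dict) -> dict[str, set[str]]:
    """Map each node id to the set of node ids it depends on."""
    topology = {
        node["id"]: set(node.get("prerequisites", []))
        for node in config["nodes"]
    }
    known = set(topology)
    # Fail early if a prerequisite refers to a node that is not defined.
    for node_id, prerequisites in topology.items():
        missing = prerequisites - known
        if missing:
            raise ValueError(
                f"{node_id} depends on undefined nodes: {sorted(missing)}"
            )
    return topology


def execution_order(config: dict) -> list[str]:
    """Return node ids ordered so that prerequisites always come first."""
    return list(TopologicalSorter(build_topology(config)).static_order())


if __name__ == "__main__":
    print(execution_order(dag_config))
```

In a real DAG the same mapping would be used to wire task dependencies rather than to print an order; the point of the sketch is only that the whole dependency structure can be read off the config.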
Things implemented:
Refactoring of ukb_ppp_eur_harmonisation DAG
Refactoring of gwas_curation_update DAG
Refactoring of gwas_catalog_preprocess DAG
Refactoring of gnomad_ingestion DAG
Deprecation of gwas_catalog_harmonisation DAG -> its content is being developed in the gwas_catalog_pipeline DAG
Refactoring of finngen_ukb_meta_harmonisation DAG
Refactoring of finngen_ingestion DAG + addition of extra parameter sample_size
Refactoring of eqtl_ingestion DAG
New set of tests for utils
Refactoring of dataproc related functions
Fixes to development process:
Allow the shell to be inferred from an environment variable, so that running make dev no longer populates the bashrc file with junk lines
Remove sourcing of poetry shell as default from the setup script
Fix the duplicate pre-commit call that caused pre-commit to run twice
Context
This PR closes https://github.com/opentargets/orchestration/issues/27