opentargets / orchestration

Open Targets pipeline orchestration layer
Apache License 2.0

Improvements of the genetics etl for platform integration #57

Open project-defiant opened 13 hours ago

project-defiant commented 13 hours ago

Context

The genetics_etl DAG, described by the image below,

[Image: genetics_etl DAG]

should be possible to execute in two modes:

Here is the list of possible improvements I can see that would make the genetics_etl DAG fit the above conditions:

  1. Extract variant_to_vcf and list_nonannotated_variants as a single Dataproc step

Currently, the variant_to_vcf step uses sources from the ETL, namely:

Currently, the first step, variant_to_vcf, runs as a Google Batch job, and the second step, list_nonannotated_variants, runs as a standalone task (PythonOperator). The two can be merged and run as a single Dataproc step. This would decrease the complexity of the pipeline by removing the Batch job and its configuration, making the pipeline more generic and easier to transfer to the unified pipeline.
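A minimal sketch of what the merged step could look like as a single Dataproc PySpark job payload. The project, cluster, bucket path, and combined step name are all hypothetical placeholders, not values from the current DAG; only the payload field names (`reference`, `placement`, `pyspark_job`) follow the Dataproc job resource.

```python
from typing import Any


def build_combined_step_job(
    project_id: str,
    cluster_name: str,
    main_python_file_uri: str,
    step_args: list[str],
) -> dict[str, Any]:
    """Build a Dataproc PySpark job payload for one combined step.

    The entry point and its arguments are assumptions: they stand in for
    whatever script would run variant_to_vcf and list_nonannotated_variants
    back to back.
    """
    return {
        "reference": {"project_id": project_id},
        "placement": {"cluster_name": cluster_name},
        "pyspark_job": {
            "main_python_file_uri": main_python_file_uri,
            "args": step_args,
        },
    }


# Hypothetical values, for illustration only.
job = build_combined_step_job(
    project_id="my-project",
    cluster_name="genetics-etl-cluster",
    main_python_file_uri="gs://my-bucket/cli.py",
    step_args=["step=variant_to_vcf_and_list_nonannotated"],
)
```

A dict of this shape could then be passed as the `job` argument of `DataprocSubmitJobOperator`, which would replace both the Batch job and the PythonOperator task with one operator.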

  2. Extract the configuration of the steps and merge it into a unified pipeline config powered by Hydra. This would return the steps to the way they were handled before in gentropy (see https://github.com/opentargets/gentropy/tree/v1.7.0/config), but simplified by hosting only the config for the genetics_etl steps.

This would mean that we could store one config per mode in which we run the pipeline:
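To illustrate the "one config per mode" idea, here is a rough sketch using plain dataclasses and a lookup by mode name. It does not use Hydra itself; with Hydra, each entry would more likely be a YAML file selected at launch. The mode names `mode_a` and `mode_b`, the step parameters, and the helper `load_config` are hypothetical placeholders.

```python
from dataclasses import dataclass, field


@dataclass
class StepConfig:
    """Configuration for a single genetics_etl step."""

    name: str
    params: dict[str, str] = field(default_factory=dict)


@dataclass
class PipelineConfig:
    """All step configs for one way of running the pipeline."""

    mode: str
    steps: list[StepConfig] = field(default_factory=list)


# One config per run mode. Mode names and paths are placeholders.
CONFIGS: dict[str, PipelineConfig] = {
    "mode_a": PipelineConfig(
        mode="mode_a",
        steps=[StepConfig("variant_to_vcf", {"output_path": "gs://bucket-a/vcf"})],
    ),
    "mode_b": PipelineConfig(
        mode="mode_b",
        steps=[StepConfig("variant_to_vcf", {"output_path": "gs://bucket-b/vcf"})],
    ),
}


def load_config(mode: str) -> PipelineConfig:
    """Return the pipeline config for a mode, failing loudly on unknown modes."""
    try:
        return CONFIGS[mode]
    except KeyError:
        raise ValueError(f"Unknown mode {mode!r}; expected one of {sorted(CONFIGS)}")


config = load_config("mode_a")
```

Keeping the step list identical across modes and varying only the parameters is one possible design; it keeps the DAG shape independent of the mode.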

project-defiant commented 10 hours ago

@javfg This is a summary of the things we discussed.