As a developer running our Platform ETL process, I would like the pipeline to offer an option to skip computation for steps whose results already exist.
Background
The Platform ETL process is a Scala pipeline that uses Spark for data processing.
When working on the Open Targets Platform, whether for a release or for internal development, the pipeline is run many times, either in full or partially (specific steps only).
The ETL configuration offers a mechanism to either overwrite or ignore pre-existing data, but this mechanism is wired at the Spark writer level. The computation therefore always happens, and only the final stage, where results are persisted to storage, is skipped.
Desired functionality
We need a higher-level flag to indicate that computation should be skipped for those steps that already have pre-existing data.
This could be implemented as a flag within a Context Manager concept, which we can later extend with additional features related to the running ETL session.
Tasks
[ ] Implement the concept of an ETL Session Context Manager, whose first option is a flag that controls whether computation must be re-run, e.g. is_step_skip_preexisting, with a default value of false.
[ ] For every step, implement the logic the step needs to evaluate whether pre-existing result data is present. At minimum this could combine two checks: data is present and not empty at the destination location, and that data validates the post-flight contract. Additional checks could be added to increase the accuracy of the decision.
[ ] When is_step_skip_preexisting is false, the step must re-compute its processing, first clearing whatever is at the destination location so that results from several runs are never mixed.
[ ] When is_step_skip_preexisting is true, the step evaluates whether pre-existing results are present and, if they are, skips the whole computation.
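The tasks above could be sketched roughly as follows. This is only an illustration under stated assumptions: `EtlSessionContext`, `EtlStep`, `StepRunner` and their members are hypothetical names, not the pipeline's existing API, and the destination is treated as a local directory, whereas the real pipeline would likely go through whatever storage abstraction it already uses. The sketch also assumes that a step whose pre-existing data turns out to be invalid clears its destination before re-computing.

```scala
import java.nio.file.Path

// Hypothetical session-wide settings; only the flag name comes from the task list.
final case class EtlSessionContext(isStepSkipPreexisting: Boolean = false)

// Hypothetical step abstraction, not the pipeline's actual interface.
trait EtlStep {
  def name: String
  def outputPath: Path
  // Step-specific validation of whatever is currently stored at outputPath.
  def satisfiesPostFlightContract(): Boolean
  def compute(): Unit
}

object StepRunner {
  // True when the destination is a directory holding at least one entry.
  def hasNonEmptyOutput(path: Path): Boolean =
    Option(path.toFile.list).exists(_.nonEmpty)

  // Minimum check from the task list: data present, not empty, contract valid.
  def hasValidPreexistingOutput(step: EtlStep): Boolean =
    hasNonEmptyOutput(step.outputPath) && step.satisfiesPostFlightContract()

  // Recursively removes the destination so results from different runs never mix.
  def clearDestination(path: Path): Unit = {
    val file = path.toFile
    Option(file.listFiles).foreach(_.foreach(child => clearDestination(child.toPath)))
    file.delete()
    ()
  }

  // Returns true if the step was computed, false if it was skipped.
  def run(step: EtlStep, ctx: EtlSessionContext): Boolean =
    if (ctx.isStepSkipPreexisting && hasValidPreexistingOutput(step)) false
    else {
      clearDestination(step.outputPath) // assumption: also clear invalid leftovers
      step.compute()
      true
    }
}
```

Keeping the decision in one place (`StepRunner.run` here) means individual steps only have to supply their own post-flight check, which is where any additional accuracy checks could be added later.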
Acceptance tests
[ ] When is_step_skip_preexisting is true and valid pre-existing data is present, computation should be skipped.
[ ] In any other case, the step computation should be carried out.
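The two acceptance criteria reduce to a small truth table, which could be checked with something like the sketch below. `SkipDecision` and `shouldSkip` are hypothetical names introduced only for illustration.

```scala
object SkipDecision {
  // Hypothetical pure helper: skip only when the flag is set AND valid data exists.
  def shouldSkip(isStepSkipPreexisting: Boolean, hasValidPreexistingData: Boolean): Boolean =
    isStepSkipPreexisting && hasValidPreexistingData

  // (flag, valid data present) -> expected "skip" outcome
  val expectations: Seq[((Boolean, Boolean), Boolean)] = Seq(
    (true, true)   -> true,  // first acceptance test: computation skipped
    (true, false)  -> false, // any other case: computation carried out
    (false, true)  -> false,
    (false, false) -> false
  )

  // True when the helper agrees with every row of the table.
  def allExpectationsHold: Boolean =
    expectations.forall { case ((flag, valid), expected) =>
      shouldSkip(flag, valid) == expected
    }
}
```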