opentargets / issues

Issue tracker for Open Targets Platform and Open Targets Genetics Portal
https://platform.opentargets.org https://genetics.opentargets.org
Apache License 2.0

On-demand skipping of ETL step computation on pre-existing results #3090

Open mbdebian opened 9 months ago

mbdebian commented 9 months ago

As a developer running our Platform ETL process, I would like the pipeline to have an option to skip computation for steps whose results already exist.

Background

The Platform ETL process is a Scala pipeline that uses Spark for data processing.

When working on Open Targets Platform, whether for a release or for internal development, the pipeline is run in multiple iterations, either completely or partially (specific steps only).

The ETL configuration offers a mechanism to either overwrite or ignore pre-existing data, but this mechanism is wired at the Spark writer level. As a result, the computation always happens, and only the stage where results are persisted, i.e. written to some storage, is skipped.
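To illustrate the behaviour described above, here is a toy model in plain Python, not Spark or the actual ETL code. The names `compute_step`, `write_output` and `run_step_writer_guard` are hypothetical, and the `"ignore"` mode is only a stand-in for the writer-level setting. It shows that guarding only the write still pays for the computation on every run.

```python
from pathlib import Path
import tempfile

# Counts how many times the (hypothetical) expensive computation runs.
calls = {"compute": 0}


def compute_step() -> str:
    """Stand-in for an expensive step computation."""
    calls["compute"] += 1
    return "result\n"


def write_output(path: Path, data: str, mode: str) -> bool:
    """Writer-level guard: return True only if data was actually written."""
    if path.exists() and mode == "ignore":
        return False  # pre-existing output is left untouched
    path.write_text(data)
    return True


def run_step_writer_guard(path: Path, mode: str) -> bool:
    """The computation runs unconditionally; only persistence is guarded."""
    data = compute_step()
    return write_output(path, data, mode)
```

Running `run_step_writer_guard` twice against the same output path with `mode="ignore"` writes once but computes twice, which is the cost this issue aims to avoid.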

Desired functionality

We need a flag at a higher level of abstraction to indicate that we want to skip computation for those steps that have pre-existing data.

This could be implemented as a flag integrated into a Concept Manager, which we can later extend with additional features related to the running ETL session.
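A minimal sketch of what such a step-level flag might look like, written in plain Python rather than the pipeline's Scala. The `EtlSession` class, its `skip_existing` flag and `run_step` method are assumptions made for illustration, not existing ETL APIs. The point is that the existence check happens before the computation, not at write time.

```python
from dataclasses import dataclass, field
from pathlib import Path
from typing import Callable, List
import tempfile


@dataclass
class EtlSession:
    """Hypothetical session object holding run-wide options."""
    skip_existing: bool = False
    executed: List[str] = field(default_factory=list)
    skipped: List[str] = field(default_factory=list)

    def run_step(self, name: str, output: Path, compute: Callable[[], str]) -> None:
        # Decide before computing: if the output is already there and the
        # flag is set, the computation is never invoked.
        if self.skip_existing and output.exists():
            self.skipped.append(name)
            return
        output.write_text(compute())
        self.executed.append(name)
```

With `skip_existing=True`, a step whose output path already exists is recorded in `skipped` and its `compute` callable is never called. A plain existence check is naive, since a partially written output would also be skipped; a real implementation would probably need a completion marker of some kind, which is left open here.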

Tasks

Acceptance tests

mbdebian commented 3 months ago

@remo87, we'll put this one on hold until we have your performance and cost results from autoscaling the Dataproc cluster. Thanks!