opentargets / issues

Issue tracker for Open Targets Platform and Open Targets Genetics Portal
https://platform.opentargets.org https://genetics.opentargets.org
Apache License 2.0

platform-input-support modernization #3382

Closed javfg closed 3 weeks ago

javfg commented 3 months ago

This task is part of the unification of the Gentropy and Platform data pipelines (#3394). It goes along with #3349.

Description of the work

The main idea is to adapt PIS so that it can run alongside Gentropy, inside an Airflow DAG. This requires many changes, detailed below, and is also a good opportunity to modernize the application.

Current state of PIS

Currently, PIS is a sequential tool that runs step by step (drug entity, target entity, etc.). Within each step, the files are also downloaded sequentially.

Its codebase is aged and its dependencies are difficult to update. It is built around Yapsy, a plugin management toolkit whose latest version dates from 2019. This makes it almost impossible to update to the latest Python (3.12), because Yapsy depends on distutils, which was deprecated and has now been removed.

Downloads are not validated.
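
As a rough illustration of what validating a download could involve, here is a minimal sketch that checks a file against an expected SHA-256 digest. It assumes the upstream source publishes a checksum, which is not the case for every input; `validate_download`, `sha256_of` and `ValidationError` are hypothetical names, not part of PIS.

```python
import hashlib
from pathlib import Path


class ValidationError(Exception):
    """Raised when a downloaded file fails validation (hypothetical type)."""


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the hex SHA-256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def validate_download(path: Path, expected_sha256: str) -> None:
    """Raise if the file is missing, empty, or has an unexpected digest."""
    if not path.is_file() or path.stat().st_size == 0:
        raise ValidationError(f"{path} is missing or empty")
    actual = sha256_of(path)
    if actual != expected_sha256.lower():
        raise ValidationError(f"{path}: expected {expected_sha256}, got {actual}")
```

Raising an exception instead of logging a warning means a bad file stops the step rather than slipping into the output.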

Many steps are performing transformation tasks which are not in the scope of a tool that downloads files.

Some errors are not handled properly, for example in the function that finds the latest file in a bucket. We recently found a case where the output was incorrect because that error did not stop the execution flow, so the process completed with missing files.
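
To make the failure mode concrete, here is a minimal sketch of a fail-fast lookup. It is not the actual PIS function: it works on a plain list of object names instead of a real bucket client, and it assumes the names embed an ISO-style date so that the lexicographically greatest match is the latest. `find_latest` and `LatestFileNotFoundError` are hypothetical names.

```python
import re


class LatestFileNotFoundError(Exception):
    """Raised when no object name matches the pattern (hypothetical type)."""


def find_latest(names: list[str], pattern: str) -> str:
    """Return the greatest matching name, or raise if nothing matches.

    Assumes names sort chronologically, e.g. they contain YYYY-MM-DD dates.
    """
    regex = re.compile(pattern)
    matches = [name for name in names if regex.search(name)]
    if not matches:
        # Raising here, rather than returning None, stops the execution flow
        # so the run cannot complete with a missing file.
        raise LatestFileNotFoundError(f"no object matches {pattern!r}")
    return max(matches)
```

The point is only that an empty result surfaces as an exception the caller cannot ignore by accident.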

Tests are very sparse, coverage is low, and the way many methods are implemented makes them extremely hard to test.

Configuration is spread across various places: config.yaml, the logger config, Terraform variables for the deployment, some variables in the Makefile, others in the scripts, and the profiles.

Desired state

The minimum we need is to be able to launch PIS inside Airflow. This is easy to achieve by just cramming everything into a Cloud Run instance and hoping for the best. But even that requires some refactoring to pass credentials safely and to fix some errors.

Since we expect to join the platform data pipeline with Gentropy's, there are a few more requirements. In particular, PIS should become a tool that runs a single step atomically: if it fails, it must fail completely. Changing this requires a major refactor of the application, and while we are at it, we can define a set of requirements.
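
One possible reading of "fail completely", sketched for a local filesystem only: the step writes into a staging directory, which is renamed to the final location only if the step succeeds and is discarded otherwise. This is an assumption about how atomicity could be achieved, not the design of the new PIS, and object storage such as a bucket would need a different mechanism. `run_step_atomically` is a hypothetical helper.

```python
import shutil
import tempfile
from pathlib import Path
from typing import Callable


def run_step_atomically(step: Callable[[Path], None], output_dir: Path) -> None:
    """Run `step` in a staging directory and publish it only on success."""
    if output_dir.exists():
        raise FileExistsError(f"{output_dir} already exists")
    output_dir.parent.mkdir(parents=True, exist_ok=True)
    # Stage next to the final location so the rename stays on one filesystem.
    staging = Path(
        tempfile.mkdtemp(prefix=f".{output_dir.name}-", dir=output_dir.parent)
    )
    try:
        step(staging)
        staging.rename(output_dir)
    except BaseException:
        # Any failure discards the partial output before re-raising.
        shutil.rmtree(staging, ignore_errors=True)
        raise
```

With this shape, a failed step leaves no partial files behind, so a downstream task never sees half of a step's output.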


Tasks and current status

Acceptance tests

  1. We can plug the new PIS into the orchestration layer.
  2. PIS runs properly for all the 18 steps.
  3. The output of running the new PIS is valid.

javfg commented 3 weeks ago

The new PIS is now ready in a fork. All that remains is to merge it back into the original repo, change the image targets and update the links in the README.