This task is under the context of the unification of the Gentropy and Platform data pipelines (#3394). It goes along with #3349.
Description of the work
The main idea is to adapt PIS so that it can run alongside Gentropy; in particular, it should be runnable inside an Airflow DAG. This will require many changes, detailed below, and it is also a good opportunity to modernize the application.
Current state of PIS
Currently, PIS is a sequential tool that runs step by step (drug entity, target entity, etc.). Within each step, the files are also downloaded sequentially.
Its codebase has aged and its dependencies are difficult to update. It is built around Yapsi, a plugin management toolkit whose latest version dates from 2019. This makes it almost impossible to move to the latest Python (3.12), as Yapsi depends on distutils, which is deprecated and has now been removed.
No validation is performed on the downloads.
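As an illustration of what the missing validation could look like, here is a minimal checksum check (the function name and parameters are illustrative, not PIS's actual API):

```python
import hashlib
from pathlib import Path


def validate_download(path: Path, expected_sha256: str) -> None:
    """Compare a downloaded file's SHA-256 digest against the expected one."""
    digest = hashlib.sha256(path.read_bytes()).hexdigest()
    if digest != expected_sha256:
        # Raising here makes a corrupt or truncated download fail the step
        # instead of silently propagating bad data downstream.
        raise ValueError(f"checksum mismatch for {path}: got {digest}")
```

For sources that publish no checksum, a weaker check (non-zero size, expected content type) would still catch truncated downloads.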
Many steps perform transformation tasks that are out of scope for a tool that downloads files.
Some errors are not handled properly: for example, the error in the function that finds the latest file in a bucket. We recently hit a problem where the output was incorrect because that error did not stop the execution flow, so the process completed with missing files.
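The fix is to make the lookup fail fast rather than return a sentinel that downstream code ignores. A minimal sketch (the function name and the list-of-names signature are hypothetical; in PIS the names would come from the storage client):

```python
def find_latest_file(blob_names: list[str]) -> str:
    """Return the newest file name, raising instead of returning None.

    Assumes names carry a sortable date stamp (e.g. ``export-2024-05-01``),
    so the lexicographic maximum is the most recent file.
    """
    if not blob_names:
        # Stop the execution flow here: a missing file must abort the step,
        # not let the process complete with partial output.
        raise FileNotFoundError("no files found under the given bucket prefix")
    return max(blob_names)
```

With this shape, the caller either gets a valid name or the step aborts, which matches the atomic fail-completely behaviour described below.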
Tests are very sparse, coverage is low, and the way many methods are implemented makes them extremely hard to test.
Configuration is spread across various places: config.yaml, the logger config, Terraform variables for the deployment, some variables in the Makefile, others in the scripts, and the profiles.
Desired state
The minimum we need is to be able to launch PIS inside Airflow. This is easy to achieve by just cramming everything into a Cloud Run instance and hoping for the best, but even that requires some refactoring to pass credentials in a safe way and to fix some errors.
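One safe-credentials pattern that fits a Cloud Run instance is to take the service-account key location from the environment instead of baking it into config files. `GOOGLE_APPLICATION_CREDENTIALS` is the standard variable the GCP client libraries read; the helper below is only a sketch of failing fast when it is absent:

```python
import os


def credentials_path() -> str:
    """Read the service-account key location from the environment, failing fast."""
    path = os.environ.get("GOOGLE_APPLICATION_CREDENTIALS")
    if not path:
        # Aborting at startup is safer than discovering mid-run that
        # authenticated downloads cannot be performed.
        raise RuntimeError("GOOGLE_APPLICATION_CREDENTIALS is not set")
    return path
```

In an Airflow deployment the variable would be injected by the operator or the runtime's secret mechanism, so no credential ever lives in the repository.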
Since we expect to join the platform data pipeline with Gentropy's, there are a few more requirements. In particular, PIS should become a tool that runs a step atomically: if the step fails, it must fail completely. This change requires a major refactor of the application, and while we are at it, we can define a set of requirements:
Single source of truth for the step definition and configuration of the application
Keep the plugin-like nature of the application without depending on Yapsi
Better structure for the step definition using common building blocks (download a file, find the latest file, etc)
Parallel downloads
Use modern Python
Improve logging
Limit external dependencies
Enforce standards for linting and formatting
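The parallel-downloads and atomicity requirements combine naturally: run a step's downloads concurrently and let any single failure abort the whole step. A minimal sketch with the standard library (the `run_step`/`fetch` names are illustrative, not the actual PIS interface):

```python
from concurrent.futures import ThreadPoolExecutor


def run_step(resources: list, fetch, workers: int = 8) -> list:
    """Download all resources of a step in parallel; any failure aborts the step."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = [pool.submit(fetch, r) for r in resources]
        # future.result() re-raises the worker's exception, so one failed
        # download makes the whole step fail instead of completing partially.
        return [f.result() for f in futures]
```

Threads are a reasonable default here because the work is I/O-bound; the step either returns every result or raises, which is the atomic behaviour required for orchestration from Airflow.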
Tasks and current status
[x] Figure out how to split the steps into common tasks
[x] Design the new application structure
[x] Decide on external library dependencies
[x] Define a scaffold with linting, formatting, debugger settings
[x] Packaging
[x] Implement the core
[x] Implement tasks
[x] Implement validator
[x] Unit tests
[x] Add docstrings and update README
[x] Figure out the optimal way to launch PIS instances
The new PIS is now ready in a fork. All that remains is to merge it back into the original repo, change the image targets, and update the links in the README.