Implement parallel finemapping computation

tskir commented 6 months ago

This epic tracks progress on implementing a prototype of parallel finemapping computation.

tskir commented 6 months ago

The prototype is currently being developed in this repository. So far, the following runs have been done:

v1, 2024-04-23: Dummy Google Batch with only dependency installation and no business logic payload, successful
v2, 2024-04-23: Test run with 32 rows, successful
v3, 2024-04-23: Run on 5k rows with ±100,000 window, 5564 tasks successful, 3 failed due to https://github.com/opentargets/issues/issues/3297, which is being worked on
v4, 2024-04-24: Run on 1k rows with ±1,000,000 window, successful
v5, 2024-04-25: Run on 17k rows with ±1,000,000 window, 16993 tasks successful, 400 failed due to https://github.com/opentargets/issues/issues/3297 and other reasons
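To put the failure counts above in proportion, here is a minimal sketch that computes the per-run failure rate from the task counts reported for v3 and v5. The function name `failure_rate` and the `runs` dictionary are illustrative only and are not part of the prototype.

```python
def failure_rate(successful: int, failed: int) -> float:
    """Return the fraction of tasks that failed in a run."""
    total = successful + failed
    if total == 0:
        raise ValueError("run has no tasks")
    return failed / total


# Task counts copied from the run list above: (successful, failed).
runs = {
    "v3": (5564, 3),
    "v5": (16993, 400),
}

for name, (ok, bad) in runs.items():
    # Print total tasks and the failure rate as a percentage.
    print(f"{name}: {ok + bad} tasks, {failure_rate(ok, bad):.2%} failed")
```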

I'm currently finalising my investigation & configuration updates following the v5 run.

d0choa commented 6 months ago

@tskir, @ireneisdoomed and I have been working on how to bring your efforts to production.

There is now a Docker image of gentropy that GitHub Actions builds every time we merge something to dev. CI/CD uploads that image, and it is available in the Google Artifact Registry. We have successfully used this image to run the fine-mapping step in Google Batch. You can find the image at this path: europe-west1-docker.pkg.dev/open-targets-genetics-dev/gentropy-app/gentropy:dev.

Irene has an advanced draft of an Airflow DAG that generates the to-do list of loci to finemap and submits the Google Batch job for the pending tasks (incremental pipeline). We have confirmed that we can run 3 finemapping tasks in parallel by parametrising the studyLocusId, following your strategy, but we haven't yet fine-tuned the batch submission based on your findings (machine type, spot machines, parallelism, etc.).
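As a rough illustration of the incremental idea (not the actual DAG code), the to-do list can be thought of as the set of all loci minus the loci that already have results. The sketch below assumes loci are identified by studyLocusId strings; the function name `pending_loci` and the example IDs are hypothetical.

```python
from typing import Iterable


def pending_loci(all_loci: Iterable[str], done_loci: Iterable[str]) -> list[str]:
    """Return the studyLocusIds that still need fine-mapping.

    Input order is preserved and duplicates are dropped, so the result
    can be used directly as a list of per-task parameters.
    """
    done = set(done_loci)
    seen: set[str] = set()
    pending: list[str] = []
    for locus in all_loci:
        if locus not in done and locus not in seen:
            seen.add(locus)
            pending.append(locus)
    return pending


# Hypothetical IDs: "locus_b" already has output, so it is skipped.
todo = pending_loci(["locus_a", "locus_b", "locus_c"], ["locus_b"])
print(todo)
```

Each entry in the resulting list would then correspond to one task in the batch job, which is what makes reruns only pick up loci that have not been processed yet.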

Irene will open a draft PR for the DAG so we can discuss the approach. We will need to parametrise the Google Batch job based on your findings. Hopefully, these efforts will help build a wrapper around your work. I just wanted to mention this so that you can focus on running and maximising performance and not worry too much about the productisation for now.

d0choa commented 6 months ago

@ireneisdoomed (draft) PR for reference https://github.com/opentargets/gentropy/pull/581

ireneisdoomed commented 6 months ago

Yes, I second what David has mentioned. Please have a look at the Missing section. There are things yet to be determined, but most importantly we want to reproduce @tskir's findings in the DAG. Once that is done, I'd like to fine-map the loci in the Alzheimer's study to compare performance between the dockerised approach and the more naive one.

tskir commented 6 months ago

@d0choa @ireneisdoomed Thank you for working on the orchestration part! Following the v6 run and my investigation of its results, I have provided updates in three separate issues: https://github.com/opentargets/issues/issues/3314 for retry policy, https://github.com/opentargets/issues/issues/3315 for resource usage, and https://github.com/opentargets/issues/issues/3316 for run monitoring facilities (this might be useful when you start doing large production runs in Airflow).

Overall, I feel we are very close to a stable run configuration, and we can start migrating the configuration to a Docker + Airflow setup soon.

tskir commented 1 month ago

Everything is done here: the finemapping orchestration has been implemented, first as a draft and then as an increasingly stable and production-ready solution. The final update was merging this pull request into the orchestration repo: https://github.com/opentargets/orchestration/pull/10

opentargets / issues

Implement parallel finemapping computation #3302