opentargets / orchestration

Open Targets pipeline orchestration layer
Apache License 2.0

feat(gwas catalog sumstats): finemapping #51

Closed project-defiant closed 4 weeks ago

project-defiant commented 1 month ago

Context

We want to perform locus breaker clumping and SuSiE finemapping on harmonised summary statistics coming from GWAS Catalog.

Implementations

This PR implements:

[!NOTE] Locus Breaker Clumping performance

The performance of LB clumping was not ideal. The step took ~2h to compute the StudyLocus starting from 69K harmonised summary statistics. See dataproc job. This is partially a result of the dataset being distributed across many subdirectories: see the first spike in nodes, which represents the first job listing all parquet files in subdirectories.

[image]

The number of loci resulting from clumping oscillated around 440K.

Running the code from this branch, we were able to fine-map the 441K loci in 7h.

How the finemapping works:

  1. list all loci output by locus breaker
  2. list all log files from previous finemapping runs
  3. diff the two lists and submit jobs for the loci that remain
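
The three steps above can be sketched as a set difference. This is only an illustration, not the code in this PR: the function names, the bucket paths, and the assumption that a log file shares its base name with the locus it belongs to are all hypothetical.

```python
def locus_id(path: str) -> str:
    """Base name of a path without its extension (hypothetical naming scheme)."""
    name = path.rstrip("/").rsplit("/", 1)[-1]
    return name.rsplit(".", 1)[0]


def loci_to_submit(locus_paths: list[str], log_paths: list[str]) -> list[str]:
    """Return the loci that have no log file from a previous finemapping run."""
    # Assumption: a locus counts as done if a log with the same base name exists.
    done = {locus_id(p) for p in log_paths}
    return sorted(p for p in locus_paths if locus_id(p) not in done)


# Hypothetical listings, standing in for the results of the two list calls.
loci = ["gs://bucket/loci/l1.parquet", "gs://bucket/loci/l2.parquet"]
logs = ["gs://bucket/logs/l2.log"]
remaining = loci_to_submit(loci, logs)  # only l1 is left to submit
```

Both inputs come from full bucket listings, which is where the cost of this approach sits.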

This approach is not ideal because of the number of Google API calls we need to make when running list.objects (knowledge post mortem: see the distribution of the calls in the buckets on 23rd of October). A better solution would be to:

  1. generate the manifests
  2. submit the batch jobs in consecutive order depending on the manifest
  3. cache whether each manifest has already been used by a finemapping job
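
A minimal sketch of the manifest idea, assuming a manifest is simply a fixed-size batch of locus paths and the cache is a small JSON file of used manifest indices. None of these names, the batch size, or the cache format come from this PR; they are placeholders for illustration.

```python
import json
from pathlib import Path


def build_manifests(locus_paths: list[str], batch_size: int) -> list[list[str]]:
    """Split the sorted loci into consecutive batches, one batch per manifest."""
    ordered = sorted(locus_paths)
    return [ordered[i:i + batch_size] for i in range(0, len(ordered), batch_size)]


def _read_used(cache_file: Path) -> set[int]:
    """Load the indices of manifests already used; empty if no cache exists yet."""
    return set(json.loads(cache_file.read_text())) if cache_file.exists() else set()


def pending_manifests(manifests: list[list[str]], cache_file: Path):
    """Return (index, manifest) pairs not yet used, in consecutive order."""
    used = _read_used(cache_file)
    return [(i, m) for i, m in enumerate(manifests) if i not in used]


def mark_used(index: int, cache_file: Path) -> None:
    """Record that a manifest was consumed by a finemapping job."""
    used = _read_used(cache_file)
    used.add(index)
    cache_file.write_text(json.dumps(sorted(used)))
```

With something like this, a rerun only reads the cache file and the manifests instead of listing every locus and every log file again.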

This could be implemented as an enhancement in the future.

project-defiant commented 1 month ago

@addramir ignore the docs for now. I will change them as soon as the DAG is successful.

project-defiant commented 4 weeks ago

@DSuveges thank you!