Context
We want to perform locus breaker clumping and SuSiE fine-mapping on harmonised summary statistics coming from the GWAS Catalog.
Implementations
This PR implements:
[x] DAG performing locus_breaker_clumping and study_index generation for GWAS Catalog summary statistics.
[x] DAG for fine-mapping the clumped results.
[x] Next iteration of the GWAS Catalog documentation.
[x] Refactored fine-mapping operator to take into account already fine-mapped loci and a limit on the number of batch jobs to submit.
[!NOTE]
Locus Breaker Clumping performance
The performance of LB clumping was not ideal: the step took ~2h to compute the StudyLocus starting from 69K harmonised summary statistics. See the dataproc job.
This is partly a result of the highly fragmented dataset - see the first spike in nodes, which corresponds to the initial job listing all parquet files in subdirectories.
The number of loci resulting from clumping oscillated around ~440K.
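For intuition, the locus-breaker idea can be sketched in a few lines of plain Python. This is a hedged illustration, not gentropy's implementation: it walks variants sorted by position, starts a new locus whenever the gap to the previous variant exceeds a distance threshold, and keeps only loci containing at least one genome-wide significant variant. The `gap` and `p_threshold` values are illustrative defaults.

```python
def locus_breaker(variants, gap=250_000, p_threshold=5e-8):
    """Sketch of locus-breaker clumping (illustrative, not gentropy's code).

    variants: list of (position, pvalue) tuples, sorted by position.
    Returns a list of (start, end) locus boundaries.
    """
    loci, current = [], []
    for pos, p in variants:
        # A gap larger than the threshold closes the current locus.
        if current and pos - current[-1][0] > gap:
            if any(pv <= p_threshold for _, pv in current):
                loci.append((current[0][0], current[-1][0]))
            current = []
        current.append((pos, p))
    # Flush the final locus if it contains a significant variant.
    if current and any(pv <= p_threshold for _, pv in current):
        loci.append((current[0][0], current[-1][0]))
    return loci

variants = [(100, 1e-9), (50_000, 1e-4), (400_000, 0.2), (900_000, 2e-10)]
print(locus_breaker(variants))  # [(100, 50000), (900000, 900000)]
```

The middle variant at 400kb forms its own window but is dropped because nothing in it reaches significance.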
Running the code from this branch, we were able to fine-map the 441k loci in ~7h.
The fine-mapping works as follows:
list all loci output by the locus breaker
list all log files from previous fine-mapping runs
compute the diff and submit the jobs
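The diff step above can be sketched as set arithmetic over object names. This is an illustrative sketch only - the bucket layout, prefixes, and file extensions are hypothetical, and real listing would go through the storage client rather than in-memory lists.

```python
def loci_to_finemap(locus_objects, log_objects):
    """Return locus ids that have no corresponding fine-mapping log.

    locus_objects / log_objects: object names as returned by a bucket
    listing (hypothetical layout: loci/<id>.parquet and logs/<id>.log).
    """
    loci = {name.rsplit("/", 1)[-1].removesuffix(".parquet") for name in locus_objects}
    done = {name.rsplit("/", 1)[-1].removesuffix(".log") for name in log_objects}
    # Loci without a log are the ones still to be submitted.
    return sorted(loci - done)

locus_objects = ["loci/L1.parquet", "loci/L2.parquet", "loci/L3.parquet"]
log_objects = ["logs/L1.log", "logs/L3.log"]
print(loci_to_finemap(locus_objects, log_objects))  # ['L2']
```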
This approach is not ideal due to the number of Google API calls we need to make when running `list.objects` (knowledge post mortem - see the distribution of calls to the buckets on the 23rd of October). A better solution would be to:
generate the manifests
submit the batch jobs in consecutive order depending on the manifest
cache whether each manifest has already been consumed by a fine-mapping job
This could be implemented as an enhancement in the future.
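A minimal sketch of that manifest-based enhancement, under stated assumptions: loci are chunked into per-batch manifests up front, and a `consumed` flag replaces the repeated object listings. All names and fields here are hypothetical.

```python
def build_manifests(locus_ids, batch_size=100):
    """Chunk loci into batch manifests; each carries a consumed flag."""
    return [
        {"manifest_id": i, "loci": locus_ids[j:j + batch_size], "consumed": False}
        for i, j in enumerate(range(0, len(locus_ids), batch_size))
    ]

def next_manifest(manifests):
    """Return the first unconsumed manifest, marking it consumed."""
    for m in manifests:
        if not m["consumed"]:
            m["consumed"] = True
            return m
    return None  # everything has been fine-mapped

manifests = build_manifests([f"L{i}" for i in range(250)], batch_size=100)
print(len(manifests), next_manifest(manifests)["manifest_id"])  # 3 0
```

In practice the manifests and the consumed flag would live in the bucket (or a small metadata store), so each run makes a handful of API calls instead of listing every locus and log object.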