Context
We want to perform locus breaker clumping and SuSiE fine-mapping on harmonised summary statistics coming from the GWAS Catalog.
Implementations
This PR implements:
[x] DAG performing locus_breaker_clumping and study_index generation for GWAS Catalog summary statistics.
[x] DAG for fine-mapping the clumped results.
[x] Next iteration of the GWAS Catalog documentation.
[x] Refactored fine-mapping operator to take into account already fine-mapped loci and a limit on the number of batch jobs to submit.
[!NOTE]
Locus Breaker Clumping performance
The performance of LB clumping was not ideal: the step took ~2h to compute the StudyLocus starting from 69K harmonised summary statistics. See the dataproc job.
This is partly a result of the highly fragmented dataset - see the first spike in nodes, which corresponds to the initial job listing all parquet files in subdirectories.
The number of loci resulting from clumping oscillated around ~440K.
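For intuition, the locus-breaker idea can be sketched in a few lines of plain Python. This is a hedged illustration, not gentropy's implementation: it walks variants sorted by position, starts a new locus whenever the gap to the previous variant exceeds a distance threshold, and keeps only loci containing at least one genome-wide significant variant. The `gap` and `p_threshold` values are illustrative defaults.

```python
def locus_breaker(variants, gap=250_000, p_threshold=5e-8):
    """Sketch of locus-breaker clumping (illustrative, not gentropy's code).

    variants: list of (position, pvalue) tuples, sorted by position.
    Returns a list of (start, end) locus boundaries.
    """
    loci, current = [], []
    for pos, p in variants:
        # A gap larger than the threshold closes the current locus.
        if current and pos - current[-1][0] > gap:
            if any(pv <= p_threshold for _, pv in current):
                loci.append((current[0][0], current[-1][0]))
            current = []
        current.append((pos, p))
    # Flush the final locus if it contains a significant variant.
    if current and any(pv <= p_threshold for _, pv in current):
        loci.append((current[0][0], current[-1][0]))
    return loci

variants = [(100, 1e-9), (50_000, 1e-4), (400_000, 0.2), (900_000, 2e-10)]
print(locus_breaker(variants))  # [(100, 50000), (900000, 900000)]
```

The middle variant at 400kb forms its own window but is dropped because nothing in it reaches significance.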
Running the code from this branch, we were able to fine-map the 441k loci in ~7h.
The fine-mapping works as follows:
list all loci output by the locus breaker
list all log files from previous fine-mapping runs
compute the diff and submit the jobs
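The diff step above can be sketched as set arithmetic over object names. This is an illustrative sketch only - the bucket layout, prefixes, and file extensions are hypothetical, and real listing would go through the storage client rather than in-memory lists.

```python
def loci_to_finemap(locus_objects, log_objects):
    """Return locus ids that have no corresponding fine-mapping log.

    locus_objects / log_objects: object names as returned by a bucket
    listing (hypothetical layout: loci/<id>.parquet and logs/<id>.log).
    """
    loci = {name.rsplit("/", 1)[-1].removesuffix(".parquet") for name in locus_objects}
    done = {name.rsplit("/", 1)[-1].removesuffix(".log") for name in log_objects}
    # Loci without a log are the ones still to be submitted.
    return sorted(loci - done)

locus_objects = ["loci/L1.parquet", "loci/L2.parquet", "loci/L3.parquet"]
log_objects = ["logs/L1.log", "logs/L3.log"]
print(loci_to_finemap(locus_objects, log_objects))  # ['L2']
```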
This approach is not ideal due to the number of Google API calls we need to make when running `list.objects` (knowledge post mortem - see the distribution of calls to the buckets on the 23rd of October). A better solution would be to:
generate the manifests
submit the batch jobs in consecutive order depending on the manifest
cache whether each manifest has already been consumed by a fine-mapping job
This could be implemented as an enhancement in the future.
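A minimal sketch of that manifest-based enhancement, under stated assumptions: loci are chunked into per-batch manifests up front, and a `consumed` flag replaces the repeated object listings. All names and fields here are hypothetical.

```python
def build_manifests(locus_ids, batch_size=100):
    """Chunk loci into batch manifests; each carries a consumed flag."""
    return [
        {"manifest_id": i, "loci": locus_ids[j:j + batch_size], "consumed": False}
        for i, j in enumerate(range(0, len(locus_ids), batch_size))
    ]

def next_manifest(manifests):
    """Return the first unconsumed manifest, marking it consumed."""
    for m in manifests:
        if not m["consumed"]:
            m["consumed"] = True
            return m
    return None  # everything has been fine-mapped

manifests = build_manifests([f"L{i}" for i in range(250)], batch_size=100)
print(len(manifests), next_manifest(manifests)["manifest_id"])  # 3 0
```

In practice the manifests and the consumed flag would live in the bucket (or a small metadata store), so each run makes a handful of API calls instead of listing every locus and log object.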