opentargets / issues

Issue tracker for Open Targets Platform and Open Targets Genetics Portal
https://platform.opentargets.org https://genetics.opentargets.org
Apache License 2.0

There's something off with distance based clumping. #3100

Closed DSuveges closed 1 year ago

DSuveges commented 1 year ago

Distance based clumping can be executed in two ways: with or without collecting the locus around each identified semi index. Both paths apply a shared clumping step, which is optionally followed by joining back to the source summary statistics. This logic assumes that the number of resulting semi indices is the same regardless of whether the locus is collected. Sadly, this is apparently not the case...

# Initialize session:
session = Session()

# Run parameters:
clumping_window = 500_000
locus_window = 250_000

# Sample GWAS dataset:
sample_dataset = 'gs://open-targets-gwas-summary-stats/studies/GCST000758'
gwas_sumstats = SummaryStatistics.from_parquet(session, sample_dataset)

# The bug:
# Returns 36
(
    gwas_sumstats
    .window_based_clumping(
        distance=clumping_window, 
        with_locus=False, 
    )
    .df.count()
) 

# Returns 33
(
    gwas_sumstats
    .window_based_clumping(
        distance=clumping_window, 
        with_locus=True,
        locus_collect_distance=locus_window
    )
    .df.count()
) 

DSuveges commented 1 year ago

This problem has been solved by removing the region based filtering, which was applied in only one of the two branches, so the two calls could return a different number of semi indices. The filter was originally implemented to drop HLA regions and keep the process performant enough, but this is no longer necessary, so it has been dropped.
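
For illustration, here is a minimal pure-Python sketch of how a filter applied in only one branch leads to mismatched counts. It is not the actual implementation: the greedy `window_clump` helper, the `Variant` tuple, the toy variants, the excluded region coordinates, and the assumption that the filter sat in the `with_locus=True` branch are all hypothetical and chosen only to reproduce the symptom.

```python
from typing import NamedTuple, Optional


class Variant(NamedTuple):
    chromosome: str
    position: int
    p_value: float


def window_clump(variants, distance):
    """Toy greedy clumping: visit variants from most to least significant and
    keep one only if no already-kept lead lies within `distance` bp of it."""
    leads = []
    for v in sorted(variants, key=lambda x: x.p_value):
        if all(
            v.chromosome != lead.chromosome
            or abs(v.position - lead.position) > distance
            for lead in leads
        ):
            leads.append(v)
    return leads


def clump(variants, distance, with_locus=False, locus_distance=0,
          exclude_region: Optional[tuple] = None):
    """Shared clumping step, optionally followed by locus collection.

    Assumption for this sketch: `exclude_region` is applied only in the
    with_locus branch, which is what makes the two branches disagree.
    """
    leads = window_clump(variants, distance)
    if not with_locus:
        return [(lead, []) for lead in leads]
    if exclude_region is not None:
        chrom, start, end = exclude_region
        leads = [
            lead for lead in leads
            if not (lead.chromosome == chrom and start <= lead.position <= end)
        ]
    # Join back to the source variants to collect the locus around each lead.
    return [
        (
            lead,
            [
                v for v in variants
                if v.chromosome == lead.chromosome
                and abs(v.position - lead.position) <= locus_distance
            ],
        )
        for lead in leads
    ]


# Made-up variants; the excluded region is an illustrative placeholder.
toy_variants = [
    Variant("1", 1_000_000, 1e-20),
    Variant("1", 1_200_000, 1e-10),
    Variant("1", 2_000_000, 1e-12),
    Variant("6", 30_000_000, 1e-30),
    Variant("6", 30_100_000, 1e-9),
    Variant("6", 40_000_000, 1e-15),
]
toy_region = ("6", 25_000_000, 35_000_000)

without_locus = clump(toy_variants, 500_000)
with_locus_filtered = clump(toy_variants, 500_000, with_locus=True,
                            locus_distance=250_000, exclude_region=toy_region)
with_locus_unfiltered = clump(toy_variants, 500_000, with_locus=True,
                              locus_distance=250_000)

print(len(without_locus))          # 4
print(len(with_locus_filtered))    # 3, one lead lost to the one-sided filter
print(len(with_locus_unfiltered))  # 4, counts agree once the filter is gone
```

Comparing the counts of the two branches on the same input, as in the last three lines, is also a cheap regression check for this kind of asymmetry.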