There's something off with distance-based clumping.

Distance-based clumping can be run in two ways: with or without collecting the locus around each identified semi index. Both paths apply a shared clumping step, which is optionally followed by joining back to the source summary statistics. This logic assumes that the number of resulting semi indices is the same regardless of whether the locus is collected. Sadly, this is apparently not the case:
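To make the expected invariant concrete, here is a toy model in plain Python. It is only a sketch: the greedy selection rule, the `Variant` type and the function names are assumptions made for illustration, not the library's actual implementation. The point is that if locus collection is a pure add-on to the shared clumping step, the number of semi indices cannot change.

```python
from typing import NamedTuple


class Variant(NamedTuple):
    # Hypothetical minimal record; the real summary statistics carry more fields.
    chromosome: str
    position: int
    p_value: float


def clump(variants, distance):
    """Greedy distance-based clumping (assumed rule): walk variants from most to
    least significant and keep one as a semi index only if no already-kept
    semi index lies within `distance` on the same chromosome."""
    leads = []
    for v in sorted(variants, key=lambda v: v.p_value):
        if all(
            v.chromosome != lead.chromosome
            or abs(v.position - lead.position) > distance
            for lead in leads
        ):
            leads.append(v)
    return leads


def clump_with_locus(variants, distance, locus_distance):
    """Same clumping step, then attach the variants around each semi index."""
    return [
        (
            lead,
            [
                v
                for v in variants
                if v.chromosome == lead.chromosome
                and abs(v.position - lead.position) <= locus_distance
            ],
        )
        for lead in clump(variants, distance)
    ]


toy = [
    Variant("1", 1_000_000, 1e-12),
    Variant("1", 1_200_000, 1e-9),
    Variant("1", 1_900_000, 1e-10),
    Variant("2", 500_000, 1e-8),
]

leads = clump(toy, distance=500_000)
leads_with_locus = clump_with_locus(toy, distance=500_000, locus_distance=250_000)

# Locus collection only decorates the semi indices, so the counts must match.
assert len(leads) == len(leads_with_locus)
```

In this model the join back to the source variants cannot drop a semi index, because every semi index is itself within its own locus window. A count mismatch in the real code therefore points at the join or at a filter applied after it.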

# Initialize session:
session = Session()

# Run parameters:
clumping_window = 500_000
locus_window = 250_000

# Sample GWAS dataset:
sample_dataset = 'gs://open-targets-gwas-summary-stats/studies/GCST000758'
gwas_sumstats = SummaryStatistics.from_parquet(session, sample_dataset)

# The bug:
# Returns 36
(
    gwas_sumstats
    .window_based_clumping(
        distance=clumping_window, 
        with_locus=False, 
    )
    .df.count()
) 

# Returns 33
(
    gwas_sumstats
    .window_based_clumping(
        distance=clumping_window, 
        with_locus=True,
        locus_collect_distance=locus_window
    )
    .df.count()
)
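A possible next step is to find out which three semi indices go missing when the locus is collected. The sketch below works on plain rows (for example, rows collected from the two `.df` results above) and assumes that a semi index is identified by `chromosome` and `position`. Those column names and the helper `missing_semi_indices` are hypothetical and may need adjusting to the actual schema.

```python
def missing_semi_indices(without_locus, with_locus, key=("chromosome", "position")):
    """Return rows present in the run without locus but absent from the run with locus.

    `key` lists the fields assumed to identify a semi index.
    """
    found = {tuple(row[k] for k in key) for row in with_locus}
    return [row for row in without_locus if tuple(row[k] for k in key) not in found]


# Made-up rows for illustration only; not taken from GCST000758.
rows_without_locus = [
    {"chromosome": "1", "position": 1_000_000},
    {"chromosome": "1", "position": 1_900_000},
    {"chromosome": "2", "position": 500_000},
]
rows_with_locus = [
    {"chromosome": "1", "position": 1_000_000},
    {"chromosome": "2", "position": 500_000},
]

dropped = missing_semi_indices(rows_without_locus, rows_with_locus)
print(dropped)
```

Inspecting the dropped rows (their position relative to neighbouring semi indices, or any null fields) should help narrow down which step loses them.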

opentargets/issues #3100