precimed / mixer

Causal Mixture Model for GWAS summary statistics
GNU General Public License v3.0
65 stars 18 forks source link
gwas linkage-disequilibrium population-genetics summary-statistics

Contents

Introduction

This folder contains a Python port of MiXeR, wrapping the same C/C++ core as we previously used from MATLAB. This is work in progress, but eventually it should singificantly improve user experience, as Python allows much simpler installation procedures, makes it less error prone, allows to implement well-documented command-line interface (python mixer.py --help), and provide visualization.

Input data for MiXeR consists of summary statistics from a GWAS, and a reference panel. MiXeR format for summary statistics is compatible with LD Score Regression (i.e. the sumstats.gz files), and for those users who are already familiar with munge_sumstats.py script we recommend to use LD Score Regression pipeline to prepare summary statistics. At the same time, we encourage everyone to take a look our own pipeline for processing summary statistics. For the reference panel we recommend to use 1000 Genomes Phase3 data, pre-processed according to LD Score Regression pipeline, and available for download from LDSC website. Further details are given in Data downloads and Data preparation sections.

Once you have all input data in MiXeR-compatible format you may proceed with running univariate (fit1, test1) and cross-trait (fit2, test2) analyses, as implemented in mixer.py command-line interface. The results will be saved as .json files. To visualize the results we provide a script in python, but we encourage users to write their own scripts that understand the structure of .json files, process the results. Further details are given in Run MiXeR, MiXeR options and Visualize MiXeR results sections.

If you encounter an issue, or have further questions, please create a new issue ticket.

If you use MiXeR software for your research publication, please cite the following paper(s):

MiXeR versions:

Install MiXeR

Singularity containers

Update May 2021: MiXeR v1.3 is now available as a singularity container. See mixer_simu.md and mixer_real.md for usage examples. If you are new to singularity containers, please refer to hello.md to get started.

The steps below provide an alternative setup (without singularity containers).

Prerequisites

If you are an experienced C++ programmer it shouldn't be difficult to compile MiXeR core in MAC or Windows. If you've made it work, please share your changes via pull request.

Hardware requirements

MiXeR software is very CPU intensive. Minimal memory requirement is to have 32 GB of RAM available to MiXeR. MiXeR efficiently uses multiple CPUs. We recommend to run MiXeR on a system with at least 16 physical cores.

Install on Linux using pre-built binaries

Not available yet.

Build from source - Linux

The exact steps depend on your build environment.

Data downloads

Data preparation

Run MiXeR

Univariate analysis

Fit the model:

python3 <MIXER_ROOT>/precimed/mixer.py fit1 \
      --trait1-file SSGAC_EDU_2018_no23andMe_noMHC.csv.gz \
      --out SSGAC_EDU_2018_no23andMe_noMHC.fit.rep${SLURM_ARRAY_TASK_ID} \
      --extract LDSR/1000G_EUR_Phase3_plink/1000G.EUR.QC.prune_maf0p05_rand2M_r2p8.rep${SLURM_ARRAY_TASK_ID}.snps \
      --bim-file LDSR/1000G_EUR_Phase3_plink/1000G.EUR.QC.@.bim \
      --ld-file LDSR/1000G_EUR_Phase3_plink/1000G.EUR.QC.@.run4.ld \
      --lib  <MIXER_ROOT>/src/build/lib/libbgmg.so \

Apply the model to the entire set of SNPs, without constraining to LDSR/w_hm3.justrs:

python3 <MIXER_ROOT>/precimed/mixer.py test1 \
      --trait1-file SSGAC_EDU_2018_no23andMe_noMHC.csv.gz \
      --load-params-file SSGAC_EDU_2018_no23andMe_noMHC.fit.rep${SLURM_ARRAY_TASK_ID}.json \
      --out SSGAC_EDU_2018_no23andMe_noMHC.test.rep${SLURM_ARRAY_TASK_ID} \
      --bim-file LDSR/1000G_EUR_Phase3_plink/1000G.EUR.QC.@.bim \
      --ld-file LDSR/1000G_EUR_Phase3_plink/1000G.EUR.QC.@.run4.ld \
      --lib  <MIXER_ROOT>/src/build/lib/libbgmg.so \

The results will be saved <out_file>.json file. Repeat the above analysis for the second trait (PGC_SCZ_2014_EUR_qc_noMHC.csv.gz).

To visualize the results:

python precimed/mixer_figures.py combine --json PGC_SCZ_2014_EUR_qc_noMHC.fit.rep@.json --out combined/PGC_SCZ_2014_EUR.fit
python precimed/mixer_figures.py one --json PGC_SCZ_2014_EUR.json --out PGC_SCZ_2014_EUR --statistic mean std

Bivariate (cross-trait) analysis

Fit the model:

python3 <MIXER_ROOT>/python/mixer.py fit2 \
      --trait1-file PGC_SCZ_2014_EUR_qc_noMHC.csv.gz \
      --trait2-file SSGAC_EDU_2018_no23andMe_noMHC.csv.gz \
      --trait1-params-file PGC_SCZ_2014_EUR_qc_noMHC.fit.rep${SLURM_ARRAY_TASK_ID}.json \
      --trait2-params-file SSGAC_EDU_2018_no23andMe_noMHC.fit.rep${SLURM_ARRAY_TASK_ID}.json \
      --out PGC_SCZ_2014_EUR_qc_noMHC_vs_SSGAC_EDU_2018_no23andMe_noMHC.fit.rep${SLURM_ARRAY_TASK_ID} \
      --extract LDSR/1000G_EUR_Phase3_plink/1000G.EUR.QC.prune_maf0p05_rand2M_r2p8.rep${SLURM_ARRAY_TASK_ID}.snps \
      --bim-file LDSR/1000G_EUR_Phase3_plink/1000G.EUR.QC.@.bim \
      --ld-file LDSR/1000G_EUR_Phase3_plink/1000G.EUR.QC.@.run4.ld \
      --lib  <MIXER_ROOT>/src/build/lib/libbgmg.so \

Apply the model to the entire set of SNPs, without constraining to LDSR/w_hm3.justrs:

python3 <MIXER_ROOT>/python/mixer.py test2 \
      --trait1-file PGC_SCZ_2014_EUR_qc_noMHC.csv.gz \
      --trait2-file SSGAC_EDU_2018_no23andMe_noMHC.csv.gz \
      --load-params-file PGC_SCZ_2014_EUR_qc_noMHC_vs_SSGAC_EDU_2018_no23andMe_noMHC.fit.rep${SLURM_ARRAY_TASK_ID}.json \
      --out PGC_SCZ_2014_EUR_qc_noMHC_vs_SSGAC_EDU_2018_no23andMe_noMHC.test.rep${SLURM_ARRAY_TASK_ID} \
      --bim-file LDSR/1000G_EUR_Phase3_plink/1000G.EUR.QC.@.bim \
      --ld-file LDSR/1000G_EUR_Phase3_plink/1000G.EUR.QC.@.run4.ld \
      --lib  <MIXER_ROOT>/src/build/lib/libbgmg.so \

Note that these parameters point to the results of univariate analysis for both traits, so those must be generated first. The results will be saved <out_file>.json file.

To visualize the results (where <prefix> is, for example, PGC_SCZ_2014_EUR_qc_noMHC_vs_SSGAC_EDU_2018_no23andMe_noMHC):

python precimed/mixer_figures.py combine --json <prefix>.fit.rep@.json --out <prefix>.fit
python precimed/mixer_figures.py combine --json <prefix>.test.rep@.json --out <prefix>.test
python precimed/mixer_figures.py two --json-fit <prefix>.fit.json --json-test <prefix>.test.json --out <out_file> --statistic mean std

MiXeR options

Run --help commands to list available options and their description.

python3 mixer.py ld --help
python3 mixer.py snps --help
python3 mixer.py fit1 --help
python3 mixer.py test1 --help
python3 mixer.py fit2 --help
python3 mixer.py test2 --help
python3 mixer.py perf --help

Visualize MiXeR results

First step is to average the results across 20 runs with mixer_figures.py combine, which works for univariate (fit1, test1) and bivariate (fit2, test2) runs:

python precimed/mixer_figures.py combine --json <prefix>.fit.rep@.json --out <prefix>.fit

The resulting .json files can be converted to figures and .csv tables via the following commands (one for univariate, two for bivariate; each of these commands accept .json files from fit and test steps).

python precimed/mixer_figures.py one --json <out_file>.json --out <out_file>
python precimed/mixer_figures.py two --json <out_file>.json --out <out_file>
python precimed/mixer_figures.py two --json-fit <out_file_fit>.json --json-test <out_file_test>.json --out <out_file>

For the two command, instead of --json, it is possible to specify --json-fit and --json-test separately. This allows to combine negative log-likelihood plot (available in fit2 only) and QQ plots (available in test2 only). Note that all --json accept wildcards (*) or a list of multiple files. This allows to generate .csv tables containing results from multiple MiXeR runs.

MiXeR results format

MiXeR produces the following results, as described in the original publication.

These output is described in the cross-trait MiXeR publication.

AIC BIC interpretation

.csv files generated by python precimed/mixer_figures.py commands contain AIC (Akaike Information Criterion) and BIC (Bayesian Information Criterion) values. To generate AIC / BIC values you should point mixer_figures.py to json files produced by fit1 or fit2 steps (not those from test1 or test2 steps).

The idea of model selection criteia (both AIC and BIC) is to find whether the input data (in our case the GWAS summary statistics) have enough statistical power to warrant a more complex model - i.e. a model with additional free parameters that need to be optimized from the data. For example, the LDSR model has two free parameters - the slope and an intercept. The univariate MiXeR model has three parameters (pi, sig2_beta and sig2_zero). Naturally, having an additional free parameters allows MiXeR to fit the GWAS data better compared to LDSR model, however it needs to be substantially better to justify an additional complexity. AIC and BIC formalize this trade-off between model's complexity and model's ability to describe input data.

The difference between AIC and BIC is that BIC is a more conservative model selection criterion. Based on our internal use of MiXeR, it is OK discuss the resulsult if only AIC supports MiXeR model, but it's important point as a limitation that the input signal (GWAS) has borderline power to fit MiXeR model.

For the univariate model, AIC / BIC values are described in the cross-trait MiXeR paper (ref). A negative AIC value means that there is not enough power in the input data to justify MiXeR model as compared to LDSR model, and we do do not recommend applying MiXeR in this situation.

For the bivariate model the resulting table contains two AIC, and two BIC values, named as follows: best_vs_min_AIC, best_vs_min_BIC , best_vs_max_AIC, and best_vs_max_BIC. They are explained below - but you may need to develop some intuition to interpret these numbers. Consider taking a look at the figure shown below, containing negative log-likelihood plots. Similar plots were presented in ref, supplementary figure 19. We use such likelihood-cost plots to visualise the performance of the best model vs min and max.

First, let's interpret best_vs_max_AIC value. It uses AIC to compare two models: the best model with polygenic overlap that is shown in the venn diagram (i.e. fitted by MiXeR), versus the max model with maximum possible polygenic overlap given trait's genetic architecture (that is, in the max model the causal variants of the least polygenic trait form a subset of the causal variants in the most polygenic trait). A positive value of best_vs_max_AIC means that best model explains the observed GWAS signal better than max model, despite its additional complexity (here additional complexity comes from the fact that MiXeR has to find an actual value of the polygenic overlap, i.e. estimate the size of the grey area on the venn diagram).

Similarly, best_vs_min_AIC compares the best model versus model with minimal possible polygenic overlap. It is tempting to say that minimal possible polygenic overlap means the same as no shared causal variants, i.e. a case where circles on a venn diagram do not overlap. However, there is a subtle detail in our definition of the minimal possible polygenic overlap in cases where traits show some genetic correlation. Under the assumptions of cross-trait MiXeR model, the presence of genetic correlation imply non-empty polygenic overlap. In our AIC / BIC analysis, we constrain the best model to a specific value of the genetic correlation, obtained by the same procedure as in LDSR model (i.e. assuming an infinitesimal model, as described in ref. In this setting, minimal possible polygenic overlap corresponds to pi12 = rg * sqrt(pi1u * pi2u). best_vs_min_AIC is an important metric - a positive value indicates that the data shows support for existense of the polygenic overlap, beyond the minimal level need to explain observed genetic correlation between traits.

Finally, best_vs_max_BIC and best_vs_max_BIC have the same meaning as the values explained above, but using more stringent BIC criterion instead of AIC.

The last panel on the following figure shows negative log-likelihood plot as a function of polygenic overlap. For the general information about maximum likelihood optimization see here. In our case we plot negative log-likelihood, that's we search for the minimum on the curve (this sometimes is refered to as a cost function, with objective to find parameters that yield the lowest cost).

On the negative log-likelihood plot, the min model is represented by the point furthest to the left, max furthest to the right and best is the lowest point of the curve. (off note, in some other cases log-likelihood plot can be very noisy - then best is still the lowest point, but it doesn't make practical sence as it is just a very noisy estimate - and this is exactly what we want to clarify with AIC / BIC).

The minimum is obtained at n=1.3K shared causal variants - that's our best model, scores at about 25 points (lower is better). A model with least possible overlap has n=0 shared causal variants - that our min model, scored at 33 points. Finally, a model with largest possible overlap has n=4K shared causal variants - that our max model, scores at 50 points. We use AIC (and BIC) to compare best model versus the other models.

GIANT_HEIGHT_2018_UKB_vs_PGC_BIP_2016 json

Upgrade nodes from MiXeR v1.2 to v1.3