jwr-git / pwcoco

Pair-wise conditional analysis and colocalisation
GNU General Public License v3.0

Pair-Wise Conditional analysis and Colocalisation analysis (PWCoCo)

A C++ implementation of the PWCoCo algorithm, first described by Zheng et al. in their paper "Phenome-wide Mendelian randomization mapping the influence of the plasma proteome on complex diseases".

This tool integrates methods from GCTA-COJO and the coloc R package.

Citation

Please cite our pre-print!

@article {Robinson2022.08.08.503158,
    author = {Robinson, Jamie W and Hemani, Gibran and Babaei, Mahsa Sheikhali and Huang, Yunfeng and Baird, Denis A and Tsai, Ellen A and Chen, Chia-Yen and Gaunt, Tom R and Zheng, Jie},
    title = {An efficient and robust tool for colocalisation: Pair-wise Conditional and Colocalisation (PWCoCo)},
    elocation-id = {2022.08.08.503158},
    year = {2022},
    doi = {10.1101/2022.08.08.503158},
    publisher = {Cold Spring Harbor Laboratory},
    URL = {https://www.biorxiv.org/content/early/2022/08/08/2022.08.08.503158},
    eprint = {https://www.biorxiv.org/content/early/2022/08/08/2022.08.08.503158.full.pdf},
    journal = {bioRxiv}
}

Requirements

Additional Libraries:

These additional libraries are bundled in /include/ at the required versions for the user's convenience.

How to Build

Currently, only Unix and Windows are supported.

Unix

To build on Unix systems, clone this repository and run the following commands from the repository root:

mkdir build
cd build
cmake ..
make

If building on the University of Bristol's HPC, load the module languages/gcc-9.1.0 and ensure this is the only gcc module loaded. Also, if you do not have a cmake module loaded, please load, for example, tools/cmake-3.13.4. This should be all you need to build the program.

Windows

A .sln file is provided for Visual Studio 2019. Be aware that OpenMP may not be enabled by default in Visual Studio and may need to be enabled manually.

How to Use

PWCoCo is a command-line program. Here is a list of accepted flags with a description of each one:

Required

For acceptable formats for these files, please see below.

Optional

PWCoCo uses OpenMP to parallelise some tasks, which can substantially reduce the time required to run. It is advisable to use a compiler that supports OpenMP version 3.0 (which is sadly not yet supported by Visual Studio). Allowing the tool to use more threads should also improve performance, especially with regard to loading the reference data. Loading and operating on the reference panel are the most intensive steps in the tool, so larger panels take longer to parse; in these instances, using more threads helps to limit the impact on performance.

Example

Please see the Wiki for an example. The files to run this example are provided in /data/.

Input

The reference files must be in Plink format (specified using the --bfile flag), that is, a .bed, .bim and .fam file in the same directory with the same name. The file ending does not need to be included in the path given to --bfile.
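Before a long run, it can help to confirm that all three Plink files exist for the prefix that will be passed to --bfile. The following is a minimal sketch; the function name check_bfile is hypothetical and not part of PWCoCo.

```sh
# check_bfile PREFIX: succeed only if PREFIX.bed, PREFIX.bim and PREFIX.fam
# all exist. Hypothetical helper for illustration; not shipped with PWCoCo.
check_bfile() {
    for ext in bed bim fam; do
        if [ ! -f "$1.$ext" ]; then
            echo "missing reference file: $1.$ext" >&2
            return 1
        fi
    done
    return 0
}
```

For example, `check_bfile /path/to/ref_panel && echo "reference files found"`, where /path/to/ref_panel is a placeholder for your own reference prefix.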

There are two ways to provide summary statistic files to the program: the --sum_stats1 and --sum_stats2 flags can each take a path to either a folder or a file. Both cases are explained below. In both cases, the file ending does not matter (so long as the file is readable by PWCoCo), nor does the delimiter: PWCoCo will attempt to determine whether it is a tab, comma or space.

Case 1 - Few analyses

If only a few analyses are needed (fewer than 100, for example), it is more efficient to run PWCoCo separately for each file pair (e.g. exposure vs outcome). In that case, point the --sum_stats1 and --sum_stats2 flags at the individual summary statistic files, as this allows the user to specify flags which speed up the reference data loading (e.g. --chr).
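A per-pair run might be wrapped as in the sketch below. It uses only flags named in this README. The binary name pwcoco, the function name run_pair and the assumption that --chr takes a chromosome number are illustrative, so please check them against the flag list for your build.

```sh
# Path to the PWCoCo binary. "pwcoco" on PATH is an assumption; adjust it
# to wherever your build placed the executable.
PWCOCO="${PWCOCO:-pwcoco}"

# run_pair BFILE STATS1 STATS2 CHR: run one exposure/outcome pair.
# Assumes --chr takes a chromosome number, restricting the reference data
# that has to be loaded.
run_pair() {
    "$PWCOCO" --bfile "$1" --sum_stats1 "$2" --sum_stats2 "$3" --chr "$4"
}
```

It could then be called as `run_pair ref_panel exposure.txt outcome.txt 1`, where all three file names are placeholders.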

Case 2 - Many analyses - BETA FEATURE

If PWCoCo will be analysing many different datasets (more than 100, for example), it may be more efficient to point the --sum_stats1 and --sum_stats2 flags at two separate folders (e.g. one folder for exposure data, another for outcome data). In this case, PWCoCo will look for files with the same names and file endings in the two folders and use these as the first and second summary statistic files. If this is done, PWCoCo will load the entire reference panel into memory (and will therefore require a large amount of RAM) to reuse between analyses.

This is a beta feature and may still be buggy. Please do not use this feature to run real analyses, and report any bugs you may experience.
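To preview which files would be paired in folder mode, the names present in both folders can be listed beforehand. This sketch follows the "same names and file endings" rule described above. The function name list_pairs is hypothetical, and PWCoCo's own matching may differ in its details.

```sh
# list_pairs DIR1 DIR2: print the names of regular files that exist in
# both folders, i.e. the candidates for exposure/outcome pairs.
list_pairs() {
    for f in "$1"/*; do
        [ -f "$f" ] || continue
        name=$(basename "$f")
        [ -f "$2/$name" ] && echo "$name"
    done
    return 0
}
```

For example, `list_pairs exposures/ outcomes/` prints one shared file name per line; both folder names are placeholders.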

Notes and Tips for Efficiency

By far the slowest part of the program is loading, cleaning and preparing the reference data, and this cost grows as more samples or SNPs are added to the reference. PWCoCo therefore offers the option to retain the reference data in memory between analyses, which can save time when conducting many analyses. Even then, analyses may run slowly, in which case we advise bearing the following points in mind:

PWCoCo is built with performance and efficiency in mind and so will be more performant than other, similar methods; however, users who wish to conduct potentially thousands of analyses or more should keep these steps in mind to squeeze even more efficiency out of the tool.

Input File Formats

Phenotype files do not need to be in a particular file format, but their columns must follow this structure:

SNP A1 A2 A1_freq beta se p {n {case}}

Column names do not matter, only the order of the data. The n and case columns are optional. For phenotypes measured in case/control studies, both should be given, where n is the total sample size and case is the number of cases. Including the case column causes the colocalisation to treat the data as cc type (rather than quant, for which only n, the total sample size, is available). These values may also be provided through command-line arguments.
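Because only the column order is checked, a quick structural check before running can catch truncated or malformed rows. The sketch below tests only that every row has between 7 and 9 fields and that all rows have the same width. The function name check_sumstats is hypothetical, and turning commas into spaces is a simplification rather than PWCoCo's own delimiter detection.

```sh
# check_sumstats FILE: fail if any row has fewer than 7 or more than 9
# fields, or if rows differ in width. Commas are converted to spaces first
# so tab-, comma- and space-delimited files are treated alike.
check_sumstats() {
    tr ',' ' ' < "$1" | awk '
        NF == 0 { next }                  # skip blank lines
        first == 0 { first = NF }         # remember the first row width
        NF < 7 || NF > 9 || NF != first {
            printf "line %d: %d fields\n", NR, NF
            bad = 1
        }
        END { exit (bad + 0) }
    '
}
```

For example, `check_sumstats exposure.txt || echo "check the column layout"`, where exposure.txt is a placeholder file name.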

Output File Formats

The program by default will output a file with the ending .coloc which contains the results for each of the colocalisation analyses run:

Dataset1 Dataset2 SNP1 SNP2 nsnps H0 H1 H2 H3 H4 log_abf_all

If a dataset has not been conditioned, its SNP column will contain "unconditioned" instead of a SNP name. The SNP columns correspond to the same-numbered phenotype file, e.g. SNP1 comes from sum_stats1. Please note that output files are not deleted or overwritten between runs: if you run the program twice with the same output file name, results will be appended to the existing output file.
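Since results accumulate in the .coloc file, a small filter can help pick out the rows of interest. The sketch below assumes the file is whitespace-delimited, that dataset names contain no spaces, and that H4 (the 10th column) holds the posterior probability of colocalisation as in the coloc R package. The function name strong_coloc and the default threshold of 0.8 are illustrative choices only.

```sh
# strong_coloc FILE [THRESHOLD]: print rows whose H4 value (10th column)
# is at least THRESHOLD (default 0.8, an arbitrary illustrative cut-off).
# Any header row is skipped, in case appended runs repeat it.
strong_coloc() {
    awk -v thr="${2:-0.8}" \
        '$10 != "H4" && NF >= 11 && ($10 + 0) >= (thr + 0)' "$1"
}
```

For example, `strong_coloc results.coloc 0.9`, where results.coloc is a placeholder for your own output file.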

The program will also output two extra files (ending in .included) listing the SNPs included in the analysis from each dataset. Finally, if the out_cond flag is set to true, the program will output the conditioned results after the conditional analysis in files ending with .cojo:

Chr SNP bp refA freq b se p n freq_geno bC bC_se pC rs12345

The columns freq, b, se and p should be unaltered from the original dataset. The columns suffixed with "C" hold the values after conditional analysis. The final column is named after the lead SNP of the association signal and shows the LD (r²) between that lead SNP and the SNP in each row. The lead SNP is also used in the file name to differentiate the file from other conditional files from the same dataset.
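As an illustration of reading the .cojo layout, the sketch below lists SNPs in high LD with the lead SNP. It assumes a whitespace-delimited file with a single header line and the 14 columns shown above. The function name ld_with_lead and the default r² cut-off of 0.8 are hypothetical choices.

```sh
# ld_with_lead FILE [MIN_R2]: for a .cojo file, print the SNP name, the
# conditional p-value (pC, 13th column) and the r2 with the lead SNP
# (last column) for rows whose r2 is at least MIN_R2 (default 0.8).
ld_with_lead() {
    awk -v min="${2:-0.8}" \
        'NR > 1 && NF >= 14 && ($NF + 0) >= (min + 0) { print $2, $13, $NF }' "$1"
}
```

For example, `ld_with_lead exposure.rs12345.cojo 0.5`, where the file name is a placeholder.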

To Do