nf-gwas-pipeline
A Nextflow Genome-Wide Association Study (GWAS) Pipeline
Installation
Clone Repository
$ git clone https://github.com/montilab/nf-gwas-pipeline
Initalize Paths to Test Data
We have provided multiple toy datasets for testing the pipeline and
ensuring all paths and dependencies are properly setup. To set the toy
data paths to your local directory, run the following script.
$ cd nf-gwas-pipeline
$ python utils/paths.py
Download Nextflow Executable
Nextflow requires a POSIX compatible system (Linux, OS X, etc.) and Java
8 (or later, up to 11) to be installed. Once downloaded, optionally make
the nextflow file accessible by your $PATH variable so you do not have
to specify the full path to nextflow each time.
$ curl -s https://get.nextflow.io | bash
Quick Start with Docker
We have created a pre-built Docker image with all of the dependencies
installed. To get started, first make sure Docker is
installed. Then pull down the
image onto your local machine.
$ docker pull montilab/gwas:latest
or
Optionally you could build this image yourself from the Dockerfile which
specifies all of the dependencies required. Note: This might take a
while!
$ docker build --tag montilab/gwas:latest .
Run with docker
$ ./nextflow gwas.nf -c gwas.config -with-docker montilab/gwas
Expected Output
N E X T F L O W ~ version 19.04.1
Launching `gwas.nf` [jolly_fermi] - revision: 46311ebd05
-
G W A S ~ P I P E L I N E
================================
indir : <YOUR PATH>/data/
outdir : <YOUR PATH>/results
vcf : <YOUR PATH>/data/toy_vcf.csv
pheno : <YOUR PATH>/data/pheno_file_logistic.csv
snpset : <YOUR PATH>/data/snpset.txt
phenotype : outcome
covars : age,sex,PC1,PC2,PC3,PC4
model : logistic
test : Score
ref : hg19
-
[warm up] executor > local
executor > local (141)
[60/b5b95e] process > qc_miss [100%] 22 of 22 ✔
[11/fa0fbd] process > annovar_ref [100%] 1 of 1 ✔
[8f/25f8fa] process > qc_mono [100%] 22 of 22 ✔
[82/069a6d] process > vcf_to_gds [100%] 22 of 22 ✔
[3e/819e86] process > merge_gds [100%] 1 of 1 ✔
[c3/f23390] process > nullmod_skip_pca_grm [100%] 1 of 1 ✔
[ed/91344b] process > gwas_skip_pca_grm [100%] 22 of 22 ✔
[b4/3aea3e] process > caf_by_group_skip_pca_grm [100%] 22 of 22 ✔
[e2/3c778d] process > merge_by_chr [100%] 22 of 22 ✔
[fe/33ebd4] process > combine_results [100%] 1 of 1 ✔
[8b/2020d3] process > annovar_input [100%] 1 of 1 ✔
[61/3a373f] process > plot [100%] 1 of 1 ✔
[66/6f4246] process > annovar [100%] 1 of 1 ✔
[85/d4266b] process > add_annovar [100%] 1 of 1 ✔
[9e/4fc2fe] process > report [100%] 1 of 1 ✔
Completed at: 15-Oct-2020 17:30:28
Duration : 44.1s
CPU hours : 0.1
Succeeded : 141
Alternative to Docker
If you are running the pipeline on a HPC that does not support docker
(BU’s Shared Computing Cluster), you can load the dependencies and run
the pipeline as follows. (In addition, you need to install following R
packages: SeqArray, GENESIS, Biobase, SeqVarTools, dplyr, SNPRelate,
ggplot2, data.table, reshape2, latex2exp, knitr, EBImage, GenomicRanges,
TxDb.Hsapiens.UCSC.hg19.knownGene, GMMAT, ezknitr)
$ module load R/4.1.1
$ module load vcftools/0.1.16
$ module load bcftools/1.10.2
$ module load plink/2.00a1LM
$ module load annovar/2018apr
$ module load pandoc/2.5
nextflow gwas.nf -c gwas.config
Underlying Structure and Output folder
Inputs and Configuration
Mandatory Input File Formats
1. Phenotype file: csv file
- The first column should be the unique ID for subjects
- Names of the columns and numbers of columns are not fixed
- The group variable is optional but should be a categorical variable
if called
- Longitudinal phenotype file shoud be in long-format
- If the pca_grm process is turned-off, PCs should present in the
phenotype file to be called
example: ./data/pheno_file_linear.csv
./data/pheno_file_logistic.csv
./data/1KG_pheno_linear.csv
./data/1KG_pheno_logistic.csv
./data/1KG_pheno_longitudinal.csv
pheno.dat <- read.csv("data/pheno_file_linear.csv")
kable(head(pheno.dat))
ID |
outcome |
age |
sex |
PC1 |
PC2 |
PC3 |
PC4 |
group |
202578640192_R09C01_202578640192_R09C01 |
-1.1259198 |
53.03908 |
F |
-0.0048 |
0.0211 |
0.0389 |
-0.0168 |
group2 |
202579010063_R05C02_202579010063_R05C02 |
-2.3237168 |
59.39922 |
F |
-0.0383 |
-0.0157 |
0.0061 |
0.0108 |
group2 |
202578650131_R04C02_202578650131_R04C02 |
0.0589976 |
22.27178 |
F |
-0.0356 |
-0.0149 |
-0.0159 |
0.0113 |
group3 |
202582730083_R09C01_202582730083_R09C01 |
0.9995060 |
68.75518 |
M |
0.0079 |
-0.0043 |
-0.0103 |
-0.0257 |
group1 |
202578640258_R03C02_202578640258_R03C02 |
-0.9547252 |
23.48552 |
F |
0.0148 |
-0.0079 |
0.0120 |
0.0058 |
group3 |
202578650131_R05C01_202578650131_R05C01 |
0.5786668 |
11.09063 |
M |
0.0065 |
0.0030 |
-0.0128 |
0.0157 |
group3 |
2. Genotype file: vcf.gz file
- vcf.gz files at least contains the GT column
- The ID column would end up being the snpID in the final output
- vcf.file should contain DS column to use dosages in GWAS (imputed=T)
example: ./data/vcf/vcf_file1.vcf.gz
./data/1KG_vcf/1KG_phase3_subset_chr1.vcf.gz
3. Mapping file: csv file
- Two-column csv file mapping the prefix to the vcf.gz files
- The results for each chromosome will be names be the corresponding
prefix
- NO header
example: ./data/toy_vcf.csv
./data/1KG_vcf.csv
map.dat <- read.csv("./data/toy_vcf.csv", header=F)
kable(head(map.dat))
V1 |
V2 |
chr_1 |
/nf-gwas-pipeline/data/vcf/vcf_file1.vcf.gz |
chr_2 |
/nf-gwas-pipeline/data/vcf/vcf_file2.vcf.gz |
chr_3 |
/nf-gwas-pipeline/data/vcf/vcf_file3.vcf.gz |
chr_4 |
/nf-gwas-pipeline/data/vcf/vcf_file4.vcf.gz |
chr_5 |
/nf-gwas-pipeline/data/vcf/vcf_file5.vcf.gz |
chr_6 |
/nf-gwas-pipeline/data/vcf/vcf_file6.vcf.gz |
Optional Input File Formats
1. SNP set
- Two column txt file seperated by “,”
- First column shoud be chromosome and second column be physical
position with fixed header “chr,pos”
example: ./data/snpset.txt
snp.dat <- fread("./data/snpset.txt")
kable(head(snp.dat))
chr |
pos |
1 |
1165522 |
1 |
1176433 |
1 |
1179532 |
1 |
1188944 |
1 |
1781220 |
2 |
1018108 |
2.Genetic relationship matrix
- A symmetric matrix saved in rds format with both columns being
subjects
- Can be replaced by 2*kinship matrix
grm <- readRDS("./data/grm.rds")
kable(grm[1:5,1:5])
|
HG00110 |
HG00116 |
HG00120 |
HG00128 |
HG00136 |
HG00110 |
1.0332116 |
-0.0179534 |
0.0070812 |
-0.0114037 |
-0.0122968 |
HG00116 |
-0.0179534 |
0.9901158 |
0.1161200 |
-0.0369330 |
-0.0204240 |
HG00120 |
0.0070812 |
0.1161200 |
0.9772376 |
-0.0595185 |
-0.0337373 |
HG00128 |
-0.0114037 |
-0.0369330 |
-0.0595185 |
0.9500809 |
-0.0373967 |
HG00136 |
-0.0122968 |
-0.0204240 |
-0.0337373 |
-0.0373967 |
0.9740444 |
GWAS example
Input file:
1. Phenotype csv
pheno.dat <- read.csv("./data/1KG_pheno_logistic.csv")
kable(head(pheno.dat))
sample.id |
Population |
sex |
outcome |
HG00110 |
GBR |
F |
1 |
HG00116 |
GBR |
M |
1 |
HG00120 |
GBR |
F |
0 |
HG00128 |
GBR |
F |
1 |
HG00136 |
GBR |
M |
0 |
HG00137 |
GBR |
F |
0 |
2. Mapping file
map.dat <- read.csv("./data/1KG_vcf.csv", header=F)
kable(head(map.dat))
V1 |
V2 |
chr_1 |
/nf-gwas-pipeline/data/1KG_vcf/1KG_phase3_subset_chr1.vcf.gz |
chr_2 |
/nf-gwas-pipeline/data/1KG_vcf/1KG_phase3_subset_chr2.vcf.gz |
chr_3 |
/nf-gwas-pipeline/data/1KG_vcf/1KG_phase3_subset_chr3.vcf.gz |
chr_4 |
/nf-gwas-pipeline/data/1KG_vcf/1KG_phase3_subset_chr4.vcf.gz |
chr_5 |
/nf-gwas-pipeline/data/1KG_vcf/1KG_phase3_subset_chr5.vcf.gz |
chr_6 |
/nf-gwas-pipeline/data/1KG_vcf/1KG_phase3_subset_chr6.vcf.gz |
3. Genotype file
See mapping file
Execution:
run with .config file:
nextflow run gwas.nf -c $PWD/configs/gwas_1KG_logistic.config
run with equivalent command:
nextflow run gwas.nf --vcf_list $PWD/data/1KG_vcf.csv --pheno $PWD/data/1KG_pheno_logistic.csv --phenotype outcome --covars sex,PC1,PC2,PC3,PC4 --pca_grm --model logistic --test Score --gwas --group Population --min_maf 0.1 --max_pval_manhattan 0.5 --max_pval 0.05 --ref_genome hg19
Gene-based example
Input file:
1. Phenotype csv
pheno.dat <- read.csv("./data/1KG_pheno_linear.csv")
kable(head(pheno.dat))
sample.id |
Population |
sex |
outcome |
HG00110 |
GBR |
F |
1.2114051 |
HG00116 |
GBR |
M |
1.4196076 |
HG00120 |
GBR |
F |
0.0119097 |
HG00128 |
GBR |
F |
0.6800792 |
HG00136 |
GBR |
M |
-2.3179815 |
HG00137 |
GBR |
F |
-1.4958842 |
2. Mapping file
map.dat <- read.csv("./data/1KG_vcf.csv", header=F)
kable(head(map.dat))
V1 |
V2 |
chr_1 |
/nf-gwas-pipeline/data/1KG_vcf/1KG_phase3_subset_chr1.vcf.gz |
chr_2 |
/nf-gwas-pipeline/data/1KG_vcf/1KG_phase3_subset_chr2.vcf.gz |
chr_3 |
/nf-gwas-pipeline/data/1KG_vcf/1KG_phase3_subset_chr3.vcf.gz |
chr_4 |
/nf-gwas-pipeline/data/1KG_vcf/1KG_phase3_subset_chr4.vcf.gz |
chr_5 |
/nf-gwas-pipeline/data/1KG_vcf/1KG_phase3_subset_chr5.vcf.gz |
chr_6 |
/nf-gwas-pipeline/data/1KG_vcf/1KG_phase3_subset_chr6.vcf.gz |
3. Genotype file
See mapping file
Execution:
run with .config file:
nextflow run gwas.nf -c $PWD/configs/gene_1KG_linear.config
run with equivalent command:
nextflow run gwas.nf --vcf_list $PWD/data/1KG_vcf.csv --pheno $PWD/data/1KG_pheno_linear.csv --phenotype outcome --covars PC1,PC2,PC3,PC4 --pca_grm --model linear --test Score --gene_based --group Population --max_pval 0.01 --ref_genome hg19
GWLA example
Input file:
1. Phenotype csv
pheno.dat <- read.csv("./data/1KG_pheno_longitudinal.csv")
kable(head(pheno.dat))
sample.id |
Population |
sex |
age |
delta.age |
outcome |
HG00110 |
GBR |
F |
46 |
0 |
1.2114051 |
HG00110 |
GBR |
F |
53 |
7 |
3.1471562 |
HG00116 |
GBR |
M |
51 |
0 |
1.4196076 |
HG00116 |
GBR |
M |
57 |
6 |
1.9318303 |
HG00120 |
GBR |
F |
49 |
0 |
0.0119097 |
HG00120 |
GBR |
F |
57 |
8 |
3.1782473 |
2. Mapping file
map.dat <- read.csv("./data/1KG_vcf.csv", header=F)
kable(head(map.dat))
V1 |
V2 |
chr_1 |
/nf-gwas-pipeline/data/1KG_vcf/1KG_phase3_subset_chr1.vcf.gz |
chr_2 |
/nf-gwas-pipeline/data/1KG_vcf/1KG_phase3_subset_chr2.vcf.gz |
chr_3 |
/nf-gwas-pipeline/data/1KG_vcf/1KG_phase3_subset_chr3.vcf.gz |
chr_4 |
/nf-gwas-pipeline/data/1KG_vcf/1KG_phase3_subset_chr4.vcf.gz |
chr_5 |
/nf-gwas-pipeline/data/1KG_vcf/1KG_phase3_subset_chr5.vcf.gz |
chr_6 |
/nf-gwas-pipeline/data/1KG_vcf/1KG_phase3_subset_chr6.vcf.gz |
3. Genotype file
See mapping file
Execution:
run with .config file:
nextflow run gwas.nf -c $PWD/configs/gwla_1KG_linear_slope.config
run with equivalent command:
nextflow run gwas.nf --vcf_list $PWD/data/1KG_vcf.csv --pheno $PWD/data/1KG_pheno_longitudinal.csv --phenotype outcome --covars sex,age,PC1,PC2,PC3,PC4 --pca_grm --model linear --test Score --longitudinal --random_slope delta.age --group Population --min_maf 0.1 --max_pval_manhattan 0.5 --max_pval 0.01 --ref_genome hg19
Help command
you can see explanations for all parameters with the help command:
nextflow gwas.nf --help
N E X T F L O W ~ version 19.04.1
Launching `gwas.nf` [tiny_venter] - revision: c9ded642f7
USAGE:
Mandatory arguments:
--vcf_list String Path to the two-column mapping csv file: id , file_path
--pheno String Path to the phenotype file
--phenotype String Name of the phenotype column
Optional arguments:
--gds_input Logical If true, ignore vcf input, start with GDS files and skip qc_miss, qc_mono, vcf_to_gds steps
--gds_list String Path to the two-column mapping gds file: id , file_path
--outdir String Path to the master folder to store all results
--covars String Name of the covariates to include in analysis model separated by comma (e.g. "age,sex,educ")
--qc Logical If true, run qc_miss(filter genotypes called below max_missing) and qc_mono (drop monomorphic SNPs)
--max_missing Numeric Threshold for qc_miss (filter genotypes called below this value)
--pca_grm Logical If true, run PCAiR (generate PCA in Related individuals) and PCRelate (generate genomic relationship matrix)
--snpset String Path to the two column txt file separated by comma: chr,pos (can only be effective when pca_grm = true)
--grm String Path to the genomic relationship matrix (can only be effective when pca_grm = false)
--model String Name of regression model for gwas: "linear" or "logistic"
--test String Name of statistical test for significance: "Score", "Score.SPA", "BinomiRare" and "CMP" (details see https://rdrr.io/bioc/GENESIS/man/assocTestSingle.html)
--gwas Logical If true, run gwas
--imputed Logical If true, use dosages in regression model (DS columns needed in input vcf files)
--gene_based Logical If true, run aggregate test for genes based on hg19 reference genome
--max_maf Numeric Threshold for maximun minor allele frequencies of SNPs to be aggregated
--method String Name of aggregation test method: "Burden", "SKAT", "fastSKAT", "SMMAT" or "SKATO"
--longitudinal Logical If true, run genome-wide longitudianl analysis
--random_slope String if set to "null", random intercept only model is run; else run random slope and random intercept model
--group String Name of the group variable based on which the allele frequencies in each subgroup is calculated (can be left empty)
--dosage Logical If true, also calculate dosages in addition to allele frequencies (can be very slow with large single gds input)
--min_maf Numeric Threshold for minimun minor allele frequencies of SNPs to include in QQ- and Manhattan-plot
--max_pval_manhattan Numeric Threshold for maximun p-value of SNPs to show in Manhattan-plot
--mac Numeric Threshold for SNPs with minor allele count above to be kept
--max_pval Numeric Threshold for maxumun p-value of SNPs to annotate
--ref_genome String Name of the reference genome for annotation: hg19 or hg38