populationgenomics / saige-tenk10k

Hail batch pipeline to run SAIGE on CPG's GCP
MIT License

Hail batch workflow to run SAIGE-QTL on TenK10K data

This is a Hail batch pipeline to run SAIGE-QTL on CPG's GCP, mapping associations between genetic variants (both common and rare) and single-cell gene expression in peripheral blood mononuclear cells (PBMCs). It will first be run on the TOB (AKA OneK1K) and BioHEART datasets as part of phase 1 of the TenK10K project (see Data below); in the future, all datasets within OurDNA with scRNA-seq + WGS data will go through this pipeline as well.

The pipeline is split into three main parts to allow more flexible usage:

  1. Genotype processing (SNVs and indels): this involves sample and variant QC of the WGS data, and preparation of genotype files for common and rare single-nucleotide variants and indels (VCF files), as well as PLINK files for a subset of only 2,000 variants that are used for some approximations within SAIGE-QTL
  2. Expression (phenotype) processing: this involves processing of the scRNA-seq data, inclusion of covariates, and preparation of the phenotype + covariate files (one per gene and cell type) and cis window files (one per gene)
  3. Association testing: prepare and run SAIGE-QTL commands for association mapping using inputs generated in the first two parts.

Two helper scripts are also part of this pipeline:

Genotypes preprocessing

Script: get_genotype_vcf.py

Variant selection for VCF files:

Variant selection for PLINK files for variance ratio estimation (VRE):

Inputs:

Outputs:

Notes: SAIGE-QTL allows numeric chromosomes only, so both the .bim and the VCF files are modified in this script to remove the 'chr' notation (so that e.g., 'chr1' becomes '1').
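The 'chr' removal described in the note above can be sketched as follows. This is a minimal illustration, not the code in get_genotype_vcf.py: the function names are hypothetical, and it assumes tab-delimited .bim lines and plain-text (uncompressed) VCF lines.

```python
def strip_chr_prefix(chrom: str) -> str:
    """Return the chromosome label without a leading 'chr' (e.g. 'chr1' -> '1')."""
    return chrom[3:] if chrom.startswith("chr") else chrom


def rewrite_bim_line(line: str) -> str:
    """Rewrite one PLINK .bim line; the chromosome is the first field.

    Assumes tab-delimited fields (hypothetical helper, not from the repo).
    """
    fields = line.rstrip("\n").split("\t")
    fields[0] = strip_chr_prefix(fields[0])
    return "\t".join(fields)


def rewrite_vcf_line(line: str) -> str:
    """Rewrite one VCF line: ##contig headers and data records lose 'chr'."""
    if line.startswith("##contig=<ID=chr"):
        return line.replace("##contig=<ID=chr", "##contig=<ID=", 1)
    if line.startswith("#"):
        # Other header lines do not carry a chromosome name in column 1.
        return line
    chrom, sep, rest = line.partition("\t")
    return strip_chr_prefix(chrom) + sep + rest
```

Renaming the contig header lines together with the records keeps the VCF header consistent with its body.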

Get sample covariates

Script: get_sample_covariates.py

Inputs:

Outputs:

Notes: there is an option to fill in missing values for sex (with 0, where 1 is male and 2 is female) and age (with the average age across the cohort). The script also adds a user-specified number (default: 10) of permuted IDs, in which the individual ID is shuffled at random, to assess calibration: shuffling the individual IDs disrupts any real association between genotype and phenotype, so no significant associations are expected when testing.
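The imputation and permutation steps in the note above could look roughly like this. It is a sketch using only the standard library; the record keys ("individual", "sex", "age") and the "perm0", "perm1", ... column names are assumptions for illustration and may not match what get_sample_covariates.py produces.

```python
import random
from statistics import mean


def fill_missing(samples, fill_sex=0):
    """Fill missing sex with `fill_sex` and missing age with the cohort mean age."""
    known_ages = [s["age"] for s in samples if s["age"] is not None]
    mean_age = mean(known_ages)
    return [
        {
            **s,
            "sex": fill_sex if s["sex"] is None else s["sex"],
            "age": mean_age if s["age"] is None else s["age"],
        }
        for s in samples
    ]


def add_permuted_ids(samples, n_perms=10, seed=0):
    """Add columns perm0..perm{n_perms-1}, each a random shuffle of the IDs."""
    rng = random.Random(seed)  # seeded so the permutations are reproducible
    ids = [s["individual"] for s in samples]
    out = [dict(s) for s in samples]
    for i in range(n_perms):
        shuffled = ids[:]
        rng.shuffle(shuffled)
        for row, perm_id in zip(out, shuffled):
            row[f"perm{i}"] = perm_id
    return out
```

Each permuted column contains exactly the same set of individual IDs as the original, only reassigned, which is what breaks the genotype-phenotype link while preserving the cohort structure.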

Gene expression preprocessing

Script: get_anndata.py

Inputs:

Outputs:

Notes: as before, we remove 'chr' from the chromosome name in the gene cis window file. Additionally, we turn hyphens ('-') into underscores ('_') in the gene names. Both the AnnData objects and cell covariate files are generated on Garvan's HPC and copied over to GCP.
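As an illustration of the two renaming rules in the note above, the sketch below builds one cis window row. The function names, the row layout, the 100 kb default window and the choice to extend the window from the gene body are all assumptions made for this example; the source does not specify them.

```python
def sanitise_gene_name(gene: str) -> str:
    """Turn hyphens into underscores, e.g. 'HLA-DRB1' -> 'HLA_DRB1'."""
    return gene.replace("-", "_")


def cis_window_row(gene, chrom, gene_start, gene_end, window_size=100_000):
    """Return (gene, chrom, start, end) for a cis window around a gene.

    `window_size` and the tuple layout are hypothetical, chosen for illustration.
    """
    chrom = chrom[3:] if chrom.startswith("chr") else chrom  # 'chr6' -> '6'
    start = max(1, gene_start - window_size)  # clamp at the chromosome start
    end = gene_end + window_size
    return (sanitise_gene_name(gene), chrom, start, end)
```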

Make group file

Script: make_group_file.py

Inputs:

Outputs:

Notes: there is an option either to include no weights or to compute weights that reflect the distance of each variant from the gene's transcription start site (dTSS). One of the flags below additionally allows testing with equal weights. We use no annotations for now (set to null).
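To make the dTSS weighting idea concrete, here is one possible weighting function. The source does not state the formula used by make_group_file.py, so the exponential decay and the `gamma` value below are assumptions chosen purely for illustration, as are the function names.

```python
import math


def dtss_weights(positions, tss, gamma=1e-5):
    """Down-weight variants by absolute distance from the TSS.

    Uses exp(-gamma * |pos - tss|); both the functional form and the
    default `gamma` are hypothetical, not taken from the pipeline.
    """
    return [math.exp(-gamma * abs(pos - tss)) for pos in positions]


def equal_weights(positions):
    """Give every variant the same weight (the 'equal weights' option)."""
    return [1.0] * len(positions)
```

With this form, a variant at the TSS gets weight 1 and the weight decreases monotonically with distance, so nearby variants contribute more to the set-based test.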

SAIGE-QTL association pipeline

Script: saige_assoc.py

Run this for single-variant tests (typically for common variants).

Inputs:

Outputs:

SAIGE-QTL RV association pipeline

Script: saige_assoc_set_test.py

Run this for set-based tests (typically for rare variants).

Inputs:

Outputs:

SAIGE-QTL parameters explanation

This section clarifies the reasoning behind the parameters / flags used to run SAIGE-QTL. Most of these are (or will be) included in the official documentation.

Note: some of these are provided as arguments in the scripts (saige_assoc.py, saige_assoc_set_test.py), but most are provided as a separate config file (saige_assoc_test.toml). TO DO: update styling of flags to reflect this below.

Fit null model (step 1):

In script (saige_assoc.py, saige_assoc_set_test.py, using the standard snake_case naming convention as in the rest of the scripts):

In config (under [saige.build_fit_null] in saige_assoc_test.toml, using the camelCase naming convention adopted in SAIGE-QTL):

Single-variant association testing (common variants, step 2):

In script (saige_assoc.py):

In config (under [saige.sv_test] in saige_assoc_test.toml):

Obtain gene-level p-values (common variants only, step 3):

Set-based association testing (rare variants, step 2):

In script (saige_assoc_set_test.py):

In config (under [saige.set_test] in saige_assoc_test.toml):
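As a rough sketch of how a config section such as [saige.sv_test] might be turned into command-line flags, the helper below walks a parsed config (the nested dict a TOML parser would return) and renders each key as `--key=value`. The helper name and the placeholder keys in the usage example are hypothetical and are not real SAIGE-QTL parameters; rendering booleans as TRUE/FALSE assumes the receiving script is R-based.

```python
def section_to_flags(config: dict, dotted_section: str) -> list:
    """Render one config section (e.g. 'saige.sv_test') as '--key=value' flags."""
    section = config
    for part in dotted_section.split("."):
        section = section[part]  # descend one table level per dotted name
    flags = []
    for key, value in section.items():
        if isinstance(value, bool):
            value = str(value).upper()  # True -> 'TRUE' (assumed R-style booleans)
        flags.append(f"--{key}={value}")
    return flags


# Usage with placeholder keys (not real SAIGE-QTL flags):
example_config = {"saige": {"sv_test": {"placeholderFlag": 1, "otherPlaceholder": False}}}
example_flags = section_to_flags(example_config, "saige.sv_test")
```

Keeping the keys in the config identical to the tool's own flag names means no translation table is needed between the two.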

To run

Instructions to run each component of the pipeline using analysis runner are provided at the top of each script.

Briefly, to run both the common and rare variant pipelines, the order is:

  1. get_genotype_vcf.py
  2. get_sample_covariates.py (does not require any other part of the pipeline and can be run in parallel with 1)
  3. get_anndata.py (requires 2)
  4. saige_assoc.py (requires 1 and 3)
  5. make_group_file.py (requires 1 and 3)
  6. saige_assoc_set_test.py (requires 1, 3 and 5, and also 4 so that step 1 is only run once)
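The dependencies in the list above can be written down as a graph and grouped into waves of scripts that can run in parallel. This is only an illustration using Python's standard-library graphlib (Python 3.9+); the pipeline itself is launched through analysis runner, and `run_waves` is a hypothetical helper.

```python
from graphlib import TopologicalSorter

# Each script maps to the set of scripts it requires (from the list above).
DEPENDENCIES = {
    "get_genotype_vcf.py": set(),
    "get_sample_covariates.py": set(),
    "get_anndata.py": {"get_sample_covariates.py"},
    "saige_assoc.py": {"get_genotype_vcf.py", "get_anndata.py"},
    "make_group_file.py": {"get_genotype_vcf.py", "get_anndata.py"},
    "saige_assoc_set_test.py": {
        "get_genotype_vcf.py",
        "get_anndata.py",
        "saige_assoc.py",
        "make_group_file.py",
    },
}


def run_waves(dependencies):
    """Group scripts into waves; every script in a wave has all its inputs ready."""
    sorter = TopologicalSorter(dependencies)
    sorter.prepare()
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # sorted only for stable output
        waves.append(ready)
        sorter.done(*ready)
    return waves
```

Calling `run_waves(DEPENDENCIES)` yields four waves: the genotype and sample covariate scripts first, then get_anndata.py, then saige_assoc.py alongside make_group_file.py, and finally saige_assoc_set_test.py.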

Data

TenK10K is matched single-cell RNA-seq (scRNA-seq) and whole-genome sequencing (WGS) data from up to 10,000 individuals:

Additional resources