This is a Hail batch pipeline to run SAIGE-QTL on CPG's GCP, to map associations between both common and rare genetic variants and single-cell gene expression from peripheral mononuclear blood cells (PBMCs). First, this will be run on the TOB (AKA OneK1K) and then BioHEART datasets as part of phase 1 of the TenK10K project (see Data below), but in the future all datasets within OurDNA (with scRNA-seq + WGS data) will go through this pipeline as well.
The pipeline is split into three main parts, to make for more flexible usage:
Additionally, two helper scripts are also part of this pipeline:
Script: get_genotype_vcf.py
Variant selection for VCF files:
--exclude-indels
and --exclude-multiallelic
).Variant selection for PLINK files for variance ratio estimation (VRE):
Inputs:
Outputs:
.csi
).csi
)Notes: SAIGE-QTL allows numeric chromosomes only, so both the .bim and the VCF files are modified in this script to remove the 'chr' notation (so that e.g., 'chr1' becomes '1').
Script: get_sample_covariates.py
Inputs:
Outputs:
Notes: option to fill in missing values for sex (0, where 1 is male, 2 is female) and age (average age across the cohort). Additionally, add a user-specified (default: 10) number of permuted IDs, where the individual ID is permuted at random, to assess calibration (by shuffling the individual IDs we disrupt any real association between genotype and phenotype, so we expect no significant associations left when testing).
Script: get_anndata.py
Inputs:
Outputs:
Notes: as before, we remove 'chr' from the chromosome name in the gene cis window file. Additionally, we turn hyphens ('-') into underscores ('_') in the gene names. Both the AnnData objects and cell covariate files are generated on Garvan's HPC and copied over to GCP.
Script: make_group_file.py
Inputs:
Outputs
Notes: option to include no weights or to compute weights that reflect the distance of each variant from the gene's transcription start site (dTSS
).
Using one of the flags below it is possible to additionally test using equal weights.
We use no annotations for now (set to null
).
Script: saige_assoc.py
Run this for single-variant tests (typically for common variants).
Inputs:
Outputs:
Script: saige_assoc_set_test.py
Run this for set-based tests (typically for rare variants).
Inputs:
Outputs:
Clarifying the reasoning behind the parameters / flags used to run SAIGE-QTL. Most of these are (or will be) included in the official documentation.
Note: some of these are provided as arguments in the scripts (saige_assoc.py
, saige_assoc_set_test.py
), but most are provided as a separate config file (saige_assoc_test.toml
). TO DO: update styling of flags to reflect this below.
Fit null model (step 1).
In script (saige_assoc.py
, saige_assoc_set_test.py
, using standard Snake Case naming convention as in the rest of the scripts):
pheno_file
: path specifying the location of the phenotype covariate file described above (built during part 2 of the pipeline).output_prefix
: path to where the output files from step 1 (which will be read by step 2) should be written to.plink_path
: path to VRE plink files (specify just the prefix, but a .bim, .fam, and .bed files with the same prefix and in the same location should exist -- these are built in part 1).pheno_col
: specify the column that should be used as phenotype, in our case the gene to test.In config (under [saige.build_fit_null]
in saige_assoc_test.toml
, using the Camel Case naming convention adopted in SAIGE-QTL):
covarColList
: string specifying the columns of the pheno_file that should be used as covariates (separated by a comma, no spaces).sampleCovarColList
: same as above, but specifying only, out of the columns above, the ones that are well defined at individual level (e.g., sex, age, ancestry PCs). Both this and the above need to be specified, and this is always a subset of the above, which allows individual-level covariates to be processed more cheaply.sampleIDColinphenoFile
: specify the column that represents individual IDs (note, for calibration analysis use one of the permuted ids here).traitType
: specify the model to be used, count
is the Poisson model which should be used here.skipVarianceRatioEstimation
: boolean specifying whether the variance ratio estimation should be run or skipped, should always be false in our case (note that because the syntax of booleans is different between Python and R we encode this as the string FALSE
instead of the boolean False
).isRemoveZerosinPheno
: option to remove 0s from phenotype vector (default: FALSE
as it does not make sense for the very sparse scRNA-seq data, i.e. these are not missing data!).useSparseGRMtoFitNULL
: use sparse GRM to account for relatedness. This is implemented but would require a step0 in the pipeline to first construct this, which is not there at the moment (default: FALSE
).useGRMtoFitNULL
: same as above, but without the sparse option (default: FALSE
).isCovariateOffset
: use covariates as offset allows to only estimate coefficients once and reduce cost (check that the model retains calibration, in which case set to TRUE
for cost optimisation).isCovariateTransform
: transform (explain) covariates (default: TRUE
).skipModelFitting
: boolean (default: FALSE
).tol
: convergence tolerance (default: 0.00001, which works well in our hands, we could test with a larger number to decrease cost).maxiterPCG
: convergence max number of iterations (default: 500 but increase to 5,000 for the specific genes whose job does not converge).IsOverwriteVarianceRatioFile
: if the file already exists, skip or overwrite, default is the latter (i.e. default: TRUE
).Single-variant association testing (common variants step 2):
In script (saige_assoc.py
):
vcf_file
: path to VCF file containing genetic variants to test. An index file that is called exactly the same, with .csi
at the end, needs to exist at the same location, but is not passed in as an argument.sv_output_path
: path to output file.chrom
: chromosome to test.cis_window_file
: path to file specifying cis window / region to test (generated in part 2 of the pipeline, get anndata script).gmmat_model_path
: path to estimated null model (.rda) generated in step 1.variance_ratio_path
: path to variance ratio txt file generated in step 1.In config (under [saige.sv_test]
in saige_assoc_test.toml
):
vcfField
: DS for dosages, GT for genotypes (default: GT
).minMAF
: minimum minor allele frequency (MAF) (default: 0).minMAC
: minimum minor allele count (MAC) (default: 5). Note that if this filter is discordant with the above one, the more stringent one will be applied (max between the two will be used).LOCO
: boolean specifying whether leave-one-chromosome-out should be used (default: FALSE
).markers_per_chunk
: internal parameter to batchify variants tested (default: 10000). Confusingly the naming is Snake Case here.SPAcutoff
: internal parameter to do with the saddlepoint approximation, does not make much of a difference for us (default: 10000).Obtain gene-level p-values (common variants only, step 3)
gene_name
: gene to aggregate values for,saige_sv_output_file
: path to output from step 2 (input here),saige_gene_pval_output_file
: path to output (step 3).Set-based association testing (rare variants step 2):
In script (saige_assoc_set_test.py
):
vcf_file
: path to VCF file containing genetic variants to test. As above, an index file that is called exactly the same, with .csi
at the end, needs to exist at the same location, but is not passed in as an argument.set_output_path
: path to output file (if kept the same as single-variant test, step 1 only needs to be run once)chrom
: chromosome to test.group_file
: path to file specifying variants to test (generated in the make_group_file.py
script).gmmat_model_path
: path to estimated null model (.rda) generated in step 1.variance_ratio_path
: path to variance ratio txt file generated in step 1.In config (under [saige.set_test]
in saige_assoc_test.toml
):
vcfField
: DS for dosages, GT for genotypes (default: GT
).maxMAF_in_groupTest
: max MAF of variants that should be included in the set test (default: 0.1).minMAF_in_groupTest_Exclude
: min MAF of variants to exclude from the set test (default: 0).annotation_in_groupTest
: annotations from the group file to test as separate sets (default: null
).MACCutoff_to_CollapseUltraRare
: minor allele counts (MAC) cutoff under which variants are considered ultrarare and get collapsed (default: 10).is_single_in_groupTest
: run single-variant tests for variants in group file as well (default: TRUE
).is_equal_weight_in_groupTest
: test using equal weights as well as using a Beta(1,25) distribution which is used by default; if set to true, both results get output (default: TRUE
).Instructions to run each component of the pipeline using analysis runner are provided at the top of each script.
Briefly, if one wanted to run both common and rare variant pipelines, the order of running would be:
TenK10K is matched single-cell RNA-seq (scRNA-seq) and whole-genome sequencing (WGS) data from up to 10,000 individuals: