In this GitHub repository we provide code and a step-by-step guideline for the epiTree pipeline, to extract epistatic higher-order interactions from genotype - phenotype data.
A detailed explanaition of the pipeline can be found at the following manuscript
Behr M, Kumbier K, Cordova-Palomera A, Matthew Aguirre M, Ashley E, Butte E, Arnaout R, Brown B, Priest J†, Yu B† Learning epistatic polygenic phenotypes with boolean interactions https://www.biorxiv.org/content/10.1101/2020.11.24.396846v1
To run the pipeline, as it is presented in the following, the following software requirements are needed
The epiTree pipeline requires the following data input:
The first step of the epiTree pipeline is to perform a biologically inspired dimension reduction step by imputing tissue specific gene expression levels from SNP data, which can be done with the PrediXcan software [1]. Depending on the PrediXcan database that is used, this reduces the feature dimension from several million SNPs to a few thousand gene-level feautures.
A detailed description for how to obtain imputed gene expression levels, is provided by the authors of PrediXcan, see https://github.com/hakyimlab/PrediXcan and manuscript [1]. For sake of completeness, we provide a step-by-step description below.
To speed up computation time and avoid memory issues, in a first step we will create new Plink files that only contain the SNPs that actually enter in the particular PrediXcan model. Toward this end, we first obtain a list of all SNPs in a given PrediXcan data base. An R script for generating this list is provided in
scripts/utilities_snpFromDB.R
Note that you might need to adjust name.db
and path.db
to provide the correct name and path for your database file.
Given the list of SNPs from the previous step, we generate new Plink files that only contain those SNPs using the script:
scripts/submit_plinkSubset.sh
You will need to modify the script above to specify:
To consider a subset of subjects from your original Plink file, adjust the .fam file to include the desired subjects. Otherwise, you can take the .fam file from your input Plink files. Recall that your input Plink files are assumed to be one (bim/bam/fam) file per chromosome, named as name_plink_file_chr1, name_plink_file_chr2, etc..
You will need to run this script for each chromosome separately, so for example, when you consider chromosome 1 - 22, you can run
for i in {1..22}; do ./submit_plinkSubset.sh $i; done
PrediXcan requires genotype input files as dosage files. The following script submits jobs that transform the Plink files from the previous step into dosage files, using the python script “convert_plink_to_dosage.py” from https://github.com/hakyimlab/PrediXcan/blob/master/Software/convert_plink_to_dosage.py.
scripts/submit_plink2dos.sh
You will need to modify the script above to specify:
Given dosage files from the previous step, we can run PrediXcan. The following script submits a job for this.
scripts/submit_dos2predix.sh
Note that you will need to run the script only once and not for each chromosome separately (it is assumed that there is one dosage file per chromosome, as was generated by the previous script). You will need to modify the script above to specify:
The PrediXcan.py
file that we provide in the scripts
folder here, is the one that we used for our analysis on the red-hair phenotpye, see [2], you may download the latest version at https://github.com/hakyimlab/MetaXcan (see also https://github.com/hakyimlab/PrediXcan).
With imputed gene expression data from the previous step, we can run the gene level analysis of the epiTree pipeline. First, we split data into training and test sets. As an example, in the following scripts we select a balanced sample of 26K training and 4K test samples, as in [2]. For different sample size, the scripts can be easily adapted.
As a first step, using the training data only, we extract candidate interactions on the gene level using the iRF pipeline [3], [4]. The following R script uses the iRF R package to do this. It also runs two competing prediction methods, namely penalized logistic regression with L1 penality using the glm R package and random forest as in the ranger R package implementation with default parameters.
scripts/analysis_iRF_gene.R
Note that, as before, you will need specify
In our analysis, a small subset of genes were excluded and are specified in the file /data/geneExclusionList.txt
. The path to this file is specified via the path.excl argument. However, none of those genes was present in the PrediXcan database /data/gtex_v7_Skin_Sun_Exposed_Lower_leg_imputed_europeans_tw_0.5_signif.db
used for our red hair analysis. In practice, one can modify this file accordingly.
For the candidate interactions from the previous step, we can now compute CART based PCS p-values, using the test data to evaluate their significane. Note that this step also requires the training data from the previous step, as it is used to fit the individual CART components. This is done in the following R script, which also computes standard p-values from logistic regression with a multiplicative interaction term for comparison, using the glm and lmtest R packages. As before, you will need to adjust the paths accordingly.
scripts/analysis_pcsPvalues_gene.R
We stress that PCS p-values can also be computed independently from the iRF model selection step of the previous section. An implementation of CART based PCS p-values can be found in the script scripts/utilities_tests.R
.
From the candidate interactions between imputed gene expression features, we can go back to the SNP level, to seach for interactions between SNPs that correspond to genes extracted from the previous step.
The following R script uses the results from the previous step to extract the coordinates of each gene (start/end location +/- 1K base pairs) that appears in an iRF interaction or in the top 50 genes in terms of Gini importance. BAsed on these coordinates, the script extracts a list of all SNPs within those regions from the input Plink files. As before, one has to specify the path of the plink files and PrediXcan database accordingly.
scripts/utilities_extractSNPsfromGenes.R
Given this list of SNPs, one can use again the script submit_plinkSubset.sh
to generate Plink files that only contain those SNPs.
Given the Plink files from the previous step, with reduced number of SNPs, we can now again use iRF to search for stable interactions on the SNP level, as done by the following script (with paths specified accordingly).
scripts/analysis_iRF_snp.R
For the candidate interactions from the previous step, we can now compute PCS p-values on the SNP level, as in the following script (with paths specified accordingly).
scripts/analysis_pcsPvalues_snp.R
All results and output files are stored in the results
folder.
We have added the results that we obtained in our red hair analysis [2] to this github repository and demonstrate some visualization and reproduction of figures from [2] in the R Jupyter Notebook red_hair_analysis.ipynb.
[1] Gamazon ER†, Wheeler HE†, Shah KP†, Mozaffari SV, Aquino-Michaels K, Carroll RJ, Eyler AE, Denny JC, Nicolae DL, Cox NJ, Im HK. (2015) A gene-based association method for mapping traits using reference transcriptome data. Nat Genet. doi:10.1038/ng.3367. (Link to paper, Link to Preprint on BioRxiv)
[2] Behr M, Kumbier K, Cordova-Palomera A, Matthew Aguirre M, Ashley E, Butte E, Arnaout R, Brown B, Priest J†, Yu B† (2020) Learning epistatic polygenic phenotypes with boolean interactions
[3] Basu S, Kumbier K, Brown B, Yu B (2018) Iterative random forests to discover predictive and stable high-order interactions. PNAS. https://doi.org/10.1073/pnas.1711236115
[4] Kumbier K, Basu S, Brown B, Celniker S, Yu B (2018). Refining interaction search through signed iterative Random Forests https://arxiv.org/abs/1810.07287