GEM (Gene-Environment interaction analysis for Millions of samples) is a software program for large-scale gene-environment interaction testing in samples from unrelated individuals. It enables genome-wide association studies in up to millions of samples while allowing for multiple exposures, control for genotype-covariate interactions, and robust inference.
Current version: 1.5.3
Additional documentation:
https://large-scale-gxe-methods.github.io/GEMShowcaseWorkspace
Option 1: Use the binary executable file for Linux
Option 2: Build GEM
Library Dependencies
C/C++ Compiler
LAPACK and BLAS
Intel processors:
AMD processors:
Boost C++ Libraries (boost_program_options, boost_thread, boost_system, and boost_filesystem)
Eigen Library
To install GEM, run the following lines of code:
git clone https://github.com/large-scale-gxe-methods/GEM
cd GEM/
cd src/
make
Once GEM is installed, the executable ./GEM can be used to run the program.
For a list of options, use ./GEM --help.
The phenotype file should contain a sample identifier column and columns for the phenotypes, exposures, and covariates. The ordering of the columns does not matter. All inputs should be coded numerically (e.g., males/females as 0/1).
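As an illustration of the numeric coding requirement, the following Python sketch writes a small phenotype file. The column names mirror the example command in this README (sampleid, pheno2, cov1, cov3). The comma delimiter, the text-to-number coding map, and the helper name write_pheno_file are assumptions made for illustration and are not part of GEM. Missing values are written as NaN, matching the --missing-value NaN setting used in the example command.

```python
import csv
import io

# Hypothetical coding map: the README only requires numeric coding (e.g., 0/1).
SEX_CODES = {"male": 0, "female": 1}

def write_pheno_file(rows, handle, missing="NaN"):
    """Write rows (a list of dicts) as a comma-delimited phenotype table."""
    columns = ["sampleid", "pheno2", "cov1", "cov3"]
    writer = csv.writer(handle, lineterminator="\n")
    writer.writerow(columns)
    for row in rows:
        out = []
        for col in columns:
            value = row.get(col)
            if col == "cov3" and isinstance(value, str):
                value = SEX_CODES[value.lower()]  # recode text labels to 0/1
            out.append(missing if value is None else value)
        writer.writerow(out)

# Made-up records for illustration only.
rows = [
    {"sampleid": "S1", "pheno2": 1.25, "cov1": 0.5, "cov3": "male"},
    {"sampleid": "S2", "pheno2": None, "cov1": 1.5, "cov3": "female"},
]
buffer = io.StringIO()
write_pheno_file(rows, buffer)
pheno_text = buffer.getvalue()
```

To write to disk instead of an in-memory buffer, pass a handle opened with open(path, "w", newline="").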
BGEN
Variants that are non-biallelic should be filtered from the BGEN file. Note that since there is no indication of a REF/ALT allele in the BGEN file, the second allele is the effect allele counted in association testing.
A .sample file is required as input when the .bgen file does not contain a sample identifier block.
.fam - The .fam file can be space or tab-delimited and must contain at least 2 columns where the first column is the family ID (FID) and the second column is the individual ID (IID). GEM will use the IID column for sample identifier matching with the phenotype file.
.bim - The .bim file can also be space or tab-delimited, and its columns should be in the following order: chromosome, variant ID, cM (optional), base-pair coordinate, ALT allele, and REF allele.
.bed - A bed file must be stored in variant-major form. The ALT allele specified in the .bim file is the effect allele counted in association testing.
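The .bim column order described above can be checked with a short Python sketch. It reads the description literally: whitespace-delimited fields, with the cM column either present (6 fields) or absent (5 fields). The names BimRecord and parse_bim_line, and the variant ID rs123, are hypothetical; this is not GEM's own parser.

```python
from typing import NamedTuple

class BimRecord(NamedTuple):
    chrom: str
    variant_id: str
    pos: int
    alt: str  # effect allele counted in association testing
    ref: str

def parse_bim_line(line):
    """Parse one .bim line (space- or tab-delimited, cM column optional)."""
    fields = line.split()
    if len(fields) == 6:
        chrom, variant_id, _cm, pos, alt, ref = fields
    elif len(fields) == 5:
        chrom, variant_id, pos, alt, ref = fields
    else:
        raise ValueError(f"expected 5 or 6 columns, got {len(fields)}")
    return BimRecord(chrom, variant_id, int(pos), alt, ref)

# Made-up variant line for illustration only.
record = parse_bim_line("1\trs123\t0\t10583\tA\tG")
```

Here record.alt holds the allele that would be reported as Effect_Allele, and record.ref the Non_Effect_Allele.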
.psam - The .psam file is a tab-delimited text file containing the sample information. If header lines are present, the last header line should contain a column with the name #IID (if the first column is not #FID) or IID (if the first column is #FID) that holds the individual ID for sample identifier matching with the phenotype file. All previous header lines will be ignored. If no header line beginning with #IID or #FID is present, then the columns are assumed to be in .fam file order.
.pvar - The .pvar file is a tab-delimited text file containing the variant information. If header lines are present, the last header line should start with #CHROM. If #CHROM is present, then the columns POS, ID, REF, and ALT must also be present. All previous header lines will be ignored. If the .pvar file contains no header line beginning with #CHROM, it is assumed that the columns are in .bim file order.
.pgen - The .pgen file should be filtered for non-biallelic variants. The ALT allele specified in the .pvar file is the effect allele counted in association testing.
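The .psam header rules above can be sketched as a small Python function that locates the individual ID column. This is one reading of the rules as written, not GEM's implementation; the function name find_iid_column and the sample lines are hypothetical.

```python
def find_iid_column(psam_lines):
    """Return the 0-based index of the individual-ID column in a .psam file."""
    header = None
    for line in psam_lines:
        if not line.startswith("#"):
            break          # the first data line ends the header block
        header = line      # keep only the last header line seen
    if header is not None:
        columns = header.rstrip("\n").split("\t")
        if columns[0] == "#FID" and "IID" in columns:
            return columns.index("IID")
        if columns[0] == "#IID":
            return 0
    return 1               # fall back to .fam order: FID, IID, ...

# Made-up .psam contents for illustration only.
with_fid = ["#FID\tIID\tSEX\n", "F1\tS1\t1\n"]
without_fid = ["##note\n", "#IID\tSEX\n", "S1\t1\n"]
no_header = ["F1\tS1\t0\t0\t1\t-9\n"]
```

The returned index identifies the column whose values would be matched against the sample identifier column of the phenotype file.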
GEM will write results to the output file specified with the --out parameter (or 'gem.out' if no output file is specified).
Below are details of the possible column headers in the output file.
SNPID - The SNP identifier as retrieved from the genotype file.
RSID - The reference SNP ID number. (BGEN only)
CHR - The chromosome of the SNP.
POS - The physical position of the SNP.
Non_Effect_Allele - The allele not counted in association testing.
Effect_Allele - The allele that is counted in association testing.
N_Samples - The number of samples without missing genotypes.
AF - The allele frequency of the effect allele.
N_catE_* - The number of non-missing samples in each combination of strata for all of the categorical exposures and interaction covariates.
AF_catE_* - The allele frequency of the effect allele in each combination of strata for all of the categorical exposures and interaction covariates.
Beta_Marginal - The coefficient estimate for the marginal genetic effect (i.e., from a model with no interaction terms).
SE_Beta_Marginal - The model-based SE associated with the marginal genetic effect estimate.
robust_SE_Beta_Marginal - The robust SE associated with the marginal genetic effect estimate.
Beta_G - The coefficient estimate for the genetic main effect (G).
Beta_G-* - The coefficient estimate for the interaction or interaction covariate terms.
SE_Beta_G - Model-based SE associated with the genetic main effect (G).
SE_Beta_G-* - Model-based SE associated with any GxE or interaction covariate terms.
robust_SE_Beta_G - Robust SE associated with the genetic main effect (G).
robust_SE_Beta_G-* - Robust SE associated with any GxE or interaction covariate terms.
Cov_Beta_G_G-* - Model-based covariance between the genetic main effect (G) and any GxE or interaction covariate terms.
Cov_Beta_G-*_G-* - Model-based covariance between any GxE or interaction covariate terms.
robust_Cov_Beta_G_G-* - Robust covariance between the genetic main effect (G) and any GxE or interaction covariate terms.
robust_Cov_Beta_G-*_G-* - Robust covariance between any GxE or interaction covariate terms.
P_Value_Marginal - Marginal genetic effect p-value from model-based SE.
P_Value_Interaction - Interaction effect p-value (K degrees of freedom test of interaction effect) from model-based SE, where K is the number of major exposures.
P_Value_Joint - Joint test p-value (K+1 degrees of freedom test of genetic and interaction effect) from model-based SE.
robust_P_Value_Marginal - Marginal genetic effect p-value from robust SE.
robust_P_Value_Interaction - Interaction effect p-value from robust SE.
robust_P_Value_Joint - Joint test p-value (K+1 degrees of freedom test of genetic and interaction effect) from robust SE.
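To make the degrees of freedom concrete, the Python sketch below recomputes an interaction and a joint p-value from the reported coefficients, standard errors, and covariance for a single exposure (K = 1). It assumes Wald chi-square tests, which this README does not state explicitly, so the values may not match GEM's output exactly. The function names are hypothetical.

```python
import math

def chi2_sf(stat, df):
    """Upper-tail chi-square probability for df = 1 or 2 (closed forms)."""
    if df == 1:
        return math.erfc(math.sqrt(stat / 2.0))
    if df == 2:
        return math.exp(-stat / 2.0)
    raise ValueError("only df 1 and 2 are handled in this sketch")

def interaction_p(beta_gxe, se_gxe):
    """1-df Wald test of the single G x E coefficient (assumed test form)."""
    return chi2_sf((beta_gxe / se_gxe) ** 2, 1)

def joint_p(beta_g, beta_gxe, se_g, se_gxe, cov_g_gxe):
    """2-df Wald test of (Beta_G, Beta_G-E) using their 2x2 covariance."""
    var_g, var_gxe = se_g ** 2, se_gxe ** 2
    det = var_g * var_gxe - cov_g_gxe ** 2
    # Quadratic form b' V^{-1} b written out for the 2x2 case.
    stat = (beta_g ** 2 * var_gxe
            - 2.0 * beta_g * beta_gxe * cov_g_gxe
            + beta_gxe ** 2 * var_g) / det
    return chi2_sf(stat, 2)
```

Passing the model-based columns (SE_Beta_G, SE_Beta_G-*, Cov_Beta_G_G-*) corresponds to the model-based p-values, and passing the robust_* columns corresponds to the robust ones, under the same assumption.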
The --output-style flag can be used to specify which columns should be included in the output file:
minimum - Includes the variant information, Beta_Marginal, SE_Beta_Marginal, coefficient estimates for only the GxE terms, and, depending on the --robust option, SE and covariance for only the GxE terms.
meta - Includes each of the possible outputs listed above when applicable. For a model-based analysis (--robust 0), the columns containing the "robust" prefix (robust_*) are excluded from the output file.
full - Includes, in addition to "meta", an initial header line with the residual variance estimate necessary for re-analysis of a subset of interactions using only summary statistics (for example, switching an exposure and interaction covariate).
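Whichever columns are written, the output can be post-processed with a few lines of Python. The sketch below assumes a tab-delimited file whose first line is the column header, which does not cover the extra initial line that precedes the header when the residual variance estimate is included. The function name significant_rows and the demo rows are hypothetical, and the 5e-8 threshold is a conventional genome-wide choice rather than something GEM prescribes.

```python
import csv
import io

def significant_rows(handle, p_column="P_Value_Interaction", threshold=5e-8):
    """Yield output rows whose p-value column is below the threshold."""
    reader = csv.DictReader(handle, delimiter="\t")
    for row in reader:
        if float(row[p_column]) < threshold:
            yield row

# Made-up rows for illustration only; not real GEM results.
demo = io.StringIO(
    "SNPID\tCHR\tPOS\tP_Value_Interaction\n"
    "snp1\t1\t1000\t0.2\n"
    "snp2\t1\t2000\t1e-9\n"
)
hits = [row["SNPID"] for row in significant_rows(demo)]
```

For a robust analysis, p_column="robust_P_Value_Interaction" selects on the robust p-value instead.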
To run GEM using the example data, execute GEM with the following code.
./GEM --bgen example.bgen --sample example.sample --pheno-file example.pheno --sampleid-name sampleid --pheno-name pheno2 --covar-names cov3 --exposure-names cov1 --robust 1 --center 0 --missing-value NaN --out my_example.out
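When GEM is launched from a script, the same command can be assembled programmatically. The Python sketch below uses only flags that appear in this README and omits --center and --missing-value for brevity. The helper name build_gem_command is hypothetical, and passing several exposure or covariate names as separate space-delimited tokens is an assumption, since the example above shows only one name per flag.

```python
import shlex

def build_gem_command(bgen, sample, pheno_file, sampleid_name, pheno_name,
                      exposure_names, covar_names=(), robust=1,
                      out="gem.out", executable="./GEM"):
    """Assemble a GEM argument list using only flags shown in this README."""
    cmd = [executable, "--bgen", bgen, "--sample", sample,
           "--pheno-file", pheno_file, "--sampleid-name", sampleid_name,
           "--pheno-name", pheno_name]
    if covar_names:
        cmd += ["--covar-names", *covar_names]
    cmd += ["--exposure-names", *exposure_names,
            "--robust", str(robust), "--out", out]
    return cmd

cmd = build_gem_command("example.bgen", "example.sample", "example.pheno",
                        "sampleid", "pheno2", ["cov1"], ["cov3"],
                        out="my_example.out")
command_line = shlex.join(cmd)  # shell-quoted string, handy for logging
```

The list can then be passed to subprocess.run(cmd, check=True) from the directory containing the GEM executable.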
The results should look like the following output file my_example.out.
Version 1.5.3 - May 20, 2024:
Version 1.5.2 - August 16, 2023:
Version 1.5.1 - April 20, 2023:
Version 1.5 - March 9, 2023:
Version 1.4.5 - November 11, 2022:
Version 1.4.4 - October 5, 2022:
Version 1.4.3 - March 23, 2022:
Version 1.4.2 - November 22, 2021:
Version 1.4.1 - September 14, 2021:
Version 1.4 - July 2, 2021:
Version 1.3 - April 7, 2021:
Version 1.2 - January 22, 2021:
Version 1.1 - July 21, 2020:
For comments, suggestions, bug reports and questions, please contact Han Chen (Han.Chen.2@uth.tmc.edu), Alisa Manning (AKMANNING@mgh.harvard.edu), Kenny Westerman (KEWESTERMAN@mgh.harvard.edu) or Samaneh Salehi Nasab (Samaneh.SalehiNasab@uth.tmc.edu). For bug reports, please include an example to reproduce the problem without having to access your confidential data.
If you use GEM in your analysis, please cite
GEM : Gene-Environment interaction analysis for Millions of samples
Copyright (C) 2018-2024 Liang Hong, Han Chen, Duy Pham, Cong Pan, Samaneh Salehi Nasab
This program is free software: you can redistribute it and/or modify
it under the terms of the GNU General Public License as published by
the Free Software Foundation, either version 3 of the License, or
(at your option) any later version.
This program is distributed in the hope that it will be useful,
but WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
GNU General Public License for more details.
You should have received a copy of the GNU General Public License
along with this program. If not, see <http://www.gnu.org/licenses/>.
The GEM package is distributed under GPL (>= 3). It includes source code from open source third-party software:
libdeflate: MIT
Plink: LGPLv3+
Zstandard (zstd): BSD_3_clause | GPL-2
The binary release of GEM also links to third-party libraries:
Boost: Boost Software License, Version 1.0
Eigen: Mozilla Public License, Version 2.0
Intel oneAPI Math Kernel Library (oneMKL): Intel Simplified Software License (Version October 2022 or later)
Full copies of the license agreements for GEM, the third-party source code, and the linked libraries can be found here.