signet4grn / SIGNET

Construct genome wide gene regulatory networks
https://signet4grn.readthedocs.io/en/latest/

SIGNET User's Manual

Reference to SIGNET

SIGNET is based on the paper Jiang et al. (2023). The core of the current version of SIGNET is to use the two-stage penalized least squares (2SPLS) method proposed by Chen et al. (2018) to construct genome-wide gene regulatory networks (GW-GRNs). An application of 2SPLS to yeast data can be found in Chen et al. (2019).

While the current version of SIGNET constructs the GW-GRN with transcriptomic and genotype data collected from one population, we are developing SIGNET to simultaneously construct and compare GW-GRNs for two or more populations, in order to (i) establish common regulations shared across different populations with more power; and (ii) identify population-specific (e.g., cancer-specific) gene regulations more effectively.

System Requirement

SIGNET runs on a UNIX bash shell. Check your shell with echo $SHELL to make sure that you are running on UNIX bash shell. SIGNET uses the Slurm Workload Manager for high performance computing (HPC) clusters in its stage of constructing the gene regulatory network in parallel.
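
A quick pre-flight check along the following lines can confirm both requirements before you start. It is only a sketch: check_tool is a helper name introduced here for illustration, and the check assumes that Slurm, when installed, exposes its standard sbatch command on your PATH.

```shell
# check_tool NAME: print "found" if NAME is an executable on PATH, else "missing".
# (check_tool is a hypothetical helper, not part of SIGNET.)
check_tool() {
    if command -v "$1" >/dev/null 2>&1; then
        echo "found"
    else
        echo "missing"
    fi
}

echo "login shell: ${SHELL:-unknown}"     # should end in /bash
echo "bash:        $(check_tool bash)"
echo "sbatch:      $(check_tool sbatch)"  # Slurm job submission command
```

If sbatch is reported as missing, the preprocessing steps may still work, but the network construction stage described later relies on Slurm.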

Quick Installation of SIGNET

First, clone the repository to a path on your server, then add the installation path to your PATH so that you can run the command directly without specifying its full path.

git clone https://github.com/signet4grn/SIGNET.git
cd SIGNET
export PATH=/path/to/signet:$PATH

where /path/to/signet should be replaced with your path to SIGNET.

Installation of Required Packages

SIGNET depends on several packages such as PLINK, IMPUTE2, and R (with its libraries). While you may install all of these packages yourself, we also provide a Singularity container, signet.sif, which bundles all the packages required by SIGNET. The container provides an environment in which SIGNET can run smoothly, so you don't have to install any of the required packages separately.

Before using the Singularity container signet.sif, you first have to install Singularity following https://sylabs.io/guides/3.8/user-guide/quick_start.html#quick-installation-steps.

You can pull the image from our repository and rename it as signet.sif, after which you can append the SIGNET path to the container's PATH so that it can execute SIGNET smoothly. You may also need to bind a path in case the container doesn't recognize your files. The environment variables have to be exported every time you start a new terminal.

singularity pull library://jiang548/signet/signet:0.0.6
export SINGULARITYENV_APPEND_PATH="/path/to/signet"
export SINGULARITY_BIND="/path/to/bind"

where /path/to/signet should be replaced with your path to SIGNET, and /path/to/bind should be replaced with the desired path to bind.
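
Because these variables must be exported in every new terminal, one option is to keep them in a small file and source it at the start of each session. The sketch below assumes a file name (signet_env.sh) and two variable names (SIGNET_HOME, SIGNET_BIND) that are introduced here purely for illustration; only the two exported SINGULARITY variables come from the instructions above.

```shell
# signet_env.sh (hypothetical file name): source it in each new terminal with
#   . ./signet_env.sh
# SIGNET_HOME and SIGNET_BIND are illustrative names, not SIGNET settings.
SIGNET_HOME="${SIGNET_HOME:-/path/to/signet}"   # replace with your path to SIGNET
SIGNET_BIND="${SIGNET_BIND:-/path/to/bind}"     # replace with the path to bind

export SINGULARITYENV_APPEND_PATH="$SIGNET_HOME"
export SINGULARITY_BIND="$SIGNET_BIND"

echo "SINGULARITYENV_APPEND_PATH=$SINGULARITYENV_APPEND_PATH"
echo "SINGULARITY_BIND=$SINGULARITY_BIND"
```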

You can use the image by adding a prefix to the original commands you want to execute, which are described in detail in the sections below.

singularity exec signet.sif [Command]
e.g. 
singularity exec signet.sif signet -s 

Or you could first shell into the container by

singularity shell signet.sif

and then execute all the commands as usual.

Caution: By default, the intermediate results of each step are written to the corresponding temporary folders, whose names start with 'tmp', and the final results are written to the result folders, whose names start with 'res'. You can change the paths of the result files in the configuration file config.ini, or with signet -s as described below. Please be careful if you use relative paths instead of absolute paths: for consistent file management, config.ini records paths relative to the folder where SIGNET is installed, so it is highly recommended to run commands from that folder. In each step you can specify the result path, and if temporary files already exist you will be asked whether to purge them. We also suggest keeping a copy of the temporary files for each analysis, in case you need them in later steps. Please run one analysis at a time under the same folder, as a later run will overwrite the previous temporary files.
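
One way to keep a copy of the temporary files between analyses is sketched below. It assumes that the temporary folders are the ones whose names start with tmp inside the folder you pass in; backup_tmp is a helper name made up for this example and is not a SIGNET command.

```shell
# backup_tmp SRC_DIR DEST_DIR: copy every folder in SRC_DIR whose name starts
# with "tmp" into DEST_DIR. (backup_tmp is a hypothetical helper, not part of SIGNET.)
backup_tmp() {
    src=$1
    dest=$2
    mkdir -p "$dest" || return 1
    for d in "$src"/tmp*; do
        [ -d "$d" ] || continue      # skip when nothing matches or it is not a folder
        cp -R "$d" "$dest"/ || return 1
    done
}

# Example (both paths are placeholders):
#   backup_tmp /path/to/signet /path/to/backups/analysis1
```

Result folders (those starting with res) are left untouched by this helper, since each step already lets you choose a separate result path.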

Introduction

This streamlined project provides users with an easy Linux command-line interface for constructing whole-genome gene regulatory networks using transcriptomic data (from RNA sequencing) and genotypic data.

Procedures of constructing gene regulatory networks can be split into six main steps:

  1. genotype preprocess
  2. gene expression preprocess
  3. adjust for covariates
  4. cis-eQTL analysis
  5. network analysis
  6. network visualization

To use this streamlined tool, users first need to prepare the genotype data in VCF format, then set the configuration file properly and run the command for each step separately.

Quick Start

1. Prepare the DataSet

We highly recommend that you prepare the gene expression data and genotype data first and place them in a dedicated data folder, to keep each step organized, as it may involve many files.

2. Set configuration

Here we set the number of autosomes to 22, so the chromosomes we study are 1-22.

signet -s --nchr 22

We can use the command below to check the number of autosomes:

signet -s --nchr

That is, when no value is provided, we will display the value of the specified parameter. We can also use

signet -s

to display the values of all parameters. You can also reset all parameters, or a single parameter, to the default values:

signet -s --d

or

signet -s --nchr --d

3. Genotype Preprocess

For preprocessing genotype data

signet -g

4. Gene Expression Preprocess

For preprocessing transcriptomic (gene expression) data

signet -t

5. cis-eQTL Analysis

For cis-eQTL analysis.

signet -c

6. Network Analysis

For network construction.

signet -n 

7. Network Visualization

For network visualization.

signet -v 
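
The steps above can be chained in a small wrapper script. The sketch below only prints the commands by default, so it is safe to try without data; run_step, run_pipeline, and DRY_RUN are names invented for this example. It also includes signet -a (adjust for covariates, described in the Command Guide) between the preprocessing steps and the cis-eQTL analysis, on the assumption that it belongs there because it reads the output of signet -g and signet -t.

```shell
# run_step CMD...: print the command when DRY_RUN=1 (the default here), otherwise run it.
# (run_step, run_pipeline and DRY_RUN are hypothetical names, not part of SIGNET.)
run_step() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "[dry-run] $*"
    else
        "$@" || { echo "step failed: $*" >&2; return 1; }
    fi
}

run_pipeline() {
    run_step signet -s --nchr 22 &&   # set configuration
    run_step signet -g &&             # genotype preprocessing
    run_step signet -t &&             # gene expression preprocessing
    run_step signet -a &&             # adjust for covariates
    run_step signet -c &&             # cis-eQTL analysis
    run_step signet -n &&             # network analysis
    run_step signet -v                # network visualization
}

run_pipeline
```

Setting DRY_RUN=0 would execute the commands instead of printing them; individual steps may still prompt you, for example about purging existing temporary files.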

Command Guide

Please note that you have to run genotype preprocessing before gene expression preprocessing if you are using the GTEx cohort.

Settings

The signet -s command is used to look up and modify parameters in the configuration file config.ini. You don't have to modify the parameters at the very beginning, as you will have options to change your input parameters in each step.

See Configuration File in the Appendix for a detailed introduction to the configuration file.

Usage

signet -s [--PARAM] [PARAM VAL] 

Description

    --PARAM                                      list the value of parameter PARAM
    --PARAM [PARAM VAL]      modify the value of parameter PARAM to be [PARAM VAL]

Example

# list all the parameters
signet -s 
## echo: all the current parameters

# List the value of a parameter
signet -s --nchr
## echo: 22

# Replace s with settings would also work
signet -settings --nchr 

# Modify the parameter
signet -s --nchr 22
## echo: Modification applied to nchr

# Set all the parameters to default 
signet -s --d 
## echo: Set all the parameters to default 

Error input handling

# If you input the wrong format, such as "-nchr"
signet -s -nchr
## echo: The usage and description instruction.

# If you input the wrong name, such as "-nchro"
## echo: Please check the file name

Transcript-prep

(TCGA)

The command signet -t takes a matrix of base-2 log-transformed gene count data and preprocesses it. Each row holds the data for one gene and each column holds the data for one sample; the first row contains the sample names and the first column contains the gene names. Note that the last 5 rows are not considered in the analysis, since they contain ambiguous gene information by UCSC.
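
Before running the step, it can help to confirm that your matrix has the layout described above. The sketch below assumes a tab-separated file; matrix_shape is a helper name made up for this example. It reports how many gene rows remain once the header row and the last 5 rows are set aside, and how many sample columns there are.

```shell
# matrix_shape FILE: print "<genes> <samples>" for a tab-separated matrix with
# a header row of sample names, a first column of gene names, and 5 trailing
# rows that are excluded. (matrix_shape is a hypothetical helper, not part of SIGNET.)
matrix_shape() {
    awk -F '\t' '
        NR == 1 { samples = NF - 1 }          # header: gene-name column + samples
        END     { print NR - 1 - 5, samples } # drop header and the last 5 rows
    ' "$1"
}

# Example (the path is a placeholder for your own count matrix):
#   matrix_shape data/gexp-prep/TCGA-LUAD.htseq_counts.tsv
```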

In this step, we filter out genes that have total counts below 2.5 million (according to the NIH standard) and are counted in fewer than 20% of the samples, after which we apply the variance stabilizing transformation from DESeq2 to normalize the data. Furthermore, we focus only on protein-coding genes.

Usage

signet -t [--g GEXP_FILE] [--p MAP_FILE]

Description

 --g | --gexp                   gene expression file
 --p | --pmap                   GENCODE GTF file
 --restrict                     restrict the chromosomes of study
 --r | --rest                   result prefix

Result files

Output of gexp-prep will be saved to res/rest.

Example

# Show usage information
signet -t --help
## Display the help page 

# Modify the parameters
signet -t --g data/gexp-prep/TCGA-LUAD.htseq_counts.tsv \
          --p data/gexp-prep/gencode.v22.gene.gtf \
          --restrict 1

## The preprocessed gene expression result with the corresponding position file will be stored in res/rest/

(GTEx)

We adopted and modified the code from the GTEx pipeline.

Usage

signet -t [--r READS_FILE] [--tpm TPM_FILE]

Description

 --r | --read                    gene reads file in gct format
 --t | --tpm                     gene tpm file
 --g | --gtf                     GENCODE GTF file
 --rest                          result prefix

Example

# Show usage information
signet -t --help
## Display the help page 

# Modify the parameters
signet -t --reads data/gexp/GTEx_gene_reads.gct \
          --tpm data/gexp/GTEx_gene_tpm.gct \
          --gtf data/gexp-prep/gencode.v26.GRCh38.genes.gtf

## The preprocessed gene expression result with the corresponding position file will be stored in res/rest/

Geno-prep

(TCGA)

The command signet -g provides the user interface for preprocessing genotype data. We first use PLINK to conduct quality control, filtering out samples and SNPs with high missing rates and filtering out SNPs that deviate from Hardy-Weinberg equilibrium. We then use IMPUTE2 to impute missing genotypes in parallel.
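
To make the quality-control filters more concrete, the sketch below prints (without running) a PLINK 1.9 command that applies a per-individual missing rate cutoff, a per-marker missing rate cutoff, and a Hardy-Weinberg cutoff. This is not SIGNET's internal call, which this manual does not show: plink_qc_cmd is a made-up helper, the thresholds are placeholders rather than SIGNET defaults, and the output prefix qc_output is hypothetical.

```shell
# plink_qc_cmd PREFIX MIND GENO HWE OUT: print (not run) a PLINK quality-control
# command for PREFIX.ped / PREFIX.map. (plink_qc_cmd is a hypothetical helper;
# the thresholds passed below are placeholders, not SIGNET defaults.)
plink_qc_cmd() {
    echo "plink --file $1 --mind $2 --geno $3 --hwe $4 --make-bed --out $5"
}

plink_qc_cmd data/geno-prep/test 0.1 0.1 0.0001 qc_output
```

In SIGNET itself, the corresponding cutoffs are supplied through the --mind, --geno, and --hwe options of signet -g.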

Usage

signet -g [OPTION VAL] ...

Description

  --p | --ped                   ped file
  --m | --map                   map file
  --mind                        missing rate per individual cutoff
  --geno                        missing rate per marker cutoff
  --hwe                         Hardy-Weinberg equilibrium cutoff
  --nchr                        chromosome number
  --restrict                    restrict to the chromosome of interest
  --r | --ref                   reference file for imputation
  --gmap                        genomic map file
  --i | --int                   interval length for IMPUTE2
  --ncores                      number of cores
  --resg                        result prefix

Example

# Show usage information
signet -g --help
## Display the help page 

# Modify the parameters
signet -g --ped data/geno-prep/test.ped \
          --map data/geno-prep/test.map \
          --ref data/ref_panel_38/chr \
          --gmap data/gmap/chr

Result files

Two files will be generated from preprocessing the genotype data (the file names begin with signet by default; you can customize this by setting the additional flag --resg, e.g. --resg res/resg/[younameit]):

(GTEx)

The signet -g command provides the user interface for preprocessing genotype data. We first extract the genotype data for the samples that have corresponding gene expression data for a particular tissue, and then select SNPs with a count of at least 5.

Output of geno-prep will be saved under /res/resg:

Usage

signet -g [OPTION VAL] ...

Description

 --vcf0                        set the VCF file for genotype data before phasing   
 --vcf                         set the VCF file for genotype data, the genotype data is from GTEx after phasing using SHAPEIT
 --read                        set the read file for gene expression read count data in gct format
 --anno                        set the annotation file that contains the sample information
 --tissue                      set the tissue type

Example

# Set the cohort to GTEx
signet -s --cohort GTEx

# Modify the parameters
signet -g --vcf0 data/geno-prep/Geno_GTEx.vcf \
          --vcf data/genotype_after_phasing/Geno_GTEx.vcf \
          --read data/gexp/GTEx_gene_reads.gct \
          --anno data/GTEx_Analysis_v8_Annotations_SampleAttributesDS.txt \
          --tissue Lung

Result files

Output of signet -g will be saved to res/resg.

Adj

The command signet -a provides the user interface to match genotype and gene expression files, calculate principal components (PCs) for population stratification, and adjust for covariate effects using the top PCs, race, and gender. It then calculates the minor allele frequency (MAF).

Note that signet -a reads the output from signet -g and signet -t.

Output of adj will be saved under /res/resa:

(TCGA)

Usage

signet -a [--c CLINICAL_FILE]

Description

 --c | --clinical                 clinical file for your cohort
 --resa                           result prefix

Example

signet -a --c ./data/clinical.tsv

Output of adj will be saved to res/resa:

(GTEx)

Usage

signet -a [--p PHENOTYPE_FILE]

Description

 --pheno                          GTEx phenotype file
 --resa                           result prefix

--pheno: phenotype file from GTEx v8.

Example

signet -a --pheno \
./data/pheno.txt 

Cis-eqtl

Now that we have completed all necessary preprocessing, normalization, and data cleaning, we are ready to perform cis-eQTL mapping. If you want to construct a GRN with your own data (rather than TCGA or GTEx data), you should preprocess your data yourself (instead of using the functions above provided by SIGNET) and then use SIGNET from this step.

Before you start this step, please make sure that you have the following files ready:

Caution: Genes in the two gene expression files are arranged according to the order of genes in the gene position information file.

For each gene, we use an adaptive rank sum permutation test to identify its cis-eQTL as instrumental variables. The possible instrumental variables of a specific gene therefore include any genetic variants within its coding region, as well as in its upstream and downstream regions up to certain ranges, which are specified by the options --upstream and --downstream, respectively.
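
The candidate region can be pictured as a simple coordinate window. The sketch below is an illustration rather than SIGNET's implementation: in_cis_window is a made-up helper, the gene coordinates are invented, and it assumes that the variant is on the same chromosome as the gene and that upstream means lower coordinates (a forward-strand gene).

```shell
# in_cis_window POS START END UPSTREAM DOWNSTREAM
# Succeeds when POS lies in [START - UPSTREAM, END + DOWNSTREAM].
# (in_cis_window is a hypothetical helper; forward strand is assumed.)
in_cis_window() {
    lo=$(( $2 - $4 ))
    hi=$(( $3 + $5 ))
    [ "$1" -ge "$lo" ] && [ "$1" -le "$hi" ]
}

# A made-up gene spanning 10000-12000, with --upstream 1000 and --downstream 1000
for pos in 8500 9500 11000 12800 13500; do
    if in_cis_window "$pos" 10000 12000 1000 1000; then
        echo "$pos inside"
    else
        echo "$pos outside"
    fi
done
```

Only variants inside this window would be candidates for the permutation test of that gene; widening --upstream or --downstream enlarges the window accordingly.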

Usage

signet -c [OPTION VAL] ...

Description

  --gexp                        file of gene expressions adjusted for all covariates, matched with genotype data
  --gexp.withpc                 file of gene expressions adjusted for all covariates other than top PCs, matched with genotype data
  --geno                        file of genotype data matched with gene expression data
  --map                         snps map file path
  --maf                         snps maf file path
  --gene_pos                    gene position file
  --alpha | -a                  significance level for cis-eQTL
  --nperms                      number of permutations
  --upstream                    number of base pairs upstream the genetic region
  --downstream                  number of base pairs downstream the genetic region
  --resc                        result prefix
  --help | -h                   user guide

Result files

Output of cis-eQTL will be saved to res/resc:

Example

signet -c --upstream 1000 \
          --downstream 1000 \
          --nperms 100 \
          --alpha 0.05

Network

The command signet -n provides the tools for constructing a GRN using the two-stage penalized least squares (2SPLS) approach proposed by Chen et al. (2018). Note that the same set of data will be bootstrapped nboots times, and each bootstrap data set will be used to construct a GRN. The frequencies with which the regulations appear in the nboots GRNs are used to evaluate the robustness of the constructed GRN, with a higher frequency implying a more robust regulation.
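
The bootstrap frequency of a regulation is simply the number of bootstrap GRNs that contain it divided by nboots. The sketch below illustrates this on a toy table; the three-column layout (regulator, target, count), the gene names, and the helper name edge_frequency are all assumptions made for this example and do not describe SIGNET's actual output files.

```shell
# edge_frequency NBOOTS MINFREQ FILE
# FILE has lines "regulator target count" (a hypothetical layout), where count
# is the number of bootstrap GRNs that contain the regulation.
edge_frequency() {
    awk -v n="$1" -v min="$2" '
        { f = $3 / n }                                  # bootstrap frequency
        f >= min { printf "%s %s %.2f\n", $1, $2, f }   # keep robust regulations
    ' "$3"
}

# Toy data with made-up gene names and counts out of 100 bootstraps
demo_edges=$(mktemp)
printf 'geneA geneB 95\ngeneA geneC 40\ngeneD geneB 88\n' > "$demo_edges"
edge_frequency 100 0.8 "$demo_edges"
rm -f "$demo_edges"
```

With a threshold of 0.8, only the regulations seen in at least 80% of the bootstrap GRNs are printed.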

signet -n receives its input from the previous step, or from the output data of your own pipeline:

Caution: Please make sure that you are using the SLURM system. Please also don't run this step inside a container, as the Singularity container is integrated as part of the procedure.
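
Both conditions can be checked before launching the step. The sketch below is only a suggestion: inside_container is a made-up helper, and it relies on the SINGULARITY_NAME and SINGULARITY_CONTAINER environment variables that Singularity sets inside a running container, which may differ on other container runtimes.

```shell
# inside_container: succeed when the shell appears to run inside a Singularity
# container. (inside_container is a hypothetical helper; it relies on variables
# that Singularity sets inside a container.)
inside_container() {
    [ -n "${SINGULARITY_NAME:-}" ] || [ -n "${SINGULARITY_CONTAINER:-}" ]
}

if inside_container; then
    echo "Inside a container: exit it before running signet -n." >&2
elif ! command -v sbatch >/dev/null 2>&1; then
    echo "sbatch not found: signet -n needs the Slurm system." >&2
else
    echo "Environment looks ready for signet -n."
fi
```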

Usage

signet -n [OPTION VAL] ...

Description

  --net.gexp.data               gene expression data for GRN construction
  --net.geno.data               marker data for GRN construction
  --sig.pair                    significant index pairs for gene expression and markers
  --net.genename                gene name files for gene expression data
  --net.genepos                 gene position files for gene expression data
  --ncis                        maximum number of biomarkers for each gene
  --cor                         maximum correlation between biomarkers
  --nboots                      number of bootstraps datasets
  --memory                      memory in each node in GB
  --queue                       queue name
  --ncores                      number of cores to use for each node
  --walltime                    maximum walltime of the server
  --interactive                 T, F for interactive job scheduling or not
  --resn                        result prefix
  --sif                         singularity container
  --email                       send notification emails after each stage is completed, if you have mail installed in Linux and interactive=F

Result files

Example

signet -n --nboots 100 \
          --queue standby \
          --walltime 4:00:00 \
          --memory 256

Netvis

signet -v provides tools to visualize the constructed gene regulatory networks. Users can choose the bootstrap frequency threshold and the number of subnetworks to visualize.

If you would like to use the Singularity container, you should first ssh -Y $(hostname) to a server with DISPLAY; the result can then be viewed in a pop-up Firefox web browser.

Usage

signet -v [OPTION VAL] ...

Description

  --Afreq                      matrix of regulation frequencies from bootstrap results
  --freq                       bootstrap frequency for the visualization
  --ntop                       number of top sub-networks to visualize
  --coef                       coefficient of estimation for the original dataset
  --vis.genepos                gene position file
  --id                         NCBI taxonomy id, e.g. 9606 for Homo Sapiens, 10090 for Mus musculus
  --assembly                   genome assembly, e.g. hg38 for Homo Sapiens, mm10 for Mus musculus
  --tf                         transcription factor file, only specified for non-human data
  --resv                       result prefix
  --help                       usage

Result files

Example

signet -v 

Appendix

Configuration File

The config.ini file is under the main folder and saves the customized parameters for all stages of the SIGNET process. Settings in config.ini are organized into sections.

Users can change the SIGNET process by modifying the parameter settings in the configuration file.
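
The supported way to read or change a setting is signet -s, but it can be handy to inspect the file directly. The sketch below shows one way to read a value from a sectioned INI file with awk. It is an illustration only: ini_get is a made-up helper, it assumes "key = value" lines under "[section]" headers, and the section and key names in the demo are invented, so they may not match the actual layout of SIGNET's config.ini.

```shell
# ini_get FILE SECTION KEY: print the value of KEY inside [SECTION].
# (ini_get is a hypothetical helper; it assumes "key = value" lines under
# "[section]" headers, which may not match the actual config.ini layout.)
ini_get() {
    awk -v section="$2" -v key="$3" '
        /^[ \t]*\[/ {                         # section header line
            gsub(/^[ \t]*\[|\][ \t]*$/, "")
            current = $0
            next
        }
        current == section && index($0, "=") > 0 {
            k = substr($0, 1, index($0, "=") - 1)
            v = substr($0, index($0, "=") + 1)
            gsub(/^[ \t]+|[ \t]+$/, "", k)    # trim whitespace around the key
            gsub(/^[ \t]+|[ \t]+$/, "", v)    # trim whitespace around the value
            if (k == key) { print v; exit }
        }
    ' "$1"
}

# Demo on a throwaway file; the section and key names are made up.
demo_ini=$(mktemp)
printf '[basic]\nnchr = 22\n\n[network]\nnboots = 100\n' > "$demo_ini"
ini_get "$demo_ini" basic nchr
ini_get "$demo_ini" network nboots
rm -f "$demo_ini"
```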

File Structure

# the script folder saves all the code
    - script/
    - gexp_prep
    - geno_prep
    - adj
    - cis-eQTL
    - network 
    - netvis

Paper

@article{jiang2023signet,
  title={SIGNET: transcriptome-wide causal inference for gene regulatory networks},
  author={Jiang, Zhongli and Chen, Chen and Xu, Zhenyu and Wang, Xiaojian and Zhang, Min and Zhang, Dabao},
  journal={Scientific Reports},
  volume={13},
  number={1},
  pages={19371},
  year={2023},
  publisher={Nature Publishing Group UK London}
}