scGRN
- Gene regulatory network inference and analysis based on scRNA-sequencing dataThis repository contains the computational pipeline of the GRN-inference analysis based on expression of single cells. scRNA-sequencing datasets capture rich information about the transciptomic gene levels across thousands of cells and thus could be used to describe the gene interaction. All gene interactions could be depicted as gene regulatory networks (GRNs) that summarize genetic communication and regulation as a graph network. The focus of this repository is to provide an end-to-end pipeline to infer and analyze the GRNs that were computed based on scRNA-sequencing data.
As a case study, we focus on the COVID-19 patient dataset (Liao et al., 2020). The data is available on GEO (GSE145926) and contains 10xGenomics pipeline samples from the lung immune microenvironment in the bronchoalveolar lavage fluid (BALF) from 6 severe and 3 moderate COVID-19 patients and 3 healthy controls (look below).
Seurat
and cell type identification using SingleR
arboreto
and pyscenic
(particularly with GRNBoost2
algorithm)EnrichR
and clusterProfiler
, generation of functional gene networksWe will work with two types of networks:
The source code is available in the scGRN
folder which consists of the following submodules:
single_cell_processing
- processing of single cell RNA-sequencing data
sc_pipeline
- Seurat
pipeline for single cell data processing, includes quality control, normalization, dimensionality reduction, clustering, cell type identification and data aggregationregulon_pipeline
- regulon enrichment pipeline using VIPER
. Regulon is a set of genes that are regulated by a common regulatory protein, e.g. transcription factor. VIPER
measures the activity of different regulons based on the co-expression of corresponding genes. Either pyscenic
or DoRothEA
regulons can be usednetwork_inference
- gene regulatory network inference pipeline using pySCENIC
, look in Types of inferred networks section for more detailsnetwork_analysis
- analysis of inferred networks, includes exploratory data analysis (EDA), visualization, community detection and enrichment analysisThe work is heavily based on the Marenostrum supercomputer, thus the code is optimized for working in the HPC environment (i.e. Slurm). The single_cell_processing
pipeline is implemented in R, while network_inference
and network_analysis
are implemented in Python. Both pipelines are wrapped in bash scripts for easy execution. Some scripts are tailored to the Marenostrum file system (file names conventions, Slurm-based workload commands), but could be easily adapted to other HPC environments. The community analysis provided in network_analysis
could be also run at scale using Slurm manager. You can see the examples of the sbatch
and greasy
commands in the notebooks/Generate_sbatch_commands.ipynb
notebook. For more details of the pipeline usage, please look in the corresponding README.md
files in the submodules.
The network_analysis
pipeline is implemented as a Python package that the user can use to analyze the inferred networks. It includes utilities for I/O operations, graph analysis, data aggregation and visualization. As an example, to read the patient sample metadata:
import scGRN
# Setting file system, also defined in config.py
_PROJ_HOME = (
"/gpfs/projects/bsc08/shared_projects/scGRN_analysis"
)
_FMETA = (
f"{_PROJ_HOME}/Data_home/data/GSE145926_RAW/metadata.tsv"
)
_DATA_HOME = f"{_PROJ_HOME}/Data_home/res/covid_19"
# Loading full metadata for all patients
full_meta = scGRN.ana.get_meta(_DATA_HOME, _FMETA)
print(full_meta.shape)
print(full_meta.columns)
(12, 4)
Index(['group', 'file', 'num_cells', 'Macrophage', 'T_cells', 'DC',
'Pre-B_cell_CD34-', 'Monocyte', 'NK_cell', 'B_cell', 'Epithelial_cells',
'BM', 'Pro-B_cell_CD34+', 'HSC_-G-CSF', 'CMP', 'Neutrophils', 'GMP',
'Erythroblast', 'Gametocytes', 'Neurons', 'Fibroblasts',
'Smooth_muscle_cells', 'Hepatocytes', 'Keratinocytes', 'Pro-Myelocyte'],
dtype='object')
Let's display some patient samples and their cell type composition:
print(full_meta)
Output:
id | group | file | num_cells | Macrophage | T_cells | DC | Pre-B_cell_CD34- | Monocyte | NK_cell | B_cell | Epithelial_cells | BM | Pro-B_cell_CD34+ | HSC_-G-CSF | CMP | Neutrophils | GMP | Erythroblast | Gametocytes | Neurons | Fibroblasts | Smooth_muscle_cells | Hepatocytes | Keratinocytes | Pro-Myelocyte |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
C51 | C | /gpfs/projects/bsc08/shared_projects/scGRN_analysis/Data_home/data/GSE145926_RAW/GSM4475048_C51_filtered_feature_bc_matrix.h5 | 9431 | 8348 | 608 | 215 | 98 | 70 | 68 | 9 | 7 | 4 | 3 | 1 | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan |
C52 | C | /gpfs/projects/bsc08/shared_projects/scGRN_analysis/Data_home/data/GSE145926_RAW/GSM4475049_C52_filtered_feature_bc_matrix.h5 | 8696 | 8611 | 13 | 23 | 3 | 14 | 5 | 2 | 25 | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan |
C100 | C | /gpfs/projects/bsc08/shared_projects/scGRN_analysis/Data_home/data/GSE145926_RAW/GSM4475050_C100_filtered_feature_bc_matrix.h5 | 907 | 338 | 411 | 45 | 5 | 51 | 20 | 12 | 18 | nan | 2 | nan | 5 | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan |
C141 | M | /gpfs/projects/bsc08/shared_projects/scGRN_analysis/Data_home/data/GSE145926_RAW/GSM4339769_C141_filtered_feature_bc_matrix.h5 | 1449 | 197 | 932 | 48 | 5 | 86 | 96 | 33 | 37 | nan | nan | nan | 3 | 11 | 1 | nan | nan | nan | nan | nan | nan | nan | nan |
C142 | M | /gpfs/projects/bsc08/shared_projects/scGRN_analysis/Data_home/data/GSE145926_RAW/GSM4339770_C142_filtered_feature_bc_matrix.h5 | 1790 | 482 | 996 | 39 | 13 | 67 | 113 | 20 | 38 | 1 | 1 | 2 | 3 | 14 | 1 | nan | nan | nan | nan | nan | nan | nan | nan |
C144 | M | /gpfs/projects/bsc08/shared_projects/scGRN_analysis/Data_home/data/GSE145926_RAW/GSM4339772_C144_filtered_feature_bc_matrix.h5 | 452 | 37 | 181 | 41 | 8 | 73 | 34 | 14 | 54 | 2 | 1 | 1 | 1 | 3 | nan | 2 | nan | nan | nan | nan | nan | nan | nan |
C143 | S | /gpfs/projects/bsc08/shared_projects/scGRN_analysis/Data_home/data/GSE145926_RAW/GSM4339771_C143_filtered_feature_bc_matrix.h5 | 14933 | 2048 | 1394 | 154 | 33 | 7489 | 562 | 72 | 145 | nan | 1 | 24 | 1 | 3005 | 2 | nan | 2 | 1 | nan | nan | nan | nan | nan |
C145 | S | /gpfs/projects/bsc08/shared_projects/scGRN_analysis/Data_home/data/GSE145926_RAW/GSM4339773_C145_filtered_feature_bc_matrix.h5 | 15550 | 6960 | 719 | 859 | 46 | 5616 | 421 | 58 | 207 | nan | 1 | 26 | nan | 635 | 2 | nan | nan | nan | nan | nan | nan | nan | nan |
C146 | S | /gpfs/projects/bsc08/shared_projects/scGRN_analysis/Data_home/data/GSE145926_RAW/GSM4339774_C146_filtered_feature_bc_matrix.h5 | 2545 | 247 | 61 | 36 | nan | 127 | 14 | 3 | 417 | nan | nan | 2 | nan | 1632 | nan | nan | 1 | nan | 2 | 1 | 1 | 1 | nan |
C148 | S | /gpfs/projects/bsc08/shared_projects/scGRN_analysis/Data_home/data/GSE145926_RAW/GSM4475051_C148_filtered_feature_bc_matrix.h5 | 1165 | 98 | 122 | 24 | nan | 641 | 36 | 8 | 52 | nan | 1 | nan | 3 | 178 | 1 | nan | 1 | nan | nan | nan | nan | nan | nan |
C149 | S | /gpfs/projects/bsc08/shared_projects/scGRN_analysis/Data_home/data/GSE145926_RAW/GSM4475052_C149_filtered_feature_bc_matrix.h5 | 1936 | 176 | 681 | 80 | 1 | 691 | 59 | 38 | 41 | nan | nan | 5 | nan | 164 | nan | nan | nan | nan | nan | nan | nan | nan | nan |
C152 | S | /gpfs/projects/bsc08/shared_projects/scGRN_analysis/Data_home/data/GSE145926_RAW/GSM4475053_C152_filtered_feature_bc_matrix.h5 | 2557 | 466 | 397 | 41 | 176 | 795 | 74 | 317 | 201 | 6 | 40 | nan | nan | 30 | 13 | nan | nan | nan | nan | nan | nan | nan | 1 |
You can also visualize the cell count for different patients and cell types using various visualization utilities:
scGRN.ana.plot_avail_cell_types(
save_as="EDA_cell_type_dist.pdf"
)
For more examples of use look into examples
and notebooks
.
For further information please contact by mail at mkriukov.job@gmail.com
.