scRNA-seq Analysis Pipeline
Installation
Dependencies
- NextFlow >= 23.10.1
- Apptainer(singularity); tested with version 1.2.5-1.el7
Script and container download
Script
Request has to be asked beforehand, or can be run on the server under FOLDERNAME
git clone https://github.com/parmejohn/scRNA-pipeline-UofM.git
Apptainer (Singularity)
Image can be downloaded from Sylabs or pulled using
apptainer pull --arch amd64 library://parmejohn/uofm/scrnaseq_singularity:latest
Apptainer and singularity is the same tool, but apptainer is the new naming scheme. Also to note, when using apptainer, library may not be set up which could lead to pull errors if using the command above.
Usage
Run the script from the downloaded directory
nextflow run scRNA_pipeline.nf \
--indir <input folder to sorted cellranger counts> \
--outdir <output folder> \
--species <species name> \
--bind <bind file path locations> \
--reference_seurat [path(s) to labelled reference seurat object] \
--beginning_cluster [inferred earliest celltype] \
--clusters_optimal [optimal number of clusters] \
--resolution [0-9*] \
--run_escape [true/false] \
--pathways [PATHWAY1,PATHWAY2,...] \
--run_sling [true/false] \
--co_conditions [X1,X2,...] \
--reduced_dim [integrated.cca | integrated.mnn | harmony] \
--test_data [0-9*] \
-with-apptainer path_to_image/scrnaseq_singularity_latest.sif
Arguments
- indir: Input directory with folders of condition names, which have cellranger count results in them for each sample (REQUIRED)
- outdir: Where to place analysis/ folder (REQUIRED)
- species: Species name; Must lowercase with no spaces eg. musmusculus homosapiens (REQUIRED)
- bind: Paths where input or output data is moving. This is because apptainer(singularity) without root permission needs explicit file paths. Multiple paths can be defined with a ',' to seperate them. (REQUIRED)
- reference_seurat: Path (or paths seperated by ',') to a labelled reference seurat object. The object(s) will be used to automatically label your own dataset. Note, you should still manually check your clusters to see if the identification was what was expected
- beginning_cluster: If you know what celltype is would be the earliest in a differentiation pathway.
- clusters_optimal: Number of optimal clusters/dimensionality for machine learning methods. If set to 0, will be calculated automatically through NbClust() Calinski-Harabasz (ch) index.
- resolution: Desired resolution, increasing this will increase the number of clusters and vice versa.
- run_escape: Performs escape analysis to provide single-cell GSEA results. IMPORTANT: This analysis takes a substantial amount of time and memory, so ensure that you provide enough of both before setting to 'True'. With Nextflow capabilities as well, it is fine to run the pipeline once without escape and then with '-resume', since the pipeline will cache the rest of the data.
- pathways: List of desired pathways/phrases separated by ',' to search on the Gene onotology: biological processes for MsigDB for the escape analysis. Setting this option will make it so the top 5 pathways will not be printed during escape.
- run_sling: Performs slingshot trajectory inference analysis
- co_conditions: List of co-conditions for each sample separated by ','. Your sample name should contain an underscore for each condition, including the main one being tested for. For instance, SAMPLE1_MUT_DAY50_treated, could have the arguement of '-co_condition time,treatment'
- reduced_dim: Batch correction method to carry out while integrating samples. Possible arguements are "integrated.cca (CCA), integrated.mnn (FastMNN), or harmony"
- DEFAULT: 'integrated.cca'
- test_data: Number of cells to downsample for, if 0 take all cells into account
- with-apptainer: path to downloaded image (REQUIRED)
- When this option is not set, you will use your native environment, which may be missing packages, dependencies, or have version mismatches.
Outputs
Analysis folder:
analysis/
├── data
│ ├── optimal_clusters.txt
│ ├── qc
│ │ ├── CONDITION1
│ │ │ ├── [sample_1]_soupx
│ │ │ │ ├── barcodes.tsv
│ │ │ │ ├── genes.tsv
│ │ │ │ └── matrix.mtx
│ │ │ └── ...
│ │ └── ...
│ ├── sce_slingshot.rds
│ ├── sc_integrated_milo_traj.rds
│ ├── se_filtered_list.rds
│ ├── se_filtered_singlets_list.rds
│ ├── se_integrated_auto_label.rds
│ ├── se_integrated_dimred.rds
│ ├── se_integrated_escape_norm.rds
│ ├── se_integrated_escape.rds
│ ├── se_integrated.rds
│ ├── se_list_raw.rds
│ ├── se_markers_presto_integrated.txt
│ └── ti
│ ├── ti_gene_clusters_slingPseudotime_\*.txt
│ ├── ...
└── plots
├── conserved_marker_unlabelled.pdf
├── da
│ ├── milo_DA_DE_heatmap_\*.pdf
│ ├── milo_DA_fc_distribution.pdf
│ ├── milo_DA_umap.pdf
│ ├── milo_pval_distribution.pdf
│ └── milo_volcano_plot.pdf
├── deseq2
│ ├── deseq2_cluster_[CLUSTER]_[CONDITION1]_vs_[CONDITION2].pdf
│ ├── ...
├── gsea
│ ├── comparative
│ │ ├── gsea_cluster_[CLUSTER]_[CONDITION1]_vs_[CONDITION2].pdf
│ │ ├── ...
│ └── escape
│ ├── escape_heatmap_top5.pdf
│ ├── [CLUSTER]
│ │ ├── GEYSER_PLOT_[1_path].pdf
│ │ ├──...
│ ├── ...
├── integrated_elbow_plot.pdf
├── integrated_umap_grouped.pdf
├── integrated_umap_labelled.pdf
├── integrated_umap_split.pdf
├── integrated_umap_unlabelled.pdf
├── qc
│ ├── [sample1]_soupx_nGenes_nUMI.pdf
│ ├── [sample1]_soupx_percent_mt.pdf
│ ├── ...
├── reference_marker_mapping_heatmap.pdf
├── ti
│ ├── ti_de_slingPseudotime_*.pdf
│ └── ti_start_smooth.pdf OR ti_no_start_not_smooth.pdf
└── top3_markers_expr_heatmap.pdf
Data descriptions:
- optimal_clusters.txt: Contains the number of optimal clusters used for machine learning algorithms throughout the pipline
- qc/: Ambient corrected Cellranger counts using SoupX
- sce_slingshot.rds: SingleCellExperiment R object containing pseudotime calculations for trajectory inference
- sc_integrated_milo_traj.rds: Integrated Seurat object with miloR calculated neighborhoods
- se_filtered_list.rds: Basic QC'd list of Seurat objects; removal of low quality cells from MAD cutoffs of nGenes, nUMI, and percentage of mitochondrial reads
- se_filtered_singlets_list.rds: List of Seurat objects with removed doublets from ScDblFinder
- se_integrated_auto_label.rds: Integrated Seurat object from automatic labelling from reference Seurat object(s)
- se_integrated_dimred.rds: Integrated Seurat object with Seurat dimensional reduction techniques saved (FindClusters, RunPCA, RunUMAP)
- se_integrated_escape_norm.rds: Integrated Seurat object with UCell enrichment scores calculated for each cell, with normalized in respect to the number of genes and unormalized values
- se_integrated.rds: Seurat object that has all samples over all conditions integrated in order to be comparable downstream
- se_list_raw.rds: Seurat object list of corrected for ambient RNA
- se_markers_presto_integrated.txt: Results from FindAllMarkers (presto implementation), where differential expression (1 cluster against the rest) is calculated for each gene.
- ti/:
- ti_gene_clustersslingPseudotime*.txt: Gene names for each cluster in the differential expressed genes according to pseudotime
Plot descriptions:
- conserved_marker_unlabelled.pdf: Conserved markers between conditions; aids in identifying cluseters manually
- da/: Differential abundance analysis through the miloR package. This will test which condition is more prevelant in the different clusters
- milo_DA_DEheatmap*.pdf: Differentially expressed genes that meet a differential abundance cutoff between conditions
- milo_DA_fc_distribution.pdf: Beeswarm plot showing fold-change distribution
- milo_DA_umap.pdf: Single-cell clustered UMAP from Seurat, which is now labelled with the differential abundance across conditions. Each vertice is the amount of cells in a given neighborhood, and each edge is the number of cells shared between the neighbourhoods.
- milo_pval_distribution.pdf: Uncorrected P-value distribution, should have a right skew (anti-conservative) distribution
- milo_volcano_plot.pdf: Visualizes the the SpatialFDR values to see if any neighbourhoods make the cutoff
- deseq2/: Finds differentially expressed genes between conditions for each cluster; calculated with pseudobulk
- gsea/: Gene-set enrichment analyses
- comparative/: DESeq2 (pseudobulk) GSEA results between conditions for each cluster
- escape/: UCell GSEA results calculated on clusters of cells one by one
- escape_heatmap_top5.pdf: Aggregated enrichment scores for each cluster, for the top five GO pathways
- GEYSERPLOT[path].pdf: Shows individual GO paths with individual cells for each cluster. Central dot = median. Thick/thin lines = 66/95% interval summaries
- integrated_elbow_plot.pdf: Rank PCs based on variance percentage, view the elbow for choosing an optimal clustering number
- integrated_umap_grouped.pdf: UMAP of integrated samples separated by conditions. Check if integrated properly and for batch effects
- integrated_umap_labelled.pdf: UMAP of integrated samples with automatic labelling from reference Seurat object(s)
- integrated_umap_split.pdf: UMAP of integrated samples separated by samples
- integrated_umap_unlabelled.pdf: UMAP of integrated samples
- qc/: Preliminary QC graphs to remove low quality cells through MAD values
- [sample1]_soupx_nGenes_nUMI.pdf: Filtered by number of genes and number of UMIs. High amount = potential multiplets; Low amount = Lysed or ambient RNA cells
- [sample1]_soupx_percent_mt.pdf: Filtered by the percentage of mitochodrial reads in the cell. High amount = Lysed cell
- reference_marker_mapping_heatmap.pdf: If provided reference Seurat object(s), a prediction score heatmap will be made to visualize the cell labelling to the unnamed clusters
- ti/:
- ti_deslingPseudotime*.pdf = Differentially expressed genes for across pseudotime values calculated by slingshot
- ti_start_smooth.pdf: Trajectory inference with smooth principal curves; produced when a starting cluster/celltype is explicitly mentioned
- ti_no_start_not_smooth.pdf: Trajectory inference, useful for trying to figure out if cells are differentiating from one cluster to another; direction is not known though
- top3_markers_expr_heatmap.pdf: Top 3 genes for each cluster from FindAllMarkers call
Example
Cisplatin-treated and non-treated mice data.
Counts folder:
counts
├── CTRL
│ ├── pilot_study_C1 -> /home/projects/CIO/yard/run_cellranger_count/pilot_study_C1
│ └── pilot_study_C2 -> /home/projects/CIO/yard/run_cellranger_count/pilot_study_C2
└── TREAT
├── pilot_study_T1 -> /home/projects/CIO/yard/run_cellranger_count/pilot_study_T1
└── pilot_study_T2 -> /home/projects/CIO/yard/run_cellranger_count/pilot_study_T2
Above I am using symlinks to save space (the symlink path also needs to be included in the binded paths). Ensure that REAL paths are used, an error will occur if attempting to use a symlink path name since apptainer/singularity does not run with root access and needs to be explicitly told where the files can be found.
nextflow run scRNA_pipeline.nf \
--indir /home/projects/sc_pipelines/counts/ \
--outdir /home/projects/sc_pipelines/test_run_nf_1 \
--species musmusculus \
--bind /home/projects/,/home/projects/CIO/yard/run_cellranger_count \
--reference_seurat /home/projects/sc_pipelines/analysis/data/se_michalski.rds \
--beginning_cluster Osteoblasts \
-with-apptainer /home/phamj7@med.umanitoba.ca/bin/scrnaseq_singularity.sif
References
Bioinformatics Tools
- batchelor: Haghverdi, L., Lun, A. T. L., Morgan, M. D. & Marioni, J. C. Batch effects in single-cell RNA-sequencing data are corrected by matching mutual nearest neighbors. Nat Biotechnol 36, 421–427 (2018).
- Comparing batch corrections: Luecken, M. D. et al. Benchmarking atlas-level data integration in single-cell genomics. Nat Methods 19, 41–50 (2022).
- DESeq2: Love, M. I., Huber, W. & Anders, S. Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2. Genome Biology 15, 550 (2014).
- Escape: Borcherding, N. et al. Mapping the immune environment in clear cell renal carcinoma by single-cell genomics. Commun Biol 4, 1–11 (2021).
- fgsea: Korotkevich, G. et al. Fast gene set enrichment analysis. 060012 Preprint at https://doi.org/10.1101/060012 (2021).
- harmony: Korsunsky, I. et al. Fast, sensitive and accurate integration of single-cell data with Harmony. Nat Methods 16, 1289–1296 (2019).
- miloR: Dann, E., Henderson, N. C., Teichmann, S. A., Morgan, M. D. & Marioni, J. C. Differential abundance testing on single-cell data using k-nearest neighbor graphs. Nat Biotechnol 40, 245–253 (2022).
- presto: Korsunsky, I., Nathan, A., Millard, N. & Raychaudhuri, S. Presto scales Wilcoxon and auROC analyses to millions of observations. 653253 Preprint at https://doi.org/10.1101/653253 (2019).
- scDblFinder: Germain, P.-L., Lun, A., Meixide, C. G., Macnair, W. & Robinson, M. D. Doublet identification in single-cell sequencing data using scDblFinder. Preprint at https://doi.org/10.12688/f1000research.73600.2 (2022).
- Seurat: Hao, Y. et al. Dictionary learning for integrative, multimodal and scalable single-cell analysis. Nat Biotechnol 42, 293–304 (2024).
- Slingshot: Street, K. et al. Slingshot: cell lineage and pseudotime inference for single-cell transcriptomics. BMC Genomics 19, 477 (2018).
- SoupX: Young, M. D. & Behjati, S. SoupX removes ambient RNA contamination from droplet-based single-cell RNA sequencing data. GigaScience 9, giaa151 (2020).
- Tempora: Tran, T. N. & Bader, G. D. Tempora: Cell trajectory inference using time-series single-cell RNA sequencing data. PLOS Computational Biology 16, e1008205 (2020).
- tradeSeq: Van den Berge, K. et al. Trajectory-based differential expression analysis for single-cell sequencing data. Nat Commun 11, 1201 (2020).