title: "Template code for single-cell analysis using Bioconductor"
author: "Kevin Rue-Albrecht"
date: "05/10/2022"
output: html_document
knitr::opts_chunk$set(echo = TRUE)
Exercise
Import scRNA-seq data and create a SingleCellExperiment object
Import the filtered matrix into R; use DropletUtils.
Note: use the samples= argument of the DropletUtils::read10xCounts() function to give a memorable name to each sample.
Check the difference without using the samples argument.
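The import step described above can be sketched as follows. This is a minimal, hedged example: the course data directory is not given here, so it writes a small simulated matrix to a temporary directory with `DropletUtils::write10xCounts()` and reads it back. The names `toy_counts`, `toy_dir` and `toy_sample` are made up for illustration.

```r
library(DropletUtils)
library(Matrix)

# Toy data standing in for a real Cell Ranger "filtered_feature_bc_matrix" directory (assumption).
set.seed(1)
toy_counts <- Matrix::Matrix(
  matrix(rpois(1000, lambda = 5), nrow = 100, ncol = 10),
  sparse = TRUE
)
toy_dir <- tempfile("filtered_feature_bc_matrix_")
DropletUtils::write10xCounts(
  toy_dir, toy_counts,
  barcodes    = paste0("BARCODE-", seq_len(ncol(toy_counts))),
  gene.id     = paste0("ENSG", seq_len(nrow(toy_counts))),
  gene.symbol = paste0("GENE", seq_len(nrow(toy_counts)))
)

# Named vector: the name is used as the sample label in colData(sce)$Sample.
sce_named <- DropletUtils::read10xCounts(samples = c(toy_sample = toy_dir), col.names = TRUE)
# Unnamed vector: the file path itself is used as the sample label.
sce_unnamed <- DropletUtils::read10xCounts(samples = toy_dir, col.names = TRUE)

unique(sce_named$Sample)
unique(sce_unnamed$Sample)
```

With real data, replace `toy_dir` with the path to the filtered matrix and pick a sample name that is easy to recognise in plots.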
Print the object.
What can you tell about its contents?
sce
Answer:
What can you tell from the object metadata?
Note: slots of SummarizedExperiment objects are typically accessed using functions of the same name, e.g. metadata().
metadata(sce)
rowData(sce)
colData(sce) # check that each barcode has been assigned to a sample
Answer: not much; only the file path is stored.
Exercise
Quality control
Compute and visualise quality control metrics (library size, genes detected, mitochondrial fraction); use scuttle and/or scater.
Identify mitochondrial genes and pass those to the subsets argument of the scuttle::addPerCellQC() function.
What is the return value?
Where are the quality metrics stored?
What is the difference with scuttle::perCellQCMetrics()?
# Indices of mitochondrial genes, identified by gene symbol ("MT-" prefix).
# Row names are Ensembl IDs rather than gene names, so integer indices are more
# reliable for `subsets` (set value = TRUE to check which symbols were matched).
is.mito <- grep("^MT-", rowData(sce)$Symbol, value = FALSE)
library(scuttle)
sce <- scuttle::addPerCellQC(sce, subsets = list(MT = is.mito))
rowData(sce)
colData(sce) # this is where the MT metrics are stored (plus some overall QC metrics)
# 13 mitochondrial genes; in most cells 11 or 12 of them (about 90%) are detected
Answer:
Visualise library size, genes detected and mitochondrial fraction as three violin plots; use ggplot2.
Filter cells, keeping those with more than 4,500 UMI, less than 15% mitochondrial UMI, and more than 1,500 genes detected. These thresholds were chosen by inspecting the violin plots.
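The violin plots and the filter described above could look like the sketch below. It runs on simulated data from `scuttle::mockSCE()` rather than the course data, and it pretends that the first 10 genes are mitochondrial; both are assumptions made only so the example is self-contained. The column names `sum`, `detected` and `subsets_MT_percent` are those produced by `scuttle::addPerCellQC()` with a subset named `MT`.

```r
library(SingleCellExperiment)
library(scuttle)
library(ggplot2)

set.seed(1)
sce_toy <- scuttle::mockSCE()                                          # simulated data
sce_toy <- scuttle::addPerCellQC(sce_toy, subsets = list(MT = 1:10))   # fake "mitochondrial" subset

# One violin per metric, each with its own y-axis scale.
qc <- as.data.frame(colData(sce_toy))
qc_long <- data.frame(
  metric = rep(c("sum", "detected", "subsets_MT_percent"), each = nrow(qc)),
  value  = c(qc$sum, qc$detected, qc$subsets_MT_percent)
)
p <- ggplot(qc_long, aes(x = metric, y = value)) +
  geom_violin() +
  facet_wrap(~ metric, scales = "free")
print(p)

# Thresholds from the exercise text; adjust them after looking at the plots.
keep <- qc$sum > 4500 & qc$subsets_MT_percent < 15 & qc$detected > 1500
sce_toy_filtered <- sce_toy[, keep]
table(keep)
```

The thresholds are specific to the course data set, so on the simulated data they only demonstrate the mechanics of subsetting columns with a logical vector.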
Similarly, use scuttle::perFeatureQCMetrics() or scuttle::addPerFeatureQC() to compute per-feature quality metrics, and visualise those metrics.
sce <- scuttle::addPerFeatureQC(sce)
rowData(sce)
## ggplot2
library(ggplot2)
library(tibble)
library(magrittr) # provides %>%
# Distribution of mean expression across features
plot1 <- rowData(sce) %>%
  as_tibble() %>%
  ggplot() +
  geom_violin(aes(x = "Sample", y = mean)) +
  labs(x = "mean expression", y = "Value")
# Distribution of the percentage of cells in which each feature is detected
plot2 <- rowData(sce) %>%
  as_tibble() %>%
  ggplot() +
  geom_violin(aes(x = "Sample", y = detected)) +
  labs(x = "percentage of cells with non-zero counts", y = "Value")
# Relationship between detection rate and (log) mean expression
plot3 <- rowData(sce) %>%
  as_tibble() %>%
  ggplot() +
  geom_point(aes(y = log(mean), x = detected)) +
  labs(y = "log(mean expression)", x = "percentage of cells with non-zero counts")
cowplot::plot_grid(plot1, plot2, plot3, nrow = 1)
Exercise step 3. Normalisation
Convert the counts into normalized expression values to eliminate cell-specific biases (e.g., in capture efficiency); use scuttle and/or scran.
Display the names of the assays available after that step.
Note: use scuttle::logNormCounts() to compute log-normalised counts.
What is the return value?
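Where can you find the normalised counts?
Note: how can you tell whether the normalisation was effective? Compare with https://osca.bioconductor.org/feature-selection.html#quantifying-per-gene-variation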
library(DelayedMatrixStats)
library(ggplot2)
library(tibble)
# Compute log-normalised counts; the result is stored in the "logcounts" assay.
sce <- scuttle::logNormCounts(sce)
assayNames(sce)
# Mean-variance relationship of the raw counts
x <- DelayedArray(assay(sce, "counts"))
plot_data <- tibble(
  mean = DelayedMatrixStats::rowMeans2(x),
  variance = DelayedMatrixStats::rowVars(x)
)
plot_counts <- ggplot(plot_data, aes(x = mean, y = variance)) +
  geom_point()
# Mean-variance relationship of the log-normalised counts
x <- DelayedArray(assay(sce, "logcounts"))
plot_data <- tibble(
  mean = DelayedMatrixStats::rowMeans2(x),
  variance = DelayedMatrixStats::rowVars(x)
)
plot_logcounts <- ggplot(plot_data, aes(x = mean, y = variance)) +
  geom_point()
cowplot::plot_grid(plot_counts, plot_logcounts, nrow = 1)
Answer:
When would you rather use scuttle::computePooledFactors instead?
Answer:
Exercise
Feature selection
Select features for downstream analyses, e.g. highly variable genes; use scran.
Use scran::modelGeneVar() to model the variance of the log-expression profiles for each gene.
What is the output?
library(scran)
dec <- scran::modelGeneVar(sce) #assay.type = "logcounts" is default for S4 objects
dec
Answer:
Visualise the relation between the mean expression of each gene and the total / biological / technical variance of each gene.
How do you interpret those different values?
ggplot(as_tibble(dec)) +
  geom_point(aes(mean, total), color = "black") + # total variance
  geom_point(aes(mean, bio), color = "blue") +    # biological variance, computed as total minus technical
  geom_point(aes(mean, tech), color = "red")      # technical variance (fitted trend)
Answer:
Use scran::getTopHVGs() to identify highly variable genes (e.g., top 10%).
What is the output?
How many genes do you identify?
Where are those genes located in the mean vs. (biological) variance plot?
What happens to this plot if you set more stringent thresholds to define highly variable genes?
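A hedged sketch of the HVG selection and of where the selected genes fall in the mean vs. biological variance plot. It uses simulated data from `scuttle::mockSCE()` (an assumption, so the block runs on its own); the `_toy` object names are illustrative only.

```r
library(scuttle)
library(scran)
library(ggplot2)

set.seed(1)
sce_toy <- scuttle::logNormCounts(scuttle::mockSCE())   # simulated data, log-normalised
dec_toy <- scran::modelGeneVar(sce_toy)

# getTopHVGs() returns a character vector of gene names.
hvg_toy    <- scran::getTopHVGs(dec_toy, prop = 0.1)    # top 10%
hvg_strict <- scran::getTopHVGs(dec_toy, prop = 0.05)   # more stringent threshold
length(hvg_toy)
length(hvg_strict)

# Highlight the selected genes in the mean vs. biological variance plot.
plot_data <- data.frame(
  mean = dec_toy$mean,
  bio  = dec_toy$bio,
  hvg  = rownames(dec_toy) %in% hvg_toy
)
p <- ggplot(plot_data, aes(mean, bio, colour = hvg)) +
  geom_point()
print(p)
```

Swapping `hvg_toy` for `hvg_strict` in the `hvg` column shows how a stricter threshold shrinks the highlighted set towards the genes with the largest biological component.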
Exercise
Dimensionality reduction
Apply PCA; use scater or BiocSingular.
Set a seed to control reproducibility.
List the names of dimensionality reduction results available.
Note: only give the set of highly variable genes to the scater::runPCA() function, to save time, memory, and to focus on biologically informative genes in the data set.
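# Top 10% most variable genes, as requested in the feature selection step
hvg <- scran::getTopHVGs(dec, prop = 0.1)
set.seed(1234)
sce <- scater::runPCA(sce,
  subset_row = hvg)
reducedDimNames(sce)
percent.var <- attr(reducedDim(sce), "percentVar")
# PCAtools was not available in this environment; if installed, it can locate the elbow:
# chosen.elbow <- PCAtools::findElbowPoint(percent.var)
plot(percent.var)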
Apply UMAP and t-SNE successively on the output of the PCA.
List the names of dimensionality reduction results available each time.
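set.seed(1234)
sce <- scater::runUMAP(sce,
  ncomponents = 2,   # default
  dimred = "PCA",
  n_dimred = 1:20)   # use the first 20 principal components
reducedDimNames(sce)
set.seed(1234)
sce <- scater::runTSNE(sce,
  dimred = "PCA")    # run t-SNE on the PCA output rather than on the expression values
reducedDimNames(sce)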
Visualise the scatterplot of cells produced by each of those dimensionality reduction methods.
Consider colouring points with quality control metrics.
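One way to draw such a scatterplot is `scater::plotReducedDim()`, sketched below on simulated data from `scuttle::mockSCE()` (an assumption, to keep the block self-contained). Only the PCA layout is shown so that the example does not depend on the UMAP or t-SNE back-ends; with the course object, `dimred = "UMAP"` or `dimred = "TSNE"` should work the same way once those results exist.

```r
library(scuttle)
library(scater)

set.seed(1234)
sce_toy <- scuttle::mockSCE()               # simulated data
sce_toy <- scuttle::addPerCellQC(sce_toy)   # adds "sum" and "detected" to colData
sce_toy <- scuttle::logNormCounts(sce_toy)
sce_toy <- scater::runPCA(sce_toy)
reducedDimNames(sce_toy)

# Colour cells by a QC metric stored in colData.
p <- scater::plotReducedDim(sce_toy, dimred = "PCA", colour_by = "detected")
print(p)
```

Colouring by `sum`, `detected` or a mitochondrial percentage column is a quick check for whether a layout is driven by technical rather than biological variation.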
Bonus point
Use scran::denoisePCA() to remove principal components that correspond to technical noise, and compare downstream t-SNE or UMAP with those obtained before de-noising.
Name the output sce_denoise.
How many components remain after denoising?
Visualise a UMAP of the denoised PCA and compare.
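A minimal sketch of the de-noising step, again on simulated data from `scuttle::mockSCE()` (an assumption). It passes the `modelGeneVar()` output as the `technical` argument and restricts the PCA to the highly variable genes; the `_toy` names are illustrative.

```r
library(scuttle)
library(scran)

set.seed(1234)
sce_toy <- scuttle::logNormCounts(scuttle::mockSCE())   # simulated data
dec_toy <- scran::modelGeneVar(sce_toy)
hvg_toy <- scran::getTopHVGs(dec_toy, prop = 0.1)

# denoisePCA() returns the object with a "PCA" entry in reducedDims,
# truncated to the number of components judged to carry biological signal.
sce_denoise <- scran::denoisePCA(sce_toy, technical = dec_toy, subset.row = hvg_toy)
n_kept <- ncol(reducedDim(sce_denoise, "PCA"))
n_kept
```

On the course data, `scater::runUMAP(sce_denoise, dimred = "PCA")` can then be compared with the UMAP computed before de-noising.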
Exercise
Clustering
Cluster cells using scran.
Start with scran::getClusteredPCs() to cluster cells after using varying number of PCs, and pick the number of PCs using a heuristic based on the number of clusters.
Use scran::buildSNNGraph() and igraph::cluster_louvain() with that "ideal" number of PCs.
Assign the cluster label to a cell metadata column named "label".
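The PC-selection step can be sketched as below, on simulated data from `scuttle::mockSCE()` (an assumption). `getClusteredPCs()` takes the matrix of PCA coordinates and stores its suggestion in the metadata of the returned DataFrame; a value like the `d = 21` used in the graph-building code presumably came from a call of this kind on the course data.

```r
library(scuttle)
library(scater)
library(scran)

set.seed(1234)
sce_toy <- scuttle::logNormCounts(scuttle::mockSCE())   # simulated data
sce_toy <- scater::runPCA(sce_toy, ncomponents = 20)

# Cluster repeatedly, using the first 5, 6, ..., 20 PCs in turn.
pcs <- reducedDim(sce_toy, "PCA")
choices <- scran::getClusteredPCs(pcs)
choices

# Number of PCs suggested by the heuristic.
n_pcs <- metadata(choices)$chosen
n_pcs
```

Simulated data have no real cluster structure, so the suggested value here is not meaningful; the point is only where to find it.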
g <- scran::buildSNNGraph(sce,
  # use.dimred = "PCA" is the alternative to d (it reuses the stored PCA)
  d = 21) # number of PCs, taken from the previous step
colData(sce)[["cluster_louvain"]] <- factor(igraph::cluster_louvain(g)$membership)
Visualise the assigned cluster on your preferred dimensionality reduction layout.
Note: Dimensionality reduction and clustering are two separate methods both based on the PCA coordinates.
They may not always agree with each other, often helping to diagnose over- or under-clustering, as well as parameterisation of dimensionality reduction methods.
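Bonus point
Use scran::quickCluster(); identify key parameters and compare results.
Exercise
Cluster markers
Use scran::findMarkers() to identify markers for each cluster.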
Display the metadata of markers for the first cluster.
markers <- scran::findMarkers(sce,
  groups = sce$cluster_louvain, # cluster labels assigned above
  test.type = "t") # default: t-test (how valid is the assumption of normality?)
# Other options: "wilcox" (non-parametric; fewer genes but more confidence) and "binom".
# scRNA-seq data are so sparse that a rank-based test may be more appropriate.
# The Wilcoxon test does not give fold changes, but these are unreliable anyway.
# The result is a SimpleList: more compact than a list, but with the same behaviour.
markers[[1]]
Visualise the expression of selected markers:
Exercise
Interactive visualisation
Use iSEE::iSEE() to launch an interactive web-application to visualise the contents of the SingleCellExperiment object.
Bonus point