Exercise
Import scRNA-seq data and create a Seurat object
Use the function Read10X() to import data in the directory filtered_feature_bc_matrix/
as an object named read10x_data.
What class of object does the function return?
read10x_data <- Read10X("/project/obds/shared/resources/4_r_single_cell/singlecell_seuratday1/filtered_feature_bc_matrix")
#set gene.column = 1 to use Ensembl IDs as feature names instead of gene symbols; this may be more useful later, once it is clearer which genes are interesting
Answer: a sparse matrix of class dgCMatrix (from the Matrix package)
Have a look at the object and its structure (e.g., first 15 rows and 6 columns).
What is a sparse matrix and how does it store data?
class(read10x_data)
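read10x_data[1:15, 1:6]
Answer: the dots in the printed matrix are zeros, which a sparse matrix does not store. Only non-zero entries are kept, each with its position and value; otherwise every zero would also cost three values (row ID, column ID, content).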
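To make the storage scheme concrete, here is a minimal sketch on a toy matrix (the values and the names gene1..gene3 and cell1..cell4 are made up for illustration). It assumes the Matrix package, which provides the dgCMatrix class; this class uses a compressed column layout, so it keeps row indices plus column pointers rather than one explicit (row, column) pair per value.

```r
library(Matrix)

# Toy 3 x 4 count matrix, mostly zeros (hypothetical values, filled column by column)
dense <- matrix(c(0, 2, 0,
                  0, 0, 0,
                  5, 0, 0,
                  0, 0, 1),
                nrow = 3,
                dimnames = list(paste0("gene", 1:3), paste0("cell", 1:4)))

sparse <- Matrix(dense, sparse = TRUE)
class(sparse) # dgCMatrix
sparse        # zeros are printed as dots

# Only the three non-zero entries are stored
sparse@x # the non-zero values
sparse@i # zero-based row index of each non-zero value
sparse@p # column pointers: where each column starts within x and i
```

The empty second column costs nothing beyond one repeated pointer in the slot p.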
How many features and barcodes (i.e., cells) are present in the data set?
read10x_data@Dim
glimpse(read10x_data@Dimnames)
Answer: 33538 features, 5155 barcodes
Create a Seurat object using the function CreateSeuratObject()
and the object read10x_data.
Name the object seurat_object.
Include features detected in at least 3 cells,
and cells where at least 200 features detected.
Name the project pbmc5k.
How many features and barcodes are left in the Seurat object?
seurat_object <- CreateSeuratObject(counts = read10x_data,
                                    project = "pbmc5k",
                                    min.cells = 3, #arbitrary - set based on analysis
                                    min.features = 200) #arbitrary - depends on analysis
#project labels one dataset; relevant when merging batches later
seurat_object
Answer: 19037 features, 5100 barcodes
How many features and cells were filtered out when you created the Seurat object?
dim(read10x_data) - dim(seurat_object)
Answer: 14501 features and 55 barcodes were filtered out
Exercise
Accessing the contents of a Seurat object
Query the name of the default assay in the Seurat object.
seurat_object@active.assay
DefaultAssay(seurat_object) #if an accessor function exists, prefer it: direct slot access with @ relies on the object structure staying constant, so code might stop working if developers change it
List the names of assays available in the Seurat object.
seurat_object@assays #only one
Assays(seurat_object) #will return just the names, more user-friendly
Display the first six rows and six columns of the RNA assay data.
What function do you use?
Which arguments do you need to supply, and which ones are optional?
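One possible answer is GetAssayData(). The sketch below assumes the Seurat version used elsewhere in this notebook, where the argument is called slot (more recent releases name it layer instead). Only the object is required; assay and slot are optional and default to the active assay and the "data" slot.

```r
library(Seurat)

# First six features (rows) and six cells (columns) of the RNA assay.
# Assumes seurat_object was created in the chunk above.
rna_head <- GetAssayData(seurat_object, assay = "RNA", slot = "data")[1:6, 1:6]
rna_head
```

Before normalisation, the "data" slot holds the same values as the raw counts.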
Use the function FetchData() to access the library size and expression of the feature named "LYZ" (first six rows).
What type of object do you get back?
FetchData(seurat_object, vars = c("LYZ", "nCount_RNA"), slot = "data")[1:6, ]
head(FetchData(seurat_object, vars = c("LYZ"), slot = "data")) #head() is preferable to [ ] when querying only one variable
#variables can come from different locations within the Seurat object, giving a ggplot-friendly data.frame
#e.g. assay data and metadata together here
Answer: data.frame
Demo
Common operations on Seurat objects
WhichCells() returns the names of cells that match a logical expression.
WhichCells(seurat_object, expression = LYZ > 500)
VariableFeatures() returns the names of variable features (for a given assay, if computed).
VariableFeatures(seurat_object)
subset() returns a new Seurat object restricted to certain features and cells.
subset(
  x = seurat_object,
  cells = WhichCells(seurat_object, expression = LYZ > 500),
  features = VariableFeatures(object = seurat_object)
)
Exercise
Quality control and visualisation
The library size and number of features detected per cell is already present in the Seurat object.
Use the function VlnPlot() to display them in a single violin plot.
VlnPlot(seurat_object, features = c("nCount_RNA","nFeature_RNA"))
Use the function PercentageFeatureSet() to compute the fraction of reads
assigned to mitochondrial genes in each cell.
Store the metric in the cell metadata of the Seurat object, under the name "percent_mt".
Visualise this new metric alongside the previous two in a new violin plot.
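A possible solution is sketched below. It assumes that features are named by human gene symbols, so that mitochondrial genes share the prefix "MT-"; the pattern would need to change for Ensembl IDs or another species. Note that PercentageFeatureSet() returns a percentage (0 to 100), not a fraction.

```r
library(Seurat)

# Percentage of counts assigned to features whose name starts with "MT-"
# (assumption: human gene symbols are used as feature names)
seurat_object[["percent_mt"]] <- PercentageFeatureSet(seurat_object, pattern = "^MT-")

# All three quality control metrics in one violin plot
VlnPlot(seurat_object, features = c("nCount_RNA", "nFeature_RNA", "percent_mt"))
```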
Create a new Seurat object, called seurat_after_qc, that is subsetted to cells that have more than 4,500 UMI counts, less than 15% of UMI counts assigned to mitochondrial features, and more than 1,500 features detected.
How many cells were removed in this step?
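One way to apply these thresholds is sketched below, assuming the metadata column percent_mt has been stored as the previous step asks. The name n_cells_before is only introduced here for the comparison.

```r
library(Seurat)

n_cells_before <- ncol(seurat_object)

# Keep cells that pass all three thresholds
seurat_after_qc <- subset(
  seurat_object,
  subset = nCount_RNA > 4500 & percent_mt < 15 & nFeature_RNA > 1500
)

# Number of cells removed by the filter
n_cells_before - ncol(seurat_after_qc)
```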
Exercise
Normalisation
Normalise the RNA assay of the Seurat object (after quality control) using the "LogNormalize" method.
seurat_after_qc <- NormalizeData(seurat_after_qc, normalization.method = "LogNormalize")
#default scale.factor is 10,000 (i.e. counts per 10,000) - this is an arbitrary number
#if the data is good quality, any normalisation approach should give you pretty much the same results
#you might lose the odd differentially expressed gene, or a bit of precision, but 99% should be the same
Bonus
Visualise the distribution of raw counts and normalised data for a feature of your choice.
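A minimal sketch for this bonus, using "LYZ" as the feature of choice and ggplot2 for plotting. The data.frame name lyz_values is made up for this example, and the slot argument follows the usage elsewhere in this notebook.

```r
library(Seurat)
library(ggplot2)

# Raw counts and log-normalised values for the same feature, one row per cell
lyz_values <- data.frame(
  raw = FetchData(seurat_after_qc, vars = "LYZ", slot = "counts")[[1]],
  normalised = FetchData(seurat_after_qc, vars = "LYZ", slot = "data")[[1]]
)

# One histogram per scale
ggplot(lyz_values, aes(x = raw)) + geom_histogram(bins = 50)
ggplot(lyz_values, aes(x = normalised)) + geom_histogram(bins = 50)
```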
Exercise
Variable features and scaling
Identify variable features in the normalised RNA assay of the Seurat object.
Use the "vst" method and select the 2,000 most variable features.
What does this subsetting do, and what are our motivations for doing it?
Answer: only 2,000 genes are taken forward in the analysis, compared to 4202
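The call that this step asks for could look like the sketch below.

```r
library(Seurat)

# Select the 2,000 most variable features using the "vst" method
seurat_after_qc <- FindVariableFeatures(
  seurat_after_qc,
  selection.method = "vst",
  nfeatures = 2000
)
```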
What is the function to display the name of variable features in a Seurat object (e.g., first 10)?
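How can you control which assay the variable features are pulled from?
VariableFeatures(seurat_after_qc)[1:10]
Answer: VariableFeatures(); its assay argument selects the assay, which otherwise defaults to the active assay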
Use the function VariableFeaturePlot() to visualise the scatter plot of standardised variance against average expression.
How would you use this plot?
VariableFeaturePlot(seurat_after_qc)
Answer: define a sensible cut-off for number of variable features
Scale the normalised RNA assay of the Seurat object, regressing the library size and the fraction of UMI counts assigned to mitochondrial features.
What are the motivations for removing those two sources of variation?
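A possible call is sketched below, assuming the mitochondrial metric is stored in the metadata under the name "percent_mt" as requested earlier. With default settings, only the variable features are scaled.

```r
library(Seurat)

# Scale the data while regressing out library size and mitochondrial content
seurat_after_qc <- ScaleData(
  seurat_after_qc,
  vars.to.regress = c("nCount_RNA", "percent_mt")
)
```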
Exercise
Dimensionality reduction
Run a principal component analysis on the Seurat object.
Which features are used by the method in the default settings?
How could you change this?
How do you read the message output of the function RunPCA()?
seurat_after_qc <- RunPCA(seurat_after_qc,
                          features = NULL, #default = run on variable features
                          reduction.name = "pca")
Answer: the message output lists, for each principal component, the features with the largest positive and negative loadings
List the names of dimensionality reduction results available in the Seurat object.
Reductions(seurat_after_qc)
Use PCAPlot() or DimPlot() to produce a scatterplot of the first and second PCA components.
PCAPlot(seurat_after_qc)
Bonus
Make a scatterplot of the first and second PCA components yourself using ggplot2.
# Use this code chunk to prepare a data.frame for ggplot2
pca_data <- FetchData(seurat_after_qc,
vars = c("PC_1","PC_2"))
head(pca_data)
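With pca_data prepared as above, the scatterplot itself can be sketched as follows (the point size and theme are arbitrary choices).

```r
library(ggplot2)

# One point per cell, positioned by its coordinates on the first two PCs
ggplot(pca_data, aes(x = PC_1, y = PC_2)) +
  geom_point(size = 0.5) +
  theme_classic()
```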
Visualise the amount of variance explained by the top principal components (number of your choice).
How many principal components would you use for downstream analyses?
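One option is ElbowPlot(), which plots the standard deviation of each principal component rather than the variance itself. The sketch below also derives a variance share from the stored standard deviations; the names pc_variance and pc_variance_share are made up here, and the share is relative to the computed components only, not to the total variance in the data.

```r
library(Seurat)

# Standard deviation of the first 50 PCs (50 is the default number computed by RunPCA())
ElbowPlot(seurat_after_qc, ndims = 50, reduction = "pca")

# Variance per PC, as a share of the variance captured by the computed PCs
pc_variance <- Stdev(seurat_after_qc, reduction = "pca")^2
pc_variance_share <- pc_variance / sum(pc_variance)
head(pc_variance_share)
```

The point where the curve flattens is a common, if subjective, guide for how many components to keep.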
Run the UMAP technique on your selected number of principal components and visualise the result as a scatterplot.
seurat_after_qc <- RunUMAP(seurat_after_qc,
                           dims = 1:20, #only used if features is NULL (could instead pass e.g. features = c(...))
                           n.components = 2)
DimPlot(seurat_after_qc, reduction = "umap")
UMAPPlot(seurat_after_qc)
Exercise
Clustering
Compute the graph of nearest neighbours using the function FindNeighbors().
Which principal components are used by default?
Instead, specify the number of principal components that you have chosen earlier.
seurat_after_qc <- FindNeighbors(seurat_after_qc, dims = 1:20) #same 20 PCs as passed to RunUMAP() above
#useful to start with default for k - but the larger your dataset, i.e. the more cells
Answer: the help page states that the function FindNeighbors() uses principal components 1 through 10, by default.
What are the names of the nearest neighbour graphs that are now stored in the Seurat object?
Answer: RNA_nn and RNA_snn
seurat_after_qc@graphs
Finally, compute cluster labels.
What is the default setting for the resolution argument?
Instead, set it to 0.5.
Do you expect more or fewer clusters following that change?
What other parameters would you also try to experiment with?
res <- c(0.3, 0.5, 0.7, 0.9)
seurat_after_qc <- FindClusters(seurat_after_qc,
                                resolution = res,
                                algorithm = 1) #community detection algorithm (1 = Louvain, the default)
#passing a vector of resolutions stores each clustering as its own column in the metadata
Visualise the cluster labels on the UMAP scatter plot.
How would you describe the agreement between the UMAP layout and the clustering results?
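A sketch of one way to do this, assuming the clustering at resolution 0.5 is stored in the metadata column "RNA_snn_res.0.5", the name that FindClusters() gives it when run on the RNA assay.

```r
library(Seurat)

# Colour cells on the UMAP layout by their cluster label at resolution 0.5
DimPlot(
  seurat_after_qc,
  reduction = "umap",
  group.by = "RNA_snn_res.0.5",
  label = TRUE
)
```

Swapping the group.by value for another resolution column makes it easy to compare clusterings on the same layout.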
Exercise
Identify cluster markers
Use the function FindAllMarkers() to identify
positive markers for all clusters,
filtering markers that are detected in at least 25% of the cluster,
and with a log fold-change greater than 0.25.
Assign the result to an object named seurat_markers_all.
What is the class of that object?
How do you control the set of clusters that are used?
Idents(seurat_after_qc) <- "RNA_snn_res.0.5" #make sure to choose the right resolution
seurat_markers_all <- FindAllMarkers(
  seurat_after_qc,
  features = NULL, #default to use all genes
  only.pos = TRUE, #keep positive markers only, as the exercise asks
  logfc.threshold = 0.25,
  min.pct = 0.25)
class(seurat_markers_all)
Answer: data.frame
How do you read the contents of the object seurat_markers_all?
How do you know which features are the markers of each cluster?
Visualise the expression of the top 4 markers for cluster 3 on a UMAP layout.
top4_3 <- seurat_markers_all %>%
  filter(cluster == 3) %>%
  filter(p_val_adj < 0.0001) %>% #could probably do this in FindAllMarkers() too
  arrange(desc(avg_log2FC)) %>% #can choose abs(avg_log2FC) too; for a heatmap, positive changes can be easier to interpret
  slice_head(n = 4) %>% #better than top_n(), which sorts on the fly by the last variable in the table if none is specified
  pull(gene) #character vector of gene names
FeaturePlot(seurat_after_qc,
            reduction = "umap",
            features = top4_3,
            label = TRUE)
Visualise the expression of those same 4 markers as a violin plot.
Do you have any particular preference between the two types of plots?
VlnPlot(seurat_after_qc, features = top4_3)
Answer:
Use DoHeatmap() to visualise the top 10 (positive) markers for each cluster.
Hint: you may want to use the function dplyr::group_by().
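A sketch following the hint, assuming the columns cluster, gene and avg_log2FC returned by FindAllMarkers() above. The name top10_markers is made up here. DoHeatmap() draws from the scaled data by default, so markers that were not scaled (for example, genes outside the variable features) are left out of the plot with a warning.

```r
library(Seurat)
library(dplyr)

# Top 10 positive markers per cluster, ranked by log fold-change
top10_markers <- seurat_markers_all %>%
  filter(avg_log2FC > 0) %>%
  group_by(cluster) %>%
  slice_max(order_by = avg_log2FC, n = 10, with_ties = FALSE) %>%
  ungroup()

# A gene can mark more than one cluster, so pass each gene once
DoHeatmap(seurat_after_qc, features = unique(top10_markers$gene))
```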
title: "Example code for single-cell analysis with Seurat, day 1"
author: "Kevin Rue-Albrecht"
date: "05/10/2021"
output: html_document