title: "Example code for single-cell analysis with Seurat, day 1" author: "Kevin Rue-Albrecht" date: "05/10/2021" output: html_document

knitr::opts_chunk$set(echo = TRUE)
library(Seurat)
library(tidyverse)

Exercise

Import scRNA-seq data and create a Seurat object

Load the Seurat package.

Use the function Read10X() to import data in the directory filtered_feature_bc_matrix/ as an object named read10x_data. What class of object does the function return?

read10x_data <- Read10X("/project/obds/shared/resources/4_r_single_cell/singlecell_seuratday1/filtered_feature_bc_matrix")
#change to match Ensembl ID by changing gene.column value, but may be more useful later when it's more clear what is interesting

Answer:

Have a look at the object and its structure (e.g., first 15 rows and 6 columns). What is a sparse matrix and how does it store data?

class(read10x_data)

read10x_data[1:5,1:16]

Answer:dots are features of sparse matrix - othewise each zero = rowID, colID, content = three values

How many features and barcodes (i.e., cells) are present in the data set?

read10x_data@Dim
glimpse(read10x_data@Dimnames)

Answer: 33538 features, 5155 barcodes

Create a Seurat object using the function CreateSeuratObject() and the object read10x_data. Name the object seurat_object. Include features detected in at least 3 cells, and cells where at least 200 features detected. Name the project pbmc5k. How many features and barcodes are left in the Seurat object?

seurat_object <- CreateSeuratObject(counts = read10x_data, 
                                    project = "pmbc5k", 
                                    min.cells = 3, #arbitrary - set based on analysis
                                    min.features = 200) #arbitrary - depends on analysis 
#project = one dataset, relevant when wanting to merge batches later 
seurat_object

Answer: 19037 features, 5100 barcodes

How many features and cells were filtered out when you created the Seurat object?

dim(read10x_data) - dim(seurat_object)

lost 55 barcodes and 14501 features

Exercise

Accessing the contents of a Seurat object

Query the name of the default assay in the Seurat object.

seurat_object@active.assay
DefaultAssay(seurat_object) #if there is a function, generally better to use function, relies on structure staying constant, so code might stop working if developers change it

List the names of assays available in the Seurat object.

seurat_object@assays #only one 
Assays(seurat_object) #will return just the names, more user-friendly

Display the first six rows and six columns of the RNA assay data. What function do you use? Which arguments do you need to supply, and which ones are optional?

seurat_object@assays$RNA@data[1:6,1:6] #see above
GetAssayData(seurat_object,assay = "RNA",slot = "data")[1:6,1:6]

Answer:

Display the entire data.frame of per-cell metadata (first six rows). What column names do you see?

seurat_object[[]][1:6,]
head(seurat_object[[]])

Answer:orig.ident, nCount_RNA, nFeature_RNA

Fetch one column of metadata using [[. What type of object do you get back?

seurat_object[["nCount_RNA"]]
class(seurat_object[["nCount_RNA"]])

Answer: factor

Instead,fetch the same column of metadata using $. What type of object do you get back this time?

head(seurat_object$nCount_RNA)
class(seurat_object$nCount_RNA)

Answer: numeric - named vector

Use the function FetchData() to access the library size and expression of the feature named "LYZ" (first six rows). What type of object do you get back?

FetchData(seurat_object, vars = c("LYZ","nCount_RNA"), slot = "data")[1:6,]
head(FetchData(seurat_object, vars = c("LYZ"), slot = "data")) #head > [] if only queryign one variable

#variables can come from different location within the R object to generate ggplot-friendly dataframe
#e.g. assays and metadata together here

data.frame

Demo

Common operations on Seurat objects

WhichCells() returns the names of cells that match a logical expression.

WhichCells(seurat_object, expression = LYZ > 500)

VariableFeatures() returns the names of variable features (for a given assay, if computed).

VariableFeatures(seurat_object)

subset() returns a new Seurat object restricted to certain features and cells.

subset(
    x = seurat_object,
    cells = WhichCells(seurat_object, expression = LYZ > 500),
    features = VariableFeatures(object = seurat_object)
)

Exercise

Quality control and visualisation

The library size and number of features detected per cell is already present in the Seurat object. Use the function VlnPlot() to display them in a single violin plot.

VlnPlot(seurat_object, features = c("nCount_RNA","nFeature_RNA"))

Use the function PercentageFeatureSet() to compute the fraction of reads assigned to mitochondrial genes in each cell. Store the metric in the cell metadata of the Seurat object, under the name "percent_mt". Visualise this new metric alongside the previous two in a new violin plot.

seurat_object[["percent.mt"]] <- PercentageFeatureSet(seurat_object,pattern = "^MT-")
VlnPlot(seurat_object, features = c("nCount_RNA", "nFeature_RNA", "percent.mt"))

Visualise a scatter plot of the proportion of mitochondrial UMIs against the library size in each cell.

FeatureScatter(seurat_object, "percent.mt","nFeature_RNA")

Create a new Seurat object, called seurat_after_qc, that is subsetted to cells that have more than 4,500 UMI counts, less than 15% of UMI counts assigned to mitochondrial features, and more than 1,500 features detected. How many cells were removed in this step?

seurat_after_qc <- subset(seurat_object, 
                          subset = nCount_RNA > 4500 & percent.mt < 15 & nFeature_RNA > 1500)
seurat_after_qc

dim(seurat_object) - dim(seurat_after_qc)

Answer: 896 cells removed

Exercise

Normalisation

Normalise the RNA assay of the Seurat object (after quality control) using the "LogNormalize" method.

seurat_after_qc <- NormalizeData(seurat_after_qc, normalization.method = "LogNormalize")
#default scale.factor is 10,000 (i.e. count per 10,000) - this is an arbitrary number
#if data is good quality, any normalisation approach should give you pretty much the same reusults
#might lose the odd differentially expressed gene, or a bit of precision, but 99% should be the same

Bonus

Visualise the distribution of raw counts and normalised data for a feature of your choice.

GetAssayData(seurat_object, slot = "counts") 
GetAssayData(seurat_after_qc, slot = "counts")

ggplot_lyz_raw <- ggplot(FetchData(seurat_object, vars = "LYZ", slot = "data"),
                         aes(x = LYZ)) +
    geom_histogram() +
    coord_cartesian(ylim = c(0, 500)) +
    cowplot::theme_cowplot()
ggplot_lyz_normalised <- ggplot(FetchData(seurat_after_qc, vars = "LYZ", slot = "data"),
                         aes(x = LYZ)) +
    geom_histogram() +
    coord_cartesian(ylim = c(0, 500)) +
    cowplot::theme_cowplot()
cowplot::plot_grid(ggplot_lyz_raw, ggplot_lyz_normalised, ncol = 1)

Exercise

Variable features and scaling

Identify variable features in the normalised RNA assay of the Seurat object. Use the "vst" method and select the 2,000 most variable features. What does this subsetting do, and what are our motivations for doing it?

seurat_after_qc <- FindVariableFeatures(seurat_after_qc,
                                        selection.method = "vst",
                                        nfeatures = 2000)

Answer:only 2000 genes taken forward in analysis compared to 4202

What is the function to display the name of variable features in a Seurat object (e.g., first 10)? How can you control which assay the variable features are pull from?

VariableFeatures(seurat_after_qc)[1:10]

Answer:

Use the function VariableFeaturePlot() to visualise the scatter plot of standardised variance against average expression. How would you use this plot?

VariableFeaturePlot(seurat_after_qc)

Answer: define a sensible cut-off for number of variable features

Scale the normalised RNA assay of the Seurat object, regressing the library size and the fraction of UMI counts assigned to mitochondrial features. What are the motivations for removing those two sources of variation?

seurat_after_qc <- ScaleData(seurat_after_qc,
                             vars.to.regress = c("nCount_RNA","percent.mt"))

Answer:

Exercise

Dimensionality reduction

Run a principal component analysis on the Seurat object. Which features are used by the method in the default settings? How could you change this? How do you read the message output of the function RunPCA()?

seurat_after_qc <- RunPCA(seurat_after_qc, 
                          features = NULL, #default = run on variable fetaures 
                          reduction.name = "pca")

Answer:message output: top genes contributing to respective principal components

List the names of dimensionality reduction results available in the Seurat object.

Reductions(seurat_after_qc)

Use PCAPlot() or DimPlot() to produce a scatterplot of the first and second PCA components.

PCAPlot(seurat_after_qc)

Bonus

Make a scatterplot of the first and second PCA components yourself using ggplot2.

# Use this code chunk to prepare a data.frame for ggplot2
pca_data <- FetchData(seurat_after_qc,
                      vars = c("PC_1","PC_2"))
head(pca_data)

ggplot(pca_data,aes(x = PC_1, y = PC_2)) +
    geom_point(size = 0.2) +
    cowplot::theme_cowplot()

Visualise the amount of variance explained the top principal components (number of your choice). How many principal components would you use for downstream analyses?

ElbowPlot(seurat_after_qc, ndims = 50, reduction = "pca")

would use 20 principal components

Run the UMAP technique on your selected number of principal components and visualise the result as a scatterplot.

seurat_after_qc <- RunUMAP(seurat_after_qc, 
                           dims = 1:20, #set if features NULL (could say e.g. features c(...) + PC20 )
                           n.components = 2)
DimPlot(seurat_after_qc, reduction = "umap")
UMAPPlot(seurat_after_qc)

Exercise

Clustering

Compute the graph of nearest neighbours using the function FindNeighbors(). Which principal components are used by default? Instead, specify the number of principal components that you have chosen earlier.

seurat_after_qc <- FindNeighbors(seurat_after_qc)
#useful to start with default for k - but the larger your dataset, i.e. the more cells

Answer:

The help page states that the function FindNeighbors() uses principal components 1 through 10, by default.

What are the names of the nearest neighbour graphs that are now stored in the Seurat object? RNA_nn

seurat_after_qc@graphs

Finally, compute cluster labels. What is the default setting for the resolution argument? Instead, set it to 0.5. Do you expect more or fewer clusters following that change? What other parameters would you also try to experiment with?

res <- c(0.3,0.5,0.7,0.9)
seurat_after_qc <- FindClusters(seurat_after_qc, 
                                resolution  = res, 
                                algorithm = 1) #Community detection algorithm (default is Louvain)

#introducing resolutions as a vector stores them all as individual columns in metadata

Visualise the cluster labels on the UMAP scatter plot. How would you describe the agreement between the UMAP layout and the clustering results?

library(cowplot)
cluster_resolution_plots <- lapply(
  res, 
  FUN = function(x) UMAPPlot(seurat_after_qc, group.by = paste0("RNA_snn_res.",x), label = TRUE))
plot_grid(
  cluster_resolution_plots[[1]],
  cluster_resolution_plots[[2]],
  cluster_resolution_plots[[3]],
  cluster_resolution_plots[[4]],
  ncol = 2
)

Exercise

Identify cluster markers

Use the function FindAllMarkers() to identify positive markers for all clusters, filtering markers that are detected in at least 25% of the cluster, and with a log fold-change greater than 0.25. Assign the result to an object named seurat_markers_all. What is the class of that object? How do you control the set of clusters that are used?

Idents(seurat_after_qc) <- "RNA_snn_res.0.5" #make sure to chose the right resolution
seurat_markers_all <- FindAllMarkers(
    seurat_after_qc,
    features = NULL, #default to use all genes
    logfc.threshold = 0.25,
    min.pct = 0.25)
class(seurat_markers_all)

Answer:data frame

How do you read the contents of the object seurat_markers_all? How do you know which features are the markers of each cluster?

head(seurat_markers_all)

Answer: cluster column gives cluster identity. pct1 - expression percentag ewithin cluster, pct.2 = expression outside of cluster

Filter and display the top 10 markers for cluster 3.

seurat_markers_all %>% group_by(cluster) %>% arrange(abs(avg_log2FC)) %>% top_n(3)

Visualise the expression of the top 4 marker for cluster 3 on a UMAP layout.


top4_3 <- seurat_markers_all %>% filter(cluster == 3) %>% 
  filter(p_val_adj < 0.0001) %>% #could probably do this in FindAllMarkers()too
  arrange(desc(avg_log2FC)) %>% #can choose abs(avg_log2FC) too, for heatmap positive changes can be easier to interpret
  slice_head(n = 4) %>% #better than top_n() as that will sort on the fly by last variable in the table if not specified 
  select(gene) %>% unlist()
FeaturePlot(seurat_after_qc, 
            reduction = "umap",
            features = top4_3, 
            label = TRUE)

Visualise the expression of those same 4 marker as a violin plot. Do you have any particular preference between the two types of plots?

VlnPlot(seurat_after_qc, features = top4_3)

Answer:

Use DoHeatmap() to visualise the top 10 (positive) markers for each cluster. Hint: you may want to use the function dplyr::group_by().

markers_top10_clusters <- seurat_markers_all %>%
  filter(p_val_adj < 0.0001) %>% 
  group_by(cluster) %>%
  arrange(desc(avg_log2FC)) %>%
  slice_head(n = 10) 

DoHeatmap(seurat_after_qc,
          features = markers_top10_clusters$gene)

hannahsfuchs / OBDS_Course

scRNAseq with Seurat #1

title: "Example code for single-cell analysis with Seurat, day 1" author: "Kevin Rue-Albrecht" date: "05/10/2021" output: html_document

Exercise

Import scRNA-seq data and create a Seurat object

Exercise

Accessing the contents of a Seurat object

Demo

Common operations on Seurat objects

Exercise

Quality control and visualisation

Exercise

Normalisation

Bonus

Exercise

Variable features and scaling

Exercise

Dimensionality reduction

Bonus

Exercise

Clustering

Exercise

Identify cluster markers