title: "Example code for single-cell analysis with Seurat, day 2" author: "Devika Agarwal, updated by Carla Cohen" date: "25/10/2022" output: html_document

knitr::opts_chunk$set(echo = TRUE)
library(Seurat)
library(tidyverse)
library(patchwork)
library(DT)
library(gprofiler2)

Exercise

Read in the RDS object we created and save from Seurat day 1

use the readRDS() function to read in previously saved object

seurat_after_qc <- readRDS("seurat_after_qc.rds")

Check the Default Assay: RNA

DefaultAssay(seurat_after_qc)

Apply SCTransfrom normalisation

Use SCTransform() function

SCTransform vignette: https://satijalab.org/seurat/articles/sctransform_vignette.html

SCTransform() command replaces NormalizeData(), ScaleData and FindVariableFeatures() run for the RNA assay in day 1 Seurat

Should we remove any confounding variables like we did for the RNA assay for Day 1?

Do we want to use the same number of variable featuresn(n=1000) or more than what we used for NormalizeData() function.

if percent_mt hadn't been calculated in previous qc yet: seurat_after_qc <- PercentageFeatureSet(seurat_after_qc,pattern = "^MT-", col.name = "percent.mt")


seurat_after_qc <- SCTransform(seurat_after_qc,
                               vars.to.regress = "percent_mt", #don't need to regress out nCount_RNA here as the model by principle accounts for sequencing depth 
                               variable.features.n = 2000) #default is 3000

Where is the new normalisation stored? Answer: stored in new assay called 'SCT' ScTransform may discard genes - so cannot just replace 'data' slot, as it is a dataframe with different dimensions

Explore the seurat_after_qc objects metadata and assays

glimpse(seurat_after_qc[[]])

Is there a change? Answer: less genes

Are there new columns in the metadata? Answer: nCount_SCT, nFeature_SCT

Exercise

Visulaisation

The library size and number of features detected per cell is already present in the Seurat object.
When you run SCTransform you get two new variables for library size and features for SCT normalisation
Use the function VlnPlot() to compare (i) RNA assay vs SCT assay library size in a single plot (ii) features detected in RNA vs SCT a single plot
Check to see how you can have fixed y axes (ylims) in the VlnPlot() function

library(cowplot)
library_plot_compare <- VlnPlot(seurat_after_qc, features = c("nCount_RNA","nCount_SCT"), 
                                same.y.lims = TRUE)
feature_plot_compare <- VlnPlot(seurat_after_qc, features = c("nFeature_RNA","nFeature_RNA"),
                                same.y.lims = TRUE)
plot_grid(library_plot_compare, feature_plot_compare, nrow = 2)

Bonus-

Visualise the distribution of SCT corrected count data for a feature of your choice.
Visualise the distribution of SCT logs counts in the data slot for a feature of choice
Visualise the distribution of SCT scale.data count feature of your choice.

Let's choose LYZ like day 1

DefaultAssay(seurat_after_qc) <- "SCT"
ggplot_lyz_raw <- ggplot(FetchData(seurat_after_qc,vars = "LYZ" , slot = "counts"), aes(LYZ)) +
    geom_histogram(fill = "grey", color = "black", bins = 100) +
    coord_cartesian(ylim = c(0, 500)) +
    cowplot::theme_cowplot()
ggplot_lyz_normalised <- ggplot(FetchData(seurat_after_qc,vars = "LYZ", slot = "data"), aes(LYZ))+
    geom_histogram(fill = "grey", color = "black", bins = 100) +
    coord_cartesian(ylim = c(0, 500)) +
    cowplot::theme_cowplot()
ggplot_lyz_scaled <- ggplot(FetchData(seurat_after_qc,vars = "LYZ", slot = "scale.data"),aes(LYZ))+
    geom_histogram(fill = "grey", color = "black", bins = 100) +
    coord_cartesian(ylim = c(0, 500)) +
    cowplot::theme_cowplot()

cowplot::plot_grid(ggplot_lyz_raw, ggplot_lyz_normalised,ggplot_lyz_scaled, ncol = 1)

Use the function VariableFeatures to pull out the 1:10 the variable genes after SCT and compare to 1:10 from the RNA assay

Do we need to change any arguments to get the variables genes specific to the SCT or RNA assay

FindVariableFeatures(seurat_after_qc, assay = "RNA")@assays$RNA@var.features[100:110]
FindVariableFeatures(seurat_after_qc, assay = "SCT")@assays$RNA@var.features[100:110]

How do the two gene lists compare?

Identical for first 10, also for 100:110

Exercise

Dimensionality reduction on SCT transformed data

Run a principal component analysis and UMAP on the Seurat object.
Check the Default assay
Do we want to change the reduction.name argument so that we can still keep the RNA assay based PCA results?

DefaultAssay(seurat_after_qc) <- "SCT"
seurat_after_qc <- RunPCA(seurat_after_qc,
                          reduction.name = "sct.pca")

Check to see what reductions are now present in the object > [1] "pca" "umap" "sct.pca"

Reductions(seurat_after_qc)

First, visualise the amount of variance explained the top principal components for SCTransformed data (number of your choice). How many principal components would you use for downstream analyses?
Do we need to specify the reduction?

ElbowPlot(seurat_after_qc, ndims = 50, reduction = "sct.pca")

Run the UMAP technique on the first 20 principal components stored in the Seurat object and visualise the result as a scatterplot.

How can we change the reduction name from default "umap" to "sct.umap"

How can we specify that we want to use PCA run on the SCT Assay (sct.pca) in the previous step?

seurat_after_qc <- RunUMAP(seurat_after_qc,
                           dims = 1:20, 
                           reduction = "sct.pca",
                           reduction.name = "sct.umap")

Use DimPlot() to plot the umap. What happens if you try to specify different reductions with UMAPPlot?

Compare RNA based umap with sct.umap

p1 <- DimPlot(seurat_after_qc, reduction = "umap") + ggtitle("RNA UMAP")
p2 <- DimPlot(seurat_after_qc, reduction = "sct.umap") + ggtitle("SCT UMAP")
p1 + p2

Exercise

Clustering on SCTransformed data

Compute the graph of nearest neighbours using the function FindNeighbors(). Which principal components are used by default? Instead, specify the number of components that you have chosen. Have you chosen the right reduction and assay?

seurat_after_qc <-  FindNeighbors(seurat_after_qc,
                                  reduction = "sct.pca", 
                                  dims = 1:20, 
                                  assay = "SCT") #don't need to specify names if using different assays, as will append prefix automatically
Graphs(seurat_after_qc)

Finally, compute cluster labels. What is the default setting for the resolution argument? Instead, set it to 0.5. Do you expect more or fewer clusters following that change? What other parameters would you also try to experiment with?
- should we specify the graph.name?

seurat_after_qc <- FindClusters(seurat_after_qc,
                                resolution = 0.5,
                                graph.name = "SCT_snn") #snn (shared nearest neighbour) more robust than nn as bidirectional

Check cluster assignment between SCT and RNA workflow

If you use the same resolution = o.5 and dims as RNA workflow do you get the same number of cluster or more?

Are cells in the same cluster across both RNA and SCT

table(seurat_after_qc$RNA_snn_res.0.5, seurat_after_qc$SCT_snn_res.0.5)

Visualise the SCT cluster labels on the SCT transformed UMAP scatter plot and the RNA cluster labels on the RNA UMAP How would you describe the agreement between the UMAP layout and clustering for SCT vs RNA Assay results?

p1 <- DimPlot(seurat_after_qc,reduction = "umap",group.by = "RNA_snn_res.0.5") + ggtitle("RNA UMAP")
p2 <- DimPlot(seurat_after_qc,reduction = "sct.umap", group.by = "SCT_snn_res.0.5") + ggtitle("SCT UMAP")
p1 + p2

Plot some known cell-type markers for PBMC datasets, does the SCT better separate the celltypes?

CD14+ Monocyte : LYZ, CD14 CD16 Monocytes : FCGR3A, MS4A7 CD4 T : CD4, IL76 CD8 T : CD8A, CD3D NK : GNLY, GZMB,NKG7 B Cell : MS4A1 , CD79A DC : CST3, FCER1A Platelets : PPBP

FeaturePlot(seurat_after_qc, features = c("CD14","FCGR3A","CD4","CD8A","GNLY","MS4A1","CST3","PPBP"), reduction = "umap")
FeaturePlot(seurat_after_qc, features = c("CD14","FCGR3A","CD4","CD8A","GNLY","MS4A1","CST3","PPBP"), reduction = "sct.umap")

FeaturePlot(seurat_after_qc, features = c("IGHA1","IGHG1","IGHG2","IGHG3","IGHD","IGHM","IGHE"), reduction = "umap")
FeaturePlot(seurat_after_qc, features = c("IGHA1","IGHG1","IGHG2","IGHG3","IGHD","IGHM","IGHE"), reduction = "sct.umap")

Calculate the markers for these clusters from either the RNA or SCT assay

Idents(seurat_after_qc) <- "RNA_snn_res.0.5"
seurat_markers_all <- FindAllMarkers(seurat_after_qc,
                                     assay = "RNA", 
                                     logfc.threshold = 0.25, 
                                     min.pct = 0.25)
Idents(seurat_after_qc) <- "SCT_snn_res.0.5"
seurat_markers_all_sct <- FindAllMarkers(seurat_after_qc,
                                         assay = "SCT", 
                                         logfc.threshold = 0.25,
                                         min.pct = 0.25)

write.csv(seurat_markers_all_sct, "seurat_markers_all_sct.csv")
saveRDS(seurat_after_qc, "seurat_after_qc_21022023.rds")

Bonus exercise to try in your own time:: Pathway analysis on Cluster markers for all clusters

Choose either RNA workflow based generated or SCT generated based Seurat marker results
we will be using Gprofiler gost() function for multiple gene lists at the same time
First we need to filter Seurat top significant (p_adj < 0.05) upregulated genes with a LogFC threshold (decided by you) for each cluster, use dplyr:: group_by() and dplyr::filter to get gene list for each cluster, then only select cluster and gene
We then use split() on the filtered_df to divide gene markers into list of multiple character vectors containing genes split by cluster
can refer to the pathway analysis code from previous tutorial, but use human not mouse pathways

First generate the list of markers for each cluster

seurat_clusters_results_filtered <- 
seurat_clusters_list <-

We then run pathway analysis using gost() with multi_query = TRUE

To generate all_genes_id , we use all genes present in either the RNA assay or SCT assay, as we have already filtered out lowly expressed genes not present in certain cells.

# Choose Default assay based on if running pathway analyses on RNA or SCT results
DefaultAssay(seurat_after_qc) <- ""
# create a vector of  of all genes 
all_genes_id <- 

multi_gostquery_results_obj <- gost()

can you plot the results for different clusters together ?


gostplot()

Afternoon Session

Demultiplexing with hashtag oligos (HTOs)

Dataset : 12-HTO dataset from four human cell lines

Dataset represent Data represent single cells collected from four cell lines: HEK, K562, KG1 and THP1
Each cell line was further split into three samples (12 samples in total)
Each sample was labeled with a hashing antibody mixture (CD29 and CD45), pooled, and run on a single lane of 10X.
Based on this design, we should be able to detect doublets both across and within cell types

Load in the UMI matrix for the RNA data and check the dimensions

hto12.umis <- readRDS("hto12_umi_mtx.rds")
head(hto12.umis)

What do rows and columns represent? Answer: rows represent genes, columns represent barcodes

Load in the HTO matrix and check the dimensions

What is the structure of the hto12.htos object? (what is the rows and what is the columns?)
Check the class of the HTO object
How many cell barcodes and HTO samples do we have?

This is really messy - lots of barcodes have shared HTO identities, which is why an algorithm is required to deconvolute this

hto12.htos <- readRDS("hto12_hto_mtx.rds")
head(hto12.htos)

Now we only want to subset to those cell barcodes or cells (actually called as cells by cellRanger or EmptyDrops on the gene expression data) which are detected by both RNA and HTO matrices

Check the class of the joint.bcs object and how many cell barcodes do we have in common

joint.bcs <- intersect(rownames(hto12.htos), colnames(hto12.umis))
class(joint.bcs) #character vector
head(joint.bcs)

Subset the RNA matrix to only the joint.bcs cell barcodes and check the dimensions

hto12.umis.common <- hto12.umis[,joint.bcs]
head(hto12.umis.common) #25088 rather than 30000 cells now

Create a Seurat object with the RNA (UMI count matrix) data using only the joint bcs

Name the object hto12_object Include features detected in at least 3 cells, and cells where at least 200 features detected Normalise with log normalsiation ,find variable genes and Scale RNA data

hto12_object <- CreateSeuratObject(counts = hto12.umis.common,
                                   min.cells =3,
                                   min.features = 200)
hto12_object <- NormalizeData(hto12_object)
hto12_object <- FindVariableFeatures(hto12_object, selection.method = "vst")
hto12_object <- ScaleData(hto12_object, features = VariableFeatures(hto12_object))

Add HTO data as another assay to hto12_object

First subset the HTO matrix to those cell barcodes which are now in the hto12_object Seurat object and make sure that the features only consist of the HTO tags
Is our subsetted hto12.htos.common in the right format? if not what do we do to get it in the right format before adding it as another assay? Answer :

hto12.htos.common <- hto12.htos[colnames(hto12_object),#don't use joint.bcs as we've done some filtering when we created the seurat object
                                1:12] %>% #remove additional metadata
  t() #transpose to set HTOs as features (stored in rows)
glimpse(hto12.htos.common)
class(hto12.htos.common)

Now use CreateAssayObject() to add the subsetted HTO matrix to the already created hto12_object seurat object as a new assay called HTO

Do we want to do any further filtering on the HTO object? Answer :


hto12_object[["HTO"]] <- CreateAssayObject(hto12.htos.common)

Normalise the HTO data , here we will use the CLR transformation with margin =1 (Default setting) CLR: Applies a centered log ratio transformation This is required because the HTO data is bimodal, i.e. the tag is either present or absent on the cell. This is quite different to the RNA counts.

# check the Default Assay
DefaultAssay(hto12_object) <- "HTO"

hto12_object <- NormalizeData(hto12_object, assay = "HTO", 
                              normalization.method = "CLR" , #HTO data is bimodal (each cell should be labelled with either the presence or absence of a label i.e. not normally distributed)
                              margin=1) #margin = 1 ~ by row (margin = 2 ~ by column)

Demultiplex cells based on HTO enrichment

Here we use Seurat Function HTODemux() to assign single cells to their original samples

hto12_object <- HTODemux(hto12_object, 
                         assay = "HTO", 
                         positive.quantile = 0.99) #p value cut-off for results assigning

Checkout the metadata column of the hto12_object, try to read the HTODemux() results output summary in the Value section to understand the results

head(hto12_object[[]]) #HTO_margin is a score representing the difference between HTO_maxID and HTO_secondID -> used to define doublets vs. singlets

Visualise the Demultiplexing results

We can visualise how many cells are classified as singlets, doublets and negative/ambiguous cells

Check the meta.data, which column do we want for this information?

table(hto12_object[[]]$hash.ID)

Visualize enrichment for selected HTOs with ridge plots

plot the max HTO signal for one of the HTO of each of the 4 cell lines (HEK, K562, KG1 and THP1) features with ridge plots using the RidgePlot() function

plot Max HTO signal

# Change the identities of the seurat object to the relevant metadata column

Idents(hto12_object) <- "hash.ID"
RidgePlot(hto12_object,
          assay = "HTO", 
          features = c("KG1-A","HEK-A","THP1-A","K562-A")) #chooseing four out of 12 samples

Visualize pairs of HTO signals to confirm mutual exclusivity in singlets between the same celline

a) plot scatter plot of 2 HTOs within the same cell line e.g. HEK, colour by (single/doublet/negative status)

b) plot scatter plot of 2 HTOs within the same cell line e.g. HEK, colour by HTO_maxID

c) plot scatter plot of 2 HTOs within the same cell line e.g. HEK, colour by HTO_secondID

use the function FeatureScatter()

DefaultAssay(hto12_object) <- "HTO"
FeatureScatter(hto12_object, 
               feature1 = "HEK-A", feature2 = "HEK-B", 
               group.by = "HTO_secondID") # or "HTO_classification.global", "HTO_maxID"

what do you notice ?

1) SecondID should be entireley random and is 2) lowly expressed on both markers = negative

Bonus Exercise

Plot scatter plot of 2 HTOs across different cell lines e.g. K562 vs KG1 and colour by (single/doublet/negative status) and HTO_max ID

Compare number of RNA UMIs for singlets, doublets and negative cells

What is a suitable plot for such comparisons?

Answer:

Idents(hto12_object) <-

Wuestion: what do you notice?

Answer:

Visualize HTO signals in a heatmap , lookup HTOHeatmap()

HTOHeatmap(hto12_object)

What do you notice? good confidence, almost entire heatmap coloured by the extreme ends of the colour spectrum

Generate a two dimensional tSNE or UMAP embedding for HTOs. Here we are grouping cells by singlets and doublets ONLY for simplicity.

Do we need to subset our object?

If so what are we subsetting out?


Idents(hto12_object) <- 
hto12_object.subset <- subset()

Run UMAP/TSNE

what assay are we running UMAP/tsne for ?

look up the arguments in RunUMAP() and/or RunTSNE() functions

check which arguments in RunUMAP/RunUMAP/RunTSNE can be used to change the name of the reduction from defauult name of pca/umap/tsne to custom name

before we Run UMAP, we need to scale and run PCA like we did in the normal single cell workflow

Answer:

# Calculate a tSNE & UMAP embedding of the HTO data
DefaultAssay(hto12_object.subset) <- "HTO"

hto12_object.subset <- RunUMAP()

check the Reductions in the object

Reductions()

Plot the UMAP/tsne for the HTO assay

which reduction shall we plot?

• colour by if singlet/doublet

• colour by HTO final classification results (hash.ID)

check the arguments on how label the clusters by the cluster identity
can change the label size?

what do you notice about the cluustering on tthe UMAP/tsne, does the number of clusters mean anything?

Answer:

what do you notice about the cloud of cells surrounding each cluster?

Answer:

Bonus exercises

You can also visualize the more detailed classification result by group.by HTO_maxID before plotting.

What happens if you group.by the UMAP/TSNE plot by HTO..maxID?

Answer:

Cluster and visualize cells using the usual scRNA-seq workflow, and examine for the potential presence of batch effects.

do we need to rerun FindVariableFeatures() and ScaleData() again?

Answer :

what other steps do we need run to get visualise our RNA data as UMAP/t-SNE coloured by doublets/singlets and celltypes?

Answer:

DefaultAssay(hto12_object.subset) <- "RNA"
# Run PCA on most variable features
hto12_object.subset <- 
hto12_object.subset <- 

hto12_object.subset <- RunPCA(hto12_object.subset)
hto12_object.subset <- RunUMAP(hto12_object.subset, dims = 1:8)

Plot RNA based UMAP

group.by hash.ID

create a new seurat object meta.data column called _cell_line , which removes "_A or _B or _C " in the hash.ID and replaces it with "", to create a new meta.data with only the cell-line info

#we create another metadata column based on the hash.id column, where we gsub the HTO tag info (-A,-B,-C) for each cell line to plot only the cell lien names to see if we have batch effect

hto12_object.subset$cell_line <- gsub(pattern = "[-ABC]")

DimPlot()

what does our RNA based clustering on the UMAP/T-SNE show?

Answer:

Bonus exercise (try in your own time)

create a second seurat object based , using the code above, and rerun the HTODemux() with a different value of positive quantile.

try to check if the classification changes massively if you adjusted the threshold for classification by playing around with the positive.quantile argument from the default.

hannahsfuchs / OBDS_Course

scRNAseq with Seurat day 2 - scTransform and HTO demultiplexing #2

title: "Example code for single-cell analysis with Seurat, day 2" author: "Devika Agarwal, updated by Carla Cohen" date: "25/10/2022" output: html_document

Exercise

Read in the RDS object we created and save from Seurat day 1

Apply SCTransfrom normalisation

Exercise

Visulaisation

Bonus-

Exercise

Dimensionality reduction on SCT transformed data

Exercise

Clustering on SCTransformed data

Bonus exercise to try in your own time:: Pathway analysis on Cluster markers for all clusters

Afternoon Session

Bonus exercises

Bonus exercise (try in your own time)