hannahsfuchs / OBDS_Course

0 stars 0 forks source link

scRNAseq with Seurat day 2 - scTransform and HTO demultiplexing #2

Open hannahsfuchs opened 1 year ago

hannahsfuchs commented 1 year ago

title: "Example code for single-cell analysis with Seurat, day 2" author: "Devika Agarwal, updated by Carla Cohen" date: "25/10/2022" output: html_document

knitr::opts_chunk$set(echo = TRUE)
library(Seurat)
library(tidyverse)
library(patchwork)
library(DT)
library(gprofiler2)

Exercise

Read in the RDS object we created and save from Seurat day 1

seurat_after_qc <- readRDS("seurat_after_qc.rds")
DefaultAssay(seurat_after_qc)

Apply SCTransfrom normalisation

Use SCTransform() function

SCTransform vignette: https://satijalab.org/seurat/articles/sctransform_vignette.html

SCTransform() command replaces NormalizeData(), ScaleData and FindVariableFeatures() run for the RNA assay in day 1 Seurat

Should we remove any confounding variables like we did for the RNA assay for Day 1?

Do we want to use the same number of variable featuresn(n=1000) or more than what we used for NormalizeData() function.

if percent_mt hadn't been calculated in previous qc yet: seurat_after_qc <- PercentageFeatureSet(seurat_after_qc,pattern = "^MT-", col.name = "percent.mt")


seurat_after_qc <- SCTransform(seurat_after_qc,
                               vars.to.regress = "percent_mt", #don't need to regress out nCount_RNA here as the model by principle accounts for sequencing depth 
                               variable.features.n = 2000) #default is 3000

Where is the new normalisation stored? Answer: stored in new assay called 'SCT' ScTransform may discard genes - so cannot just replace 'data' slot, as it is a dataframe with different dimensions

Explore the seurat_after_qc objects metadata and assays

glimpse(seurat_after_qc[[]])

Is there a change? Answer: less genes

Are there new columns in the metadata? Answer: nCount_SCT, nFeature_SCT

Exercise

Visulaisation

library(cowplot)
library_plot_compare <- VlnPlot(seurat_after_qc, features = c("nCount_RNA","nCount_SCT"), 
                                same.y.lims = TRUE)
feature_plot_compare <- VlnPlot(seurat_after_qc, features = c("nFeature_RNA","nFeature_RNA"),
                                same.y.lims = TRUE)
plot_grid(library_plot_compare, feature_plot_compare, nrow = 2)

Bonus-

Let's choose LYZ like day 1

DefaultAssay(seurat_after_qc) <- "SCT"
ggplot_lyz_raw <- ggplot(FetchData(seurat_after_qc,vars = "LYZ" , slot = "counts"), aes(LYZ)) +
    geom_histogram(fill = "grey", color = "black", bins = 100) +
    coord_cartesian(ylim = c(0, 500)) +
    cowplot::theme_cowplot()
ggplot_lyz_normalised <- ggplot(FetchData(seurat_after_qc,vars = "LYZ", slot = "data"), aes(LYZ))+
    geom_histogram(fill = "grey", color = "black", bins = 100) +
    coord_cartesian(ylim = c(0, 500)) +
    cowplot::theme_cowplot()
ggplot_lyz_scaled <- ggplot(FetchData(seurat_after_qc,vars = "LYZ", slot = "scale.data"),aes(LYZ))+
    geom_histogram(fill = "grey", color = "black", bins = 100) +
    coord_cartesian(ylim = c(0, 500)) +
    cowplot::theme_cowplot()

cowplot::plot_grid(ggplot_lyz_raw, ggplot_lyz_normalised,ggplot_lyz_scaled, ncol = 1)

Use the function VariableFeatures to pull out the 1:10 the variable genes after SCT and compare to 1:10 from the RNA assay

Do we need to change any arguments to get the variables genes specific to the SCT or RNA assay

FindVariableFeatures(seurat_after_qc, assay = "RNA")@assays$RNA@var.features[100:110]
FindVariableFeatures(seurat_after_qc, assay = "SCT")@assays$RNA@var.features[100:110]

How do the two gene lists compare?

Identical for first 10, also for 100:110

Exercise

Dimensionality reduction on SCT transformed data

DefaultAssay(seurat_after_qc) <- "SCT"
seurat_after_qc <- RunPCA(seurat_after_qc,
                          reduction.name = "sct.pca")

Check to see what reductions are now present in the object > [1] "pca" "umap" "sct.pca"

Reductions(seurat_after_qc)
ElbowPlot(seurat_after_qc, ndims = 50, reduction = "sct.pca")

How can we change the reduction name from default "umap" to "sct.umap"

How can we specify that we want to use PCA run on the SCT Assay (sct.pca) in the previous step?

seurat_after_qc <- RunUMAP(seurat_after_qc,
                           dims = 1:20, 
                           reduction = "sct.pca",
                           reduction.name = "sct.umap")

Use DimPlot() to plot the umap. What happens if you try to specify different reductions with UMAPPlot?

Compare RNA based umap with sct.umap

p1 <- DimPlot(seurat_after_qc, reduction = "umap") + ggtitle("RNA UMAP")
p2 <- DimPlot(seurat_after_qc, reduction = "sct.umap") + ggtitle("SCT UMAP")
p1 + p2

Exercise

Clustering on SCTransformed data

seurat_after_qc <-  FindNeighbors(seurat_after_qc,
                                  reduction = "sct.pca", 
                                  dims = 1:20, 
                                  assay = "SCT") #don't need to specify names if using different assays, as will append prefix automatically
Graphs(seurat_after_qc)
seurat_after_qc <- FindClusters(seurat_after_qc,
                                resolution = 0.5,
                                graph.name = "SCT_snn") #snn (shared nearest neighbour) more robust than nn as bidirectional

Check cluster assignment between SCT and RNA workflow

If you use the same resolution = o.5 and dims as RNA workflow do you get the same number of cluster or more?

Are cells in the same cluster across both RNA and SCT

table(seurat_after_qc$RNA_snn_res.0.5, seurat_after_qc$SCT_snn_res.0.5)
p1 <- DimPlot(seurat_after_qc,reduction = "umap",group.by = "RNA_snn_res.0.5") + ggtitle("RNA UMAP")
p2 <- DimPlot(seurat_after_qc,reduction = "sct.umap", group.by = "SCT_snn_res.0.5") + ggtitle("SCT UMAP")
p1 + p2

Plot some known cell-type markers for PBMC datasets, does the SCT better separate the celltypes?

CD14+ Monocyte : LYZ, CD14 CD16 Monocytes : FCGR3A, MS4A7 CD4 T : CD4, IL76 CD8 T : CD8A, CD3D NK : GNLY, GZMB,NKG7 B Cell : MS4A1 , CD79A DC : CST3, FCER1A Platelets : PPBP

FeaturePlot(seurat_after_qc, features = c("CD14","FCGR3A","CD4","CD8A","GNLY","MS4A1","CST3","PPBP"), reduction = "umap")
FeaturePlot(seurat_after_qc, features = c("CD14","FCGR3A","CD4","CD8A","GNLY","MS4A1","CST3","PPBP"), reduction = "sct.umap")

FeaturePlot(seurat_after_qc, features = c("IGHA1","IGHG1","IGHG2","IGHG3","IGHD","IGHM","IGHE"), reduction = "umap")
FeaturePlot(seurat_after_qc, features = c("IGHA1","IGHG1","IGHG2","IGHG3","IGHD","IGHM","IGHE"), reduction = "sct.umap")

Calculate the markers for these clusters from either the RNA or SCT assay

Idents(seurat_after_qc) <- "RNA_snn_res.0.5"
seurat_markers_all <- FindAllMarkers(seurat_after_qc,
                                     assay = "RNA", 
                                     logfc.threshold = 0.25, 
                                     min.pct = 0.25)
Idents(seurat_after_qc) <- "SCT_snn_res.0.5"
seurat_markers_all_sct <- FindAllMarkers(seurat_after_qc,
                                         assay = "SCT", 
                                         logfc.threshold = 0.25,
                                         min.pct = 0.25)

write.csv(seurat_markers_all_sct, "seurat_markers_all_sct.csv")
saveRDS(seurat_after_qc, "seurat_after_qc_21022023.rds")

Bonus exercise to try in your own time:: Pathway analysis on Cluster markers for all clusters

First generate the list of markers for each cluster

seurat_clusters_results_filtered <- 
seurat_clusters_list <- 

We then run pathway analysis using gost() with multi_query = TRUE

# Choose Default assay based on if running pathway analyses on RNA or SCT results
DefaultAssay(seurat_after_qc) <- ""
# create a vector of  of all genes 
all_genes_id <- 

multi_gostquery_results_obj <- gost()

can you plot the results for different clusters together ?


gostplot()

Afternoon Session

Demultiplexing with hashtag oligos (HTOs)

Dataset : 12-HTO dataset from four human cell lines

Load in the UMI matrix for the RNA data and check the dimensions

hto12.umis <- readRDS("hto12_umi_mtx.rds")
head(hto12.umis)

What do rows and columns represent? Answer: rows represent genes, columns represent barcodes

Load in the HTO matrix and check the dimensions

This is really messy - lots of barcodes have shared HTO identities, which is why an algorithm is required to deconvolute this

hto12.htos <- readRDS("hto12_hto_mtx.rds")
head(hto12.htos) 

Now we only want to subset to those cell barcodes or cells (actually called as cells by cellRanger or EmptyDrops on the gene expression data) which are detected by both RNA and HTO matrices

joint.bcs <- intersect(rownames(hto12.htos), colnames(hto12.umis))
class(joint.bcs) #character vector
head(joint.bcs)

Subset the RNA matrix to only the joint.bcs cell barcodes and check the dimensions

hto12.umis.common <- hto12.umis[,joint.bcs]
head(hto12.umis.common) #25088 rather than 30000 cells now
hto12_object <- CreateSeuratObject(counts = hto12.umis.common,
                                   min.cells =3,
                                   min.features = 200)
hto12_object <- NormalizeData(hto12_object)
hto12_object <- FindVariableFeatures(hto12_object, selection.method = "vst")
hto12_object <- ScaleData(hto12_object, features = VariableFeatures(hto12_object))

Add HTO data as another assay to hto12_object

hto12.htos.common <- hto12.htos[colnames(hto12_object),#don't use joint.bcs as we've done some filtering when we created the seurat object
                                1:12] %>% #remove additional metadata
  t() #transpose to set HTOs as features (stored in rows)
glimpse(hto12.htos.common)
class(hto12.htos.common)

Now use CreateAssayObject() to add the subsetted HTO matrix to the already created hto12_object seurat object as a new assay called HTO


hto12_object[["HTO"]] <- CreateAssayObject(hto12.htos.common)

Normalise the HTO data , here we will use the CLR transformation with margin =1 (Default setting) CLR: Applies a centered log ratio transformation This is required because the HTO data is bimodal, i.e. the tag is either present or absent on the cell. This is quite different to the RNA counts.

# check the Default Assay
DefaultAssay(hto12_object) <- "HTO"

hto12_object <- NormalizeData(hto12_object, assay = "HTO", 
                              normalization.method = "CLR" , #HTO data is bimodal (each cell should be labelled with either the presence or absence of a label i.e. not normally distributed)
                              margin=1) #margin = 1 ~ by row (margin = 2 ~ by column)

Demultiplex cells based on HTO enrichment

Here we use Seurat Function HTODemux() to assign single cells to their original samples

hto12_object <- HTODemux(hto12_object, 
                         assay = "HTO", 
                         positive.quantile = 0.99) #p value cut-off for results assigning

Checkout the metadata column of the hto12_object, try to read the HTODemux() results output summary in the Value section to understand the results

head(hto12_object[[]]) #HTO_margin is a score representing the difference between HTO_maxID and HTO_secondID -> used to define doublets vs. singlets

Visualise the Demultiplexing results

We can visualise how many cells are classified as singlets, doublets and negative/ambiguous cells

Check the meta.data, which column do we want for this information?

table(hto12_object[[]]$hash.ID)

Visualize enrichment for selected HTOs with ridge plots

plot the max HTO signal for one of the HTO of each of the 4 cell lines (HEK, K562, KG1 and THP1) features with ridge plots using the RidgePlot() function

plot Max HTO signal

# Change the identities of the seurat object to the relevant metadata column

Idents(hto12_object) <- "hash.ID"
RidgePlot(hto12_object,
          assay = "HTO", 
          features = c("KG1-A","HEK-A","THP1-A","K562-A")) #chooseing four out of 12 samples

Visualize pairs of HTO signals to confirm mutual exclusivity in singlets between the same celline

a) plot scatter plot of 2 HTOs within the same cell line e.g. HEK, colour by (single/doublet/negative status)

b) plot scatter plot of 2 HTOs within the same cell line e.g. HEK, colour by HTO_maxID

c) plot scatter plot of 2 HTOs within the same cell line e.g. HEK, colour by HTO_secondID

use the function FeatureScatter()

DefaultAssay(hto12_object) <- "HTO"
FeatureScatter(hto12_object, 
               feature1 = "HEK-A", feature2 = "HEK-B", 
               group.by = "HTO_secondID") # or "HTO_classification.global", "HTO_maxID"

what do you notice ?

1) SecondID should be entireley random and is 2) lowly expressed on both markers = negative

Bonus Exercise

Plot scatter plot of 2 HTOs across different cell lines e.g. K562 vs KG1 and colour by (single/doublet/negative status) and HTO_max ID

Compare number of RNA UMIs for singlets, doublets and negative cells

What is a suitable plot for such comparisons?

Answer:

Idents(hto12_object) <- 

Wuestion: what do you notice?

Answer:

Visualize HTO signals in a heatmap , lookup HTOHeatmap()

HTOHeatmap(hto12_object)

What do you notice? good confidence, almost entire heatmap coloured by the extreme ends of the colour spectrum

Generate a two dimensional tSNE or UMAP embedding for HTOs. Here we are grouping cells by singlets and doublets ONLY for simplicity.

Do we need to subset our object?

If so what are we subsetting out?


Idents(hto12_object) <- 
hto12_object.subset <- subset()

Run UMAP/TSNE

what assay are we running UMAP/tsne for ?

look up the arguments in RunUMAP() and/or RunTSNE() functions

check which arguments in RunUMAP/RunUMAP/RunTSNE can be used to change the name of the reduction from defauult name of pca/umap/tsne to custom name

before we Run UMAP, we need to scale and run PCA like we did in the normal single cell workflow

Answer:

# Calculate a tSNE & UMAP embedding of the HTO data
DefaultAssay(hto12_object.subset) <- "HTO"

hto12_object.subset <- RunUMAP()

check the Reductions in the object

Reductions()

Plot the UMAP/tsne for the HTO assay

• colour by if singlet/doublet

• colour by HTO final classification results (hash.ID)

what do you notice about the cluustering on tthe UMAP/tsne, does the number of clusters mean anything?

Answer:

what do you notice about the cloud of cells surrounding each cluster?

Answer:

Bonus exercises

You can also visualize the more detailed classification result by group.by HTO_maxID before plotting.

What happens if you group.by the UMAP/TSNE plot by HTO..maxID?

Answer:

Cluster and visualize cells using the usual scRNA-seq workflow, and examine for the potential presence of batch effects.

do we need to rerun FindVariableFeatures() and ScaleData() again?

Answer :

what other steps do we need run to get visualise our RNA data as UMAP/t-SNE coloured by doublets/singlets and celltypes?

Answer:

DefaultAssay(hto12_object.subset) <- "RNA"
# Run PCA on most variable features
hto12_object.subset <- 
hto12_object.subset <- 

hto12_object.subset <- RunPCA(hto12_object.subset)
hto12_object.subset <- RunUMAP(hto12_object.subset, dims = 1:8)

Plot RNA based UMAP

group.by hash.ID

create a new seurat object meta.data column called _cell_line , which removes "_A or _B or _C " in the hash.ID and replaces it with "", to create a new meta.data with only the cell-line info

#we create another metadata column based on the hash.id column, where we gsub the HTO tag info (-A,-B,-C) for each cell line to plot only the cell lien names to see if we have batch effect

hto12_object.subset$cell_line <- gsub(pattern = "[-ABC]")
DimPlot()

what does our RNA based clustering on the UMAP/T-SNE show?

Answer:

Bonus exercise (try in your own time)

create a second seurat object based , using the code above, and rerun the HTODemux() with a different value of positive quantile.

try to check if the classification changes massively if you adjusted the threshold for classification by playing around with the positive.quantile argument from the default.