The SCTransform() command replaces the NormalizeData(), ScaleData() and FindVariableFeatures() steps run for the RNA assay on Seurat day 1.
Should we remove any confounding variables, like we did for the RNA assay on day 1?
Do we want to use the same number of variable features (n = 1000), or more than we used with FindVariableFeatures()?
If percent_mt has not been calculated during the earlier QC: seurat_after_qc <- PercentageFeatureSet(seurat_after_qc, pattern = "^MT-", col.name = "percent_mt")
seurat_after_qc <- SCTransform(seurat_after_qc,
vars.to.regress = "percent_mt", #don't need to regress out nCount_RNA here as the model by principle accounts for sequencing depth
variable.features.n = 2000) #default is 3000
Where is the new normalisation stored?
Answer: stored in new assay called 'SCT'
SCTransform may discard genes, so it cannot simply replace the 'data' slot of the RNA assay: the result is a matrix with different dimensions
Explore the seurat_after_qc object's metadata and assays
glimpse(seurat_after_qc[[]])
Is there a change?
Answer: fewer genes
Are there new columns in the metadata?
Answer: nCount_SCT, nFeature_SCT
Exercise
Visualisation
The library size and number of features detected per cell are already present in the Seurat object.
When you run SCTransform() you get two new variables for library size and features for the SCT normalisation
Use the function VlnPlot() to compare
(i) RNA assay vs SCT assay library size in a single plot
(ii) features detected in RNA vs SCT in a single plot
Check to see how you can have fixed y axes (ylims) in the VlnPlot() function
Check to see what reductions are now present in the object
Reductions(seurat_after_qc)
# [1] "pca" "umap" "sct.pca"
First, visualise the amount of variance explained by the top principal components for SCTransformed data (number of your choice).
How many principal components would you use for downstream analyses?
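To make "variance explained" concrete, here is a minimal base R sketch using prcomp() on the built-in USArrests data, which is only a stand-in for the expression matrix. In a Seurat object the per-component standard deviations would normally come from Stdev(seurat_after_qc, reduction = "sct.pca"); in that case the proportions are relative to the components that were computed, not to the total variance. The 90% cut-off is an arbitrary assumption for illustration.

```r
# Stand-in data: a PCA on a built-in dataset, not on the Seurat object
pca <- prcomp(USArrests, scale. = TRUE)

# Proportion of variance explained by each principal component
var_explained <- pca$sdev^2 / sum(pca$sdev^2)

# Cumulative proportion, useful for choosing a cut-off
cum_var <- cumsum(var_explained)

# First component at which the (assumed) 90% threshold is reached
n_pcs <- which(cum_var >= 0.9)[1]

print(round(var_explained, 3))
print(n_pcs)
```

In practice the elbow of the plot is usually judged by eye rather than with a fixed threshold.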
Compute the graph of nearest neighbours using the function FindNeighbors().
Which principal components are used by default?
Instead, specify the number of components that you have chosen.
Have you chosen the right reduction and assay?
seurat_after_qc <- FindNeighbors(seurat_after_qc,
reduction = "sct.pca",
dims = 1:20,
assay = "SCT") # no need to set graph.name here: graph names are prefixed with the assay name automatically (SCT_nn, SCT_snn)
Graphs(seurat_after_qc)
Finally, compute cluster labels.
What is the default setting for the resolution argument?
Instead, set it to 0.5.
Do you expect more or fewer clusters following that change?
What other parameters would you also try to experiment with?
Should we specify the graph.name?
seurat_after_qc <- FindClusters(seurat_after_qc,
resolution = 0.5,
graph.name = "SCT_snn") #snn (shared nearest neighbour) more robust than nn as bidirectional
Check cluster assignment between SCT and RNA workflow
If you use the same resolution (0.5) and dims as in the RNA workflow, do you get the same number of clusters or more?
Are cells in the same cluster across both RNA and SCT?
Visualise the SCT cluster labels on the SCT transformed UMAP scatter plot and the RNA cluster labels on the RNA UMAP
How would you describe the agreement between the UMAP layout and clustering for SCT vs RNA Assay results?
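One way to check whether cells end up in the same cluster in both workflows is a cross-tabulation of the two label vectors. The sketch below uses made-up labels for six cells. In the real object the labels would be metadata columns; their names depend on how the clustering was run (for example RNA_snn_res.0.5 and SCT_snn_res.0.5, which is an assumption here).

```r
# Toy cluster labels for six cells (hypothetical values, not from the dataset)
rna_clusters <- factor(c("0", "0", "1", "1", "2", "2"))
sct_clusters <- factor(c("0", "0", "1", "2", "2", "2"))

# Cross-tabulate: rows are RNA clusters, columns are SCT clusters
cluster_overlap <- table(RNA = rna_clusters, SCT = sct_clusters)
print(cluster_overlap)

# Fraction of the cells in each RNA cluster that fall into each SCT cluster
print(prop.table(cluster_overlap, margin = 1))
```

Cluster numbers are arbitrary, so a good match shows up as one dominant cell per row, not necessarily on the diagonal.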
Bonus exercise to try in your own time: Pathway analysis on cluster markers for all clusters
Choose the Seurat marker results generated by either the RNA workflow or the SCT workflow
We will be using the gprofiler2 gost() function for multiple gene lists at the same time
First we need to filter the Seurat top significant (p_adj < 0.05) upregulated genes with a logFC threshold (decided by you) for each cluster: use dplyr::group_by() and dplyr::filter() to get the gene list for each cluster, then select only the cluster and gene columns
We then use split() on the filtered_df to divide the gene markers into a list of character vectors containing genes, split by cluster
You can refer to the pathway analysis code from the previous tutorial, but use human not mouse pathways
First generate the list of markers for each cluster
We then run pathway analysis using gost() with multi_query = TRUE
To generate all_genes_id, we use all genes present in either the RNA assay or the SCT assay, as we have already filtered out lowly expressed genes not present in certain cells.
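The filter-and-split step described above can be tried on a toy table first. This is a minimal base R sketch: subset() is used here instead of dplyr so that it runs without extra packages. The column names p_val_adj and avg_log2FC are assumed to match the marker table you generated, and the values and the threshold of 1 are made up.

```r
# Toy marker table with assumed column names (hypothetical values)
markers <- data.frame(
  cluster    = c("0", "0", "0", "1", "1"),
  gene       = c("LYZ", "CD14", "MS4A1", "CD79A", "PPBP"),
  p_val_adj  = c(0.001, 0.01, 0.20, 0.0001, 0.03),
  avg_log2FC = c(2.1, 1.5, 3.0, 2.4, 0.2),
  stringsAsFactors = FALSE
)

logfc_threshold <- 1  # assumed threshold, choose your own

# Keep significant, upregulated genes and only the cluster and gene columns
filtered_df <- subset(
  markers,
  p_val_adj < 0.05 & avg_log2FC > logfc_threshold,
  select = c(cluster, gene)
)

# One character vector of genes per cluster
gene_lists <- split(filtered_df$gene, filtered_df$cluster)
print(gene_lists)
```

A named list of this shape is what a multi-query gost() call takes as its query argument.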
# Choose Default assay based on if running pathway analyses on RNA or SCT results
DefaultAssay(seurat_after_qc) <- ""
# create a vector of all genes
all_genes_id <-
multi_gostquery_results_obj <- gost()
Can you plot the results for different clusters together?
gostplot()
Afternoon Session
Demultiplexing with hashtag oligos (HTOs)
Dataset : 12-HTO dataset from four human cell lines
The data represent single cells collected from four cell lines: HEK, K562, KG1 and THP1
Each cell line was further split into three samples (12 samples in total)
Each sample was labeled with a hashing antibody mixture (CD29 and CD45), pooled, and run on a single lane of 10X.
Based on this design, we should be able to detect doublets both across and within cell types
Load in the UMI matrix for the RNA data and check the dimensions
Now we only want to keep those cell barcodes (called as cells by Cell Ranger or EmptyDrops on the gene expression data) that are detected in both the RNA and HTO matrices
Check the class of the joint.bcs object. How many cell barcodes do we have in common?
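Finding the shared barcodes comes down to an intersect() between the barcode names of the two matrices. The sketch below uses two tiny made-up matrices, with barcodes as columns in the RNA matrix and as rows in the HTO matrix (as the later subsetting code in these notes suggests). All object names except joint.bcs are hypothetical.

```r
# Toy RNA matrix: genes in rows, barcodes in columns (hypothetical values)
rna_umis <- matrix(
  1:8, nrow = 2,
  dimnames = list(c("GeneA", "GeneB"), c("bc1", "bc2", "bc3", "bc4"))
)

# Toy HTO matrix: barcodes in rows, HTO tags in columns (hypothetical values)
hto_counts <- matrix(
  1:6, nrow = 3,
  dimnames = list(c("bc2", "bc3", "bc5"), c("HTO-A", "HTO-B"))
)

# Barcodes present in both matrices
joint.bcs <- intersect(colnames(rna_umis), rownames(hto_counts))
print(class(joint.bcs))
print(length(joint.bcs))

# Subset the RNA matrix to the shared barcodes and check the dimensions
rna_common <- rna_umis[, joint.bcs, drop = FALSE]
print(dim(rna_common))
```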
Subset the RNA matrix to only the joint.bcs cell barcodes and check the dimensions
hto12.umis.common <- hto12.umis[,joint.bcs]
dim(hto12.umis.common) #25088 rather than 30000 cells now
Create a Seurat object with the RNA (UMI count matrix) data using only the joint bcs
Name the object hto12_object
Include features detected in at least 3 cells, and cells where at least 200 features detected
Normalise with log normalisation, find variable genes and scale the RNA data
First subset the HTO matrix to those cell barcodes which are now in the hto12_object Seurat object and make sure that the features only consist of the HTO tags
Is our subsetted hto12.htos.common in the right format? If not, what do we do to get it into the right format before adding it as another assay?
Answer :
hto12.htos.common <- hto12.htos[colnames(hto12_object),#don't use joint.bcs as we've done some filtering when we created the seurat object
1:12] %>% #remove additional metadata
t() #transpose to set HTOs as features (stored in rows)
glimpse(hto12.htos.common)
class(hto12.htos.common)
Now use CreateAssayObject() to add the subsetted HTO matrix to the already created hto12_object seurat object as a new assay called HTO
Do we want to do any further filtering on the HTO object?
Answer :
Normalise the HTO data; here we will use the CLR transformation with margin = 1 (default setting)
CLR: applies a centered log ratio transformation
This is required because the HTO data is bimodal, i.e. the tag is either present or absent on the cell. This is quite different to the RNA counts.
# set the default assay
DefaultAssay(hto12_object) <- "HTO"
hto12_object <- NormalizeData(hto12_object, assay = "HTO",
normalization.method = "CLR" , #HTO data is bimodal (each cell should be labelled with either the presence or absence of a label i.e. not normally distributed)
margin=1) #margin = 1 ~ by row (margin = 2 ~ by column)
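To see what a centered log ratio does to bimodal tag counts, here is a minimal base R sketch. It assumes a log1p-based variant of CLR (each count divided by a geometric-mean-like term computed from log1p of the non-zero counts); the exact formula used by your Seurat version should be checked in its source. The counts are made up.

```r
# CLR for one vector of counts (assumed log1p-based variant)
clr <- function(x) {
  geo_mean <- exp(sum(log1p(x[x > 0])) / length(x))
  log1p(x / geo_mean)
}

# Toy HTO counts: HTOs in rows, cells in columns (hypothetical values)
hto_counts <- matrix(
  c(500,   3, 4, 2,
      5, 450, 6, 3),
  nrow = 2, byrow = TRUE,
  dimnames = list(c("HTO-A", "HTO-B"), paste0("cell", 1:4))
)

# margin = 1: normalise each HTO (row) across all cells
clr_by_row <- t(apply(hto_counts, 1, clr))
print(round(clr_by_row, 2))
```

The cell carrying a tag keeps a high value for that tag, while the background cells are pushed towards zero.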
Demultiplex cells based on HTO enrichment
Here we use Seurat Function HTODemux() to assign single cells to their original samples
hto12_object <- HTODemux(hto12_object,
assay = "HTO",
positive.quantile = 0.99) # quantile of the inferred 'negative' distribution for each hashtag, above which a cell is called positive (default 0.99)
Check out the metadata columns of the hto12_object; read the Value section of the HTODemux() help page to understand the output
head(hto12_object[[]]) # HTO_margin is the difference between the signals of HTO_maxID and HTO_secondID; it helps to judge doublets vs. singlets
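The logic behind the singlet/doublet/negative call can be sketched in a few lines of base R. This is a simplified illustration, not HTODemux() itself: the per-HTO thresholds below are made-up numbers, whereas HTODemux() derives them from the positive.quantile of a background distribution it fits for each hashtag.

```r
# Toy HTO counts: HTOs in rows, cells in columns (hypothetical values)
hto_counts <- matrix(
  c(120,   2, 110, 1,
      3, 150, 140, 2),
  nrow = 2, byrow = TRUE,
  dimnames = list(c("HTO-A", "HTO-B"), paste0("cell", 1:4))
)

# Hypothetical per-HTO cut-offs (one per row of hto_counts)
thresholds <- c("HTO-A" = 20, "HTO-B" = 25)

# A cell is positive for an HTO if its count exceeds that HTO's threshold;
# the threshold vector is recycled down the rows of each column
is_positive <- hto_counts > thresholds

# Number of positive HTOs per cell decides the global class
n_positive <- colSums(is_positive)
classification <- ifelse(
  n_positive == 0, "Negative",
  ifelse(n_positive == 1, "Singlet", "Doublet")
)
print(classification)
```

Raising the quantile raises the thresholds, so fewer cells are called positive for any given tag.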
Visualise the Demultiplexing results
We can visualise how many cells are classified as singlets, doublets and negative/ambiguous cells
Check the meta.data, which column do we want for this information?
table(hto12_object[[]]$hash.ID)
Visualize enrichment for selected HTOs with ridge plots
Plot the max HTO signal for one HTO from each of the 4 cell lines (HEK, K562, KG1 and THP1) with ridge plots, using the RidgePlot() function
Plot max HTO signal
# Change the identities of the seurat object to the relevant metadata column
Idents(hto12_object) <- "hash.ID"
RidgePlot(hto12_object,
assay = "HTO",
features = c("KG1-A","HEK-A","THP1-A","K562-A")) #choosing four out of 12 samples
Visualize pairs of HTO signals to confirm mutual exclusivity in singlets within the same cell line
a) plot scatter plot of 2 HTOs within the same cell line e.g. HEK, colour by (single/doublet/negative status)
b) plot scatter plot of 2 HTOs within the same cell line e.g. HEK, colour by HTO_maxID
c) plot scatter plot of 2 HTOs within the same cell line e.g. HEK, colour by HTO_secondID
Look up the arguments in the RunUMAP() and/or RunTSNE() functions
Check which arguments in RunPCA/RunUMAP/RunTSNE can be used to change the name of the reduction from the default name of pca/umap/tsne to a custom name
Before we run UMAP, we need to scale and run PCA like we did in the normal single cell workflow
Answer:
# Calculate a tSNE & UMAP embedding of the HTO data
DefaultAssay(hto12_object.subset) <- "HTO"
hto12_object.subset <- RunUMAP()
check the Reductions in the object
Reductions()
Plot the UMAP/tsne for the HTO assay
which reduction shall we plot?
• colour by if singlet/doublet
• colour by HTO final classification results (hash.ID)
Check the arguments for how to label the clusters by the cluster identity
Can you change the label size?
What do you notice about the clustering on the UMAP/tSNE? Does the number of clusters mean anything?
Answer:
what do you notice about the cloud of cells surrounding each cluster?
Answer:
Bonus exercises
You can also visualize the more detailed classification result by setting group.by to HTO_maxID before plotting.
What happens if you group the UMAP/tSNE plot by HTO_maxID?
Answer:
Cluster and visualize cells using the usual scRNA-seq workflow, and examine for the potential presence of batch effects.
Do we need to rerun FindVariableFeatures() and ScaleData() again?
Answer :
What other steps do we need to run to visualise our RNA data as a UMAP/t-SNE coloured by doublets/singlets and cell types?
Answer:
DefaultAssay(hto12_object.subset) <- "RNA"
# Run PCA on most variable features
hto12_object.subset <- FindVariableFeatures(hto12_object.subset)
hto12_object.subset <- ScaleData(hto12_object.subset)
hto12_object.subset <- RunPCA(hto12_object.subset)
hto12_object.subset <- RunUMAP(hto12_object.subset, dims = 1:8)
Plot RNA based UMAP
group.by hash.ID
Create a new Seurat object meta.data column called cell_line, which removes the "-A", "-B" or "-C" suffix from hash.ID (replacing it with ""), so that the new column holds only the cell-line information
# we create another metadata column based on the hash.ID column, where we gsub the HTO tag info (-A, -B, -C) for each cell line, to plot only the cell line names and see whether we have a batch effect
hto12_object.subset$cell_line <- gsub(pattern = "-[ABC]$", replacement = "", x = hto12_object.subset$hash.ID)
DimPlot()
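The suffix-stripping step can be tried on a toy character vector before applying it to the metadata. This sketch assumes the hash.ID values look like "HEK-A" (cell line, hyphen, replicate letter), as in the RidgePlot() call earlier in these notes; the vector itself is made up.

```r
# Toy hash.ID values (hypothetical, following the "<cell line>-<A|B|C>" pattern)
hash_id <- c("HEK-A", "K562-B", "KG1-C", "THP1-A", "Doublet")

# Remove a trailing "-A", "-B" or "-C"; other labels are left untouched
cell_line <- gsub(pattern = "-[ABC]$", replacement = "", x = hash_id)
print(cell_line)
```

Anchoring the pattern with "$" means only the suffix is removed, so a label such as "Doublet" passes through unchanged.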
what does our RNA based clustering on the UMAP/T-SNE show?
Answer:
Bonus exercise (try in your own time)
Create a second Seurat object using the code above, and rerun HTODemux() with a different value of positive.quantile.
Check whether the classification changes massively when you adjust the classification threshold by changing the positive.quantile argument from its default.
title: "Example code for single-cell analysis with Seurat, day 2"
author: "Devika Agarwal, updated by Carla Cohen"
date: "25/10/2022"
output: html_document
Exercise
Read in the RDS object we created and saved from Seurat day 1
Use the readRDS() function to read in the previously saved object
Apply SCTransform normalisation
Use the SCTransform() function
SCTransform vignette: https://satijalab.org/seurat/articles/sctransform_vignette.html
The SCTransform() command replaces the NormalizeData(), ScaleData() and FindVariableFeatures() steps run for the RNA assay on Seurat day 1
Should we remove any confounding variables, like we did for the RNA assay on day 1?
Do we want to use the same number of variable features (n = 1000), or more than we used with FindVariableFeatures()?
If percent_mt has not been calculated during the earlier QC: seurat_after_qc <- PercentageFeatureSet(seurat_after_qc, pattern = "^MT-", col.name = "percent_mt")
Where is the new normalisation stored? Answer: stored in a new assay called 'SCT'
SCTransform may discard genes, so it cannot simply replace the 'data' slot of the RNA assay: the result is a matrix with different dimensions
Explore the seurat_after_qc object's metadata and assays
Is there a change? Answer: fewer genes
Are there new columns in the metadata? Answer: nCount_SCT, nFeature_SCT
Exercise
Visualisation
The library size and number of features detected per cell are already present in the Seurat object.
When you run SCTransform() you get two new variables for library size and features for the SCT normalisation
Use the function VlnPlot() to compare
(i) RNA assay vs SCT assay library size in a single plot
(ii) features detected in RNA vs SCT in a single plot
Check to see how you can have fixed y axes (ylims) in the VlnPlot() function
Bonus
Let's choose LYZ like day 1
Use the function VariableFeatures() to pull out the top 10 variable genes (1:10) after SCT and compare them with the top 10 from the RNA assay
Do we need to change any arguments to get the variable genes specific to the SCT or RNA assay?
How do the two gene lists compare?
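Once you have the two top-10 vectors, base R set operations give a quick comparison. The gene vectors below are made up for illustration and shortened to five entries each; in the exercise they would come from VariableFeatures() on each assay.

```r
# Hypothetical top variable genes from each workflow (shortened to five each)
sct_top <- c("LYZ", "S100A9", "GNLY", "PPBP", "IGKC")
rna_top <- c("PPBP", "LYZ", "S100A9", "IGLL5", "GNLY")

# Genes found in both lists, and genes unique to each list
shared   <- intersect(sct_top, rna_top)
only_sct <- setdiff(sct_top, rna_top)
only_rna <- setdiff(rna_top, sct_top)

print(shared)
print(only_sct)
print(only_rna)
```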
Exercise
Dimensionality reduction on SCT transformed data
Run a principal component analysis and UMAP on the Seurat object.
Check the Default assay
Do we want to change the reduction.name argument so that we can still keep the RNA assay based PCA results?
Check to see what reductions are now present in the object
[1] "pca" "umap" "sct.pca"
First, visualise the amount of variance explained by the top principal components for SCTransformed data (number of your choice). How many principal components would you use for downstream analyses?
Do we need to specify the reduction?
How can we change the reduction name from the default "umap" to "sct.umap"?
How can we specify that we want to use the PCA run on the SCT assay (sct.pca) in the previous step?
Use DimPlot() to plot the umap. What happens if you try to specify different reductions with UMAPPlot?
Compare RNA based umap with sct.umap
Exercise
Clustering on SCTransformed data
Compute the graph of nearest neighbours using the function FindNeighbors(). Which principal components are used by default? Instead, specify the number of components that you have chosen. Have you chosen the right reduction and assay?
Finally, compute cluster labels. What is the default setting for the resolution argument? Instead, set it to 0.5. Do you expect more or fewer clusters following that change? What other parameters would you also try to experiment with? Should we specify the graph.name?
Check cluster assignment between SCT and RNA workflow
If you use the same resolution (0.5) and dims as in the RNA workflow, do you get the same number of clusters or more?
Are cells in the same cluster across both RNA and SCT?
Plot some known cell-type markers for PBMC datasets. Does SCT separate the cell types better?
CD14+ Monocyte : LYZ, CD14
CD16 Monocytes : FCGR3A, MS4A7
CD4 T : CD4, IL7R
CD8 T : CD8A, CD3D
NK : GNLY, GZMB, NKG7
B Cell : MS4A1, CD79A
DC : CST3, FCER1A
Platelets : PPBP
Calculate the markers for these clusters from either the RNA or SCT assay
Bonus exercise to try in your own time: Pathway analysis on cluster markers for all clusters
Choose the Seurat marker results generated by either the RNA workflow or the SCT workflow
We will be using the gprofiler2 gost() function for multiple gene lists at the same time
First we need to filter the Seurat top significant (p_adj < 0.05) upregulated genes with a logFC threshold (decided by you) for each cluster: use dplyr::group_by() and dplyr::filter() to get the gene list for each cluster, then select only the cluster and gene columns
We then use split() on the filtered_df to divide the gene markers into a list of character vectors containing genes, split by cluster
You can refer to the pathway analysis code from the previous tutorial, but use human not mouse pathways
First generate the list of markers for each cluster
We then run pathway analysis using gost() with multi_query = TRUE
Can you plot the results for different clusters together?
Afternoon Session
Demultiplexing with hashtag oligos (HTOs)
Dataset : 12-HTO dataset from four human cell lines
Load in the UMI matrix for the RNA data and check the dimensions
What do rows and columns represent? Answer: rows represent genes, columns represent barcodes
Load in the HTO matrix and check the dimensions
This is really messy - lots of barcodes have shared HTO identities, which is why an algorithm is required to deconvolute this
Now we only want to keep those cell barcodes (called as cells by Cell Ranger or EmptyDrops on the gene expression data) that are detected in both the RNA and HTO matrices
Subset the RNA matrix to only the joint.bcs cell barcodes and check the dimensions
Create a Seurat object with the RNA (UMI count matrix) data using only the joint bcs
Name the object hto12_object
Include features detected in at least 3 cells, and cells where at least 200 features are detected
Normalise with log normalisation, find variable genes and scale the RNA data
Add HTO data as another assay to hto12_object
First subset the HTO matrix to those cell barcodes which are now in the hto12_object Seurat object and make sure that the features only consist of the HTO tags
Is our subsetted hto12.htos.common in the right format? If not, what do we do to get it into the right format before adding it as another assay? Answer :
Now use CreateAssayObject() to add the subsetted HTO matrix to the already created hto12_object Seurat object as a new assay called HTO
Normalise the HTO data; here we will use the CLR transformation with margin = 1 (default setting)
CLR: applies a centered log ratio transformation
This is required because the HTO data is bimodal, i.e. the tag is either present or absent on the cell. This is quite different to the RNA counts.
Demultiplex cells based on HTO enrichment
Here we use the Seurat function HTODemux() to assign single cells to their original samples
Check out the metadata columns of the hto12_object; read the Value section of the HTODemux() help page to understand the output
Visualise the demultiplexing results
We can visualise how many cells are classified as singlets, doublets and negative/ambiguous cells
Check the meta.data, which column do we want for this information?
Visualize enrichment for selected HTOs with ridge plots
Plot the max HTO signal for one HTO from each of the 4 cell lines (HEK, K562, KG1 and THP1) with ridge plots, using the RidgePlot() function
Plot max HTO signal
Visualize pairs of HTO signals to confirm mutual exclusivity in singlets within the same cell line
a) plot scatter plot of 2 HTOs within the same cell line e.g. HEK, colour by (single/doublet/negative status)
b) plot scatter plot of 2 HTOs within the same cell line e.g. HEK, colour by HTO_maxID
c) plot scatter plot of 2 HTOs within the same cell line e.g. HEK, colour by HTO_secondID
Use the function FeatureScatter()
What do you notice?
Answer: 1) HTO_secondID should be entirely random, and it is; 2) cells with low signal for both markers are negative
Bonus Exercise
Plot scatter plot of 2 HTOs across different cell lines e.g. K562 vs KG1 and colour by (single/doublet/negative status) and HTO_max ID
Compare number of RNA UMIs for singlets, doublets and negative cells
What is a suitable plot for such comparisons?
Answer:
Question: what do you notice?
Answer:
Visualize HTO signals in a heatmap, look up HTOHeatmap()
What do you notice? Answer: good confidence, almost the entire heatmap is coloured by the extreme ends of the colour spectrum
Generate a two dimensional tSNE or UMAP embedding for HTOs. Here we are grouping cells by singlets and doublets ONLY for simplicity.
Do we need to subset our object?
If so what are we subsetting out?
Run UMAP/TSNE
what assay are we running UMAP/tsne for ?
Look up the arguments in the RunUMAP() and/or RunTSNE() functions
Check which arguments in RunPCA/RunUMAP/RunTSNE can be used to change the name of the reduction from the default name of pca/umap/tsne to a custom name
before we Run UMAP, we need to scale and run PCA like we did in the normal single cell workflow
Answer:
check the Reductions in the object
Plot the UMAP/tsne for the HTO assay
• colour by if singlet/doublet
• colour by HTO final classification results (hash.ID)
Check the arguments for how to label the clusters by the cluster identity
Can you change the label size?
What do you notice about the clustering on the UMAP/tSNE? Does the number of clusters mean anything?
Answer:
What do you notice about the cloud of cells surrounding each cluster?
Answer:
Bonus exercises
You can also visualize the more detailed classification result by setting group.by to HTO_maxID before plotting.
What happens if you group the UMAP/tSNE plot by HTO_maxID?
Answer:
Cluster and visualize cells using the usual scRNA-seq workflow, and examine for the potential presence of batch effects.
Do we need to rerun FindVariableFeatures() and ScaleData() again?
Answer :
What other steps do we need to run to visualise our RNA data as a UMAP/t-SNE coloured by doublets/singlets and cell types?
Answer:
Plot RNA based UMAP
group.by hash.ID
Create a new Seurat object meta.data column called cell_line, which removes the "-A", "-B" or "-C" suffix from hash.ID (replacing it with ""), so that the new column holds only the cell-line information
what does our RNA based clustering on the UMAP/T-SNE show?
Answer:
Bonus exercise (try in your own time)
Create a second Seurat object using the code above, and rerun HTODemux() with a different value of positive.quantile.
Check whether the classification changes massively when you adjust the classification threshold by changing the positive.quantile argument from its default.