Closed benjytan88 closed 6 years ago
We don't include a native function for this in Seurat, though it would be easy to do this on your own if you were interested. You could pull out your gene of interest (for example object@data["Elk1",]), and count the non-zero entries.
I see. Yes, I am doing that but to count them by clusters it's quite the hassle. That's fine then, I will stick to my current method. Thank you for your reply.
Hi, I've been trying to find a way to get the percentage (number) of cells expressing specific genes. Should I retrieve it from the features list? Is there a native function to get these numbers in Seurat v3.0? Thanks!
Hi, I've been trying to find a way to get the percentage (number) of cells expressing specific genes. Should I retrieve it from the features list? Is there a native function to get these numbers in Seurat v3.0? Thanks!
I was using their FindMarkers function with a loose threshold to calculate pct. The only issue is it may not give you all the genes you want
@brianpenghe @brunetc I wrote a small script based on functions in Seurat V3.0. It calculates percentage of total cells expressing a gene (raw counts > 0) by metadata groups. Let me know if you have suggestions. example:
PrctCellExpringGene(seurat_object ,genes =c("geneA","geneB"), group.by = "sample.ID") Markers Cell_proportion MS1.1 geneA 0.022727273 MS1.2 geneB 0.045337995 MS2.1 geneA 0.000000000 MS2.2 geneB 0.030033951 MS3.1 geneA 0.001821125 MS3.2 geneB 0.035410765
# updated 1/31/2020 to accommodate V3.1
# updated 2/4/2020 to output "NA" for genes not detected in certain subgroups
PrctCellExpringGene <- function(object, genes, group.by = "all"){
if(group.by == "all"){
prct = unlist(lapply(genes,calc_helper, object=object))
result = data.frame(Markers = genes, Cell_proportion = prct)
return(result)
}
else{
list = SplitObject(object, group.by)
factors = names(list)
results = lapply(list, PrctCellExpringGene, genes=genes)
for(i in 1:length(factors)){
results[[i]]$Feature = factors[i]
}
combined = do.call("rbind", results)
return(combined)
}
}
calc_helper <- function(object,genes){
counts = object[['RNA']]@counts
ncells = ncol(counts)
if(genes %in% row.names(counts)){
sum(counts[genes,]>0)/ncells
}else{return(NA)}
}
You can retrieve the number of cells expressing a gene my_gene
from a Seurat object my_object
in this way:
sum(GetAssayData(object = my_object, slot = "data")[my_gene,]>0)
The percentage of cells expressing that gene would be:
sum(GetAssayData(object = my_object, slot = "data")[my_gene,]>0)/nrow(my_object@meta.data)
The object my_object@meta.data
can be used to eventually extract cells (rownames) with a specific Identity or belonging to a specific cluster if you performed that analysis before, etc. After you have obtained a subset of cells of interest (my_cells
) you can obtain the same as above by passing them as colnames:
sum(GetAssayData(object = my_object, slot = "data")[my_gene,my_cells]>0)
sum(GetAssayData(object = my_object, slot = "data")[my_gene,my_cells]>0)/length(my_cells)
One way to get counts of cells with values > 0 for each cluster for a gene of interest is to use the data generated by calling VlnPlot.
For example:
library(dplyr)
p <- VlnPlot(object = my_object, features =c("my_gene"))
p$data %>% group_by(ident) %>% summarize(counts = sum(my_gene, na.rm = TRUE))
While not an elegant solution, does the trick for at least one gene of interest at time across clusters.
FWIW, the function posted by @Ryan-Zhu worked great for me to get total number of cells per "orig.ident" condition for my gene of interest.
Hi, I've been trying to find a way to get the percentage (number) of cells expressing specific genes. Should I retrieve it from the features list? Is there a native function to get these numbers in Seurat v3.0? Thanks!
I was using their FindMarkers function with a loose threshold to calculate pct. The only issue is it may not give you all the genes you want
@brianpenghe This is quite a clever trick. I am wondering why you think that it may not give all the genes. I tried it with entirely permissible thresholds: min.pct = 0, logfc.threshold = 0 and got data for 16k genes. However, total number of genes in the two groups which were compared is 20k. Note: The find marker task was performed on a sub cluster which may not have expression values for all the genes present in the original un-clustered data.
@Ryan-Zhu I like the function you wrote. It will be much better if we can group.by="seurat_clusters" Thanks! Yale
@Ryan-Zhu I like the function you wrote. It will be much better if we can group.by="seurat_clusters" Thanks! Yale
group.by
works with any columns that are present in your metadata. So if your YourObject$seurat_clusters
is not NULL you can definitely do group.by="seurat_clusters"
.
@Ryan-Zhu Thanks for your quick reply. When we run the function using "seurat_clusters" we receive the following error. Even if we run Idents(Object) <- Object$seurat_clusters
prior to running:
`Error in object[[group.by]] == x : comparison of these types is not implemented In addition: Warning message: In cells %||% colnames(x = object) :
Error in object[[group.by]] == x : comparison of these types is not implemented`
As verification we have "seurat_clusters" as a meta data column:
`> table(Object$seurat_clusters)
1 2 3 4 5 6 7 8 9 10 11 12 13 7928 7055 5367 5189 5007 4371 3921 1507 1322 709 672 571 511 `
Any suggestion?
Thanks so much!
Yale
@happar73 The error basically says object[[group.by]]
and x are not the same datatypes and are not comparable. Please try my updated version.
@Ryan-Zhu The updated one works. Thanks for your quick reply and thanks for your great help. I really appreciate it.
Many thanks, Yale
@Ryan-Zhu Thanks so much for this I work with Yale. This has saved us a great deal of work and it is a great function.
Is it possible to create a version that will report 2 meta data tables? For instance we have metadata column for "stim" (VEH, DRUG1, DRUG2, DRUG3, DRUG4). Does the function have the ability to report a table of cluster + stim?
@cpcook1 You can first split your object by "stim" and then run the function on the sub-objects individually.
stim.list <- SplitObject(YourObject, "stim")
results <- lapply(stim.list, PrctCellExpringGene, genes=genes, group.by="seurat_clusters")
@Ryan-Zhu Hi, How can I get the percent of cells expressing two genes (double positive cells for example KRT18+/APOE+)?
@brianpenghe @brunetc I wrote a small script based on functions in Seurat V3.0. It calculates percentage of total cells expressing a gene (raw counts > 0) by metadata groups. Let me know if you have suggestions. example:
PrctCellExpringGene(seurat_object ,genes =c("geneA","geneB"), group.by = "sample.ID") Markers Cell_proportion MS1.1 geneA 0.022727273 MS1.2 geneB 0.045337995 MS2.1 geneA 0.000000000 MS2.2 geneB 0.030033951 MS3.1 geneA 0.001821125 MS3.2 geneB 0.035410765
# updated 1/31/2020 to accommodate V3.1 # updated 2/4/2020 to output "NA" for genes not detected in certain subgroups PrctCellExpringGene <- function(object, genes, group.by = "all"){ if(group.by == "all"){ prct = unlist(lapply(genes,calc_helper, object=object)) result = data.frame(Markers = genes, Cell_proportion = prct) return(result) } else{ list = SplitObject(object, group.by) factors = names(list) results = lapply(list, PrctCellExpringGene, genes=genes) for(i in 1:length(factors)){ results[[i]]$Feature = factors[i] } combined = do.call("rbind", results) return(combined) } } calc_helper <- function(object,genes){ counts = object[['RNA']]@counts ncells = ncol(counts) if(genes %in% row.names(counts)){ sum(counts[genes,]>0)/ncells }else{return(NA)} }
Dear Ryan-Zhu thank you for sharing this, you should include this function in a paper so you will get cited. I am sure already many papers have exploited it ;).
All the best, Ale
@Ryan-Zhu God bless you, bro, what a great function! My question echos that of @Eli1360 how can we find the proportion of cells expressing two genes? Say INS+/PDX1+?
Edit: I thought id share with you all what I'm doing. Maybe it will help someone 😄 This is what I'm doing right now (disclaimer I do things the hard way because I suck at writing code, but it works so who cares):
# Problem: Find cells coexpressing a certain gene
# Solution:
# Here we assume the two genes are INS and PDX1, I want to know the percentage of Beta cells that are INS+/PDX1+
# Subset your cluster of interest for as an example I am subsetting a cluster called 'beta'
# The following creates a seurat object of only the cluster 'beta'
betacells <- subset(pancreas.integrated, idents = c("Beta"))
# Point your new cluster towards the object you will use to perform calculations.
# I like doing this because otherwise, you have to write lengths of redundant code
# Also I'm like really lazy with code
ThisWayIsTotallyMentalButItWorks <- betacells
GOI1 <- 'PDX1' #you will have to name your first gene here, im choosing PDX1 as an example
GOI2 <- 'INS' #you will have to name your second gene here, im choosing INS as an example
GOI1.cutoff <- 1 #Assumption: gene count cutoff is 1, assuming atleast 1 count is REAL
GOI2.cutoff <- 1 #Assumption: gene count cutoff is 1, assuming atleast 1 count is REAL
# Time to party
GOI1.cells <- length(which(FetchData(ThisWayIsTotallyMentalButItWorks, vars = GOI1) > GOI1.cutoff))
GOI2.cells <- length(which(FetchData(ThisWayIsTotallyMentalButItWorks, vars = GOI2) > GOI2.cutoff))
GOI1_GOI2.cells <- length(which(FetchData(ThisWayIsTotallyMentalButItWorks, vars = GOI2) > GOI2.cutoff & FetchData(ThisWayIsTotallyMentalButItWorks, vars = GOI1) > GOI1.cutoff))
all.cells.incluster <- table(ThisWayIsTotallyMentalButItWorks@active.ident)
GOI1.cells/all.cells.incluster*100 # Percentage of cells in Beta that express GOI1
GOI2.cells/all.cells.incluster*100 #Percentage of cells in Beta that express GOI2
GOI1_GOI2.cells/all.cells.incluster*100 #Percentage of cells in Beta that co-express GOI1 + GOI2
Second edit: pulling data from two metadata slots, simultaneously
# Problem: Find cells coexpressing a certain gene in males
# Solution:
# Here we assume the two genes are INS and PDX1, I want to know the percentage of Beta cells that are INS+/PDX1+
# Subset your cluster of interest for as an example I am subsetting a cluster called 'beta'
# The following creates a seurat object of only the cluster 'beta' pulling data from only 'male' donors
# Remember both of these metadata already exit in the seurat object, so be careful when creating your metadata substructure
betacells <- subset(pancreas.integrated, subset = (CellType == "Beta") & (sex == "male"))
# Point your new cluster towards the object you will use to perform calculations.
# I like doing this because otherwise, you have to write lengths of redundant code
# Also I'm like really lazy with code
ThisWayIsTotallyMentalButItWorks <- betacells
GOI1 <- 'PDX1' #you will have to name your first gene here, im choosing PDX1 as an example
GOI2 <- 'INS' #you will have to name your second gene here, im choosing INS as an example
GOI1.cutoff <- 1 #Assumption: gene count cutoff is 1, assuming atleast 1 count is REAL
GOI2.cutoff <- 1 #Assumption: gene count cutoff is 1, assuming atleast 1 count is REAL
# Time to party
GOI1.cells <- length(which(FetchData(ThisWayIsTotallyMentalButItWorks, vars = GOI1) > GOI1.cutoff))
GOI2.cells <- length(which(FetchData(ThisWayIsTotallyMentalButItWorks, vars = GOI2) > GOI2.cutoff))
GOI1_GOI2.cells <- length(which(FetchData(ThisWayIsTotallyMentalButItWorks, vars = GOI2) > GOI2.cutoff & FetchData(ThisWayIsTotallyMentalButItWorks, vars = GOI1) > GOI1.cutoff))
all.cells.incluster <- table(ThisWayIsTotallyMentalButItWorks@active.ident)
GOI1.cells/all.cells.incluster*100 # Percentage of cells in Beta that express GOI1
GOI2.cells/all.cells.incluster*100 #Percentage of cells in Beta that express GOI2
GOI1_GOI2.cells/all.cells.incluster*100 #Percentage of cells in Beta that co-express GOI1 + GOI2
Enjoy!
🐲
The code below also does the trick:
apply(object@assays$RNA@counts > 0, MARGIN = 1, FUN = 'sum') / ncol(object)
One way to get counts of cells with values > 0 for each cluster for a gene of interest is to use the data generated by calling VlnPlot.
For example:
library(dplyr) p <- VlnPlot(object = my_object, features =c("my_gene")) p$data %>% group_by(ident) %>% summarize(counts = sum(my_gene, na.rm = TRUE))
While not an elegant solution, does the trick for at least one gene of interest at time across clusters.
FWIW, the function posted by @Ryan-Zhu worked great for me to get total number of cells per "orig.ident" condition for my gene of interest.
Probably the easiest way that worked out for me in combination with "table" function
p <- VlnPlot(object = my_object, features =c("my_gene"), split.by="group")
table(p$data$my_gene>0.1,p$data$split) # 0.1 here is the threshold for expression
What about the following:
length(WhichCells(seurat.object, expression = PITX2 > 0))
@Ryan-Zhu God bless you, bro, what a great function! My question echos that of @Eli1360 how can we find the proportion of cells expressing two genes? Say INS+/PDX1+?
Edit: I thought id share with you all what I'm doing. Maybe it will help someone 😄 This is what I'm doing right now (disclaimer I do things the hard way because I suck at writing code, but it works so who cares):
# Problem: Find cells coexpressing a certain gene # Solution: # Here we assume the two genes are INS and PDX1, I want to know the percentage of Beta cells that are INS+/PDX1+ # Subset your cluster of interest for as an example I am subsetting a cluster called 'beta' # The following creates a seurat object of only the cluster 'beta' betacells <- subset(pancreas.integrated, idents = c("Beta")) # Point your new cluster towards the object you will use to perform calculations. # I like doing this because otherwise, you have to write lengths of redundant code # Also I'm like really lazy with code ThisWayIsTotallyMentalButItWorks <- betacells GOI1 <- 'PDX1' #you will have to name your first gene here, im choosing PDX1 as an example GOI2 <- 'INS' #you will have to name your second gene here, im choosing INS as an example GOI1.cutoff <- 1 #Assumption: gene count cutoff is 1, assuming atleast 1 count is REAL GOI2.cutoff <- 1 #Assumption: gene count cutoff is 1, assuming atleast 1 count is REAL # Time to party GOI1.cells <- length(which(FetchData(ThisWayIsTotallyMentalButItWorks, vars = GOI1) > GOI1.cutoff)) GOI2.cells <- length(which(FetchData(ThisWayIsTotallyMentalButItWorks, vars = GOI2) > GOI2.cutoff)) GOI1_GOI2.cells <- length(which(FetchData(ThisWayIsTotallyMentalButItWorks, vars = GOI2) > GOI2.cutoff & FetchData(ThisWayIsTotallyMentalButItWorks, vars = GOI1) > GOI1.cutoff)) all.cells.incluster <- table(ThisWayIsTotallyMentalButItWorks@active.ident) GOI1.cells/all.cells.incluster*100 # Percentage of cells in Beta that express GOI1 GOI2.cells/all.cells.incluster*100 #Percentage of cells in Beta that express GOI2 GOI1_GOI2.cells/all.cells.incluster*100 #Percentage of cells in Beta that co-express GOI1 + GOI2
Second edit: pulling data from two metadata slots, simultaneously
# Problem: Find cells coexpressing a certain gene in males # Solution: # Here we assume the two genes are INS and PDX1, I want to know the percentage of Beta cells that are INS+/PDX1+ # Subset your cluster of interest for as an example I am subsetting a cluster called 'beta' # The following creates a seurat object of only the cluster 'beta' pulling data from only 'male' donors # Remember both of these metadata already exit in the seurat object, so be careful when creating your metadata substructure betacells <- subset(pancreas.integrated, subset = (CellType == "Beta") & (sex == "male")) # Point your new cluster towards the object you will use to perform calculations. # I like doing this because otherwise, you have to write lengths of redundant code # Also I'm like really lazy with code ThisWayIsTotallyMentalButItWorks <- betacells GOI1 <- 'PDX1' #you will have to name your first gene here, im choosing PDX1 as an example GOI2 <- 'INS' #you will have to name your second gene here, im choosing INS as an example GOI1.cutoff <- 1 #Assumption: gene count cutoff is 1, assuming atleast 1 count is REAL GOI2.cutoff <- 1 #Assumption: gene count cutoff is 1, assuming atleast 1 count is REAL # Time to party GOI1.cells <- length(which(FetchData(ThisWayIsTotallyMentalButItWorks, vars = GOI1) > GOI1.cutoff)) GOI2.cells <- length(which(FetchData(ThisWayIsTotallyMentalButItWorks, vars = GOI2) > GOI2.cutoff)) GOI1_GOI2.cells <- length(which(FetchData(ThisWayIsTotallyMentalButItWorks, vars = GOI2) > GOI2.cutoff & FetchData(ThisWayIsTotallyMentalButItWorks, vars = GOI1) > GOI1.cutoff)) all.cells.incluster <- table(ThisWayIsTotallyMentalButItWorks@active.ident) GOI1.cells/all.cells.incluster*100 # Percentage of cells in Beta that express GOI1 GOI2.cells/all.cells.incluster*100 #Percentage of cells in Beta that express GOI2 GOI1_GOI2.cells/all.cells.incluster*100 #Percentage of cells in Beta that co-express GOI1 + GOI2
Enjoy!
🐲 Thanks a lot for sharing the code. I have a question after using this code to calculate the percentage of cells co-expressing two genes, I got a percentage > 100% for some of my values. Does someone know what may be the reason?
Thanks!
It's a bit of bringing something back from the dead, but I wanted to share some expansion I did on @Ryan-Zhu 's super useful code. I took what he wrote, added the ability to check other assays, data slots (datatype), set a threshold, and return a prettier table when checking multiple genes. Likely it is a hacky way to do it, but it gets the job done. Feel free to suggest improvements! I still end up using @Dragonmasterx87 's method for co-expression.
PrctCellExpringGene <- function(object, genes, group.by = "all", assay = "RNA", datatype = "counts", threshold = 0){
if(group.by == "all"){
prct = unlist(lapply(genes,calc_helper, object=object, assay = assay, datatype=datatype, threshold=threshold))
result = data.frame(Markers = genes, Cell_proportion = prct)
return(result)
}
else{
list = SplitObject(object, group.by)
factors = names(list)
results = lapply(list, PrctCellExpringGene, genes=genes, assay = assay, datatype=datatype, threshold=threshold)
results %>% reduce(full_join, by="Markers") %>% select(any_of("Markers")) -> genelist
results %>% reduce(full_join, by="Markers") %>% select(!any_of("Markers")) %>% "colnames<-"(names(results)) -> percentages
combined <- cbind(genelist,percentages)
return(combined)
}
}
calc_helper <- function(object,genes,assay,datatype,threshold){
counts = slot(object[[assay]],datatype)
ncells = ncol(counts)
if(genes %in% row.names(counts)){
round((sum(counts[genes,]>threshold)/ncells)*100,1)
}else{return(NA)}
}
So this
PrctCellExpringGene(data, genes = c("TOP2A","MKI67","B2M","GAPDH"),
group.by = "SCT_snn_res.0.1", assay = "RNA", datatype = "counts", threshold = 0)
Will give
Enjoy! Maybe the Seurat folks will drag in our script(s) some day!
@Ryan-Zhu Hi, How can I get the percent of cells expressing two genes (double positive cells for example KRT18+/APOE+)?
did you ever received a reply on that? I'm interested :)
I actually cooked up some code for that at some point! Warning: I wrote it for Seurat-v4, I don't know how nicely it will play with v5.
PrctCellCoExprGene <- function(object, gene1, gene2, group.by = "all", assay = "RNA", datatype = "counts", threshold = 0){
if(group.by == "all"){
prct = unlist(coex_helper(object=object,gene1=gene1,gene2=gene2,assay=assay,datatype=datatype,threshold=threshold))
result = data.frame(Status = c(paste0(gene1,"-only.pos"),paste0(gene2,"-only.pos"),"Double.pos","Double.neg"), Cell_proportion = prct)
return(result)
}
else{
list = SplitObject(object, group.by)
factors = names(list)
results = lapply(list, PrctCellCoExprGene, gene1=gene1, gene2=gene2, assay=assay, datatype=datatype, threshold=threshold)
results %>% reduce(full_join, by="Status") %>% select(any_of("Status")) -> statuslist
results %>% reduce(full_join, by="Status") %>% select(!any_of("Status")) %>% "colnames<-"(names(results)) -> percentages
combined <- cbind(statuslist,percentages)
return(combined)
}
}
coex_helper <- function(object,gene1,gene2,assay,datatype,threshold){
counts = slot(object[[assay]],datatype)
ncells = ncol(counts)
if(gene1 %in% row.names(counts)){
if(gene2 %in% row.names(counts)){
gene1.pos <- round((sum(counts[gene1,] > threshold & counts[gene2,] <= threshold)/ncells)*100,1)
gene2.pos <- round((sum(counts[gene2,] > threshold & counts[gene1,] <= threshold)/ncells)*100,1)
double.pos <- round((sum(counts[gene1,] > threshold & counts[gene2,] > threshold)/ncells)*100,1)
double.neg <- round((sum(counts[gene1,] <= threshold & counts[gene2,] <= threshold)/ncells)*100,1)
c(gene1.pos,gene2.pos,double.pos,double.neg)
}else{return(NA)}
}else{return(NA)}
}
So usage would be:
PrctCellCoExprGene(object, gene1, gene2, group.by = "all", assay = "RNA", datatype = "counts", threshold = 0)
I hope it helps you!
@bluedominion , sorry for the late reply. In the meantime, I tried something else, but I will definitely keep your code for the next time someone asks me that!!! that is much more straightforward to what I did. Thanks!
Hi Seurat team,
I was wondering if you could show me how can I calculate the number of cells expressing the given genes.
For example, I have the violin plot for 3 different genes below. I wanted to add the information of the number of cells expressing those genes for each cluster.
https://imgur.com/a/WzMjq
How can I do that? Are there any function in seurat which can do this?
Thank you very much.