satijalab / seurat

R toolkit for single cell genomics
http://www.satijalab.org/seurat
Other
2.29k stars 914 forks source link

Count number of cells expressing a gene #371

Closed benjytan88 closed 6 years ago

benjytan88 commented 6 years ago

Hi Seurat team,

I was wondering if you could show me how can I calculate the number of cells expressing the given genes.

For example, I have the violin plot for 3 different genes below. I wanted to add the information of the number of cells expressing those genes for each cluster.

https://imgur.com/a/WzMjq

How can I do that? Are there any function in seurat which can do this?

Thank you very much.

satijalab commented 6 years ago

We don't include a native function for this in Seurat, though it would be easy to do this on your own if you were interested. You could pull out your gene of interest (for example object@data["Elk1",]), and count the non-zero entries.

benjytan88 commented 6 years ago

I see. Yes, I am doing that but to count them by clusters it's quite the hassle. That's fine then, I will stick to my current method. Thank you for your reply.

brunetc commented 5 years ago

Hi, I've been trying to find a way to get the percentage (number) of cells expressing specific genes. Should I retrieve it from the features list? Is there a native function to get these numbers in Seurat v3.0? Thanks!

brianpenghe commented 5 years ago

Hi, I've been trying to find a way to get the percentage (number) of cells expressing specific genes. Should I retrieve it from the features list? Is there a native function to get these numbers in Seurat v3.0? Thanks!

I was using their FindMarkers function with a loose threshold to calculate pct. The only issue is it may not give you all the genes you want

Ryan-Zhu commented 5 years ago

@brianpenghe @brunetc I wrote a small script based on functions in Seurat V3.0. It calculates percentage of total cells expressing a gene (raw counts > 0) by metadata groups. Let me know if you have suggestions. example:

PrctCellExpringGene(seurat_object ,genes =c("geneA","geneB"), group.by = "sample.ID") Markers Cell_proportion MS1.1 geneA 0.022727273 MS1.2 geneB 0.045337995 MS2.1 geneA 0.000000000 MS2.2 geneB 0.030033951 MS3.1 geneA 0.001821125 MS3.2 geneB 0.035410765

# updated 1/31/2020 to accommodate V3.1
# updated 2/4/2020 to output "NA" for genes not detected in certain subgroups

PrctCellExpringGene <- function(object, genes, group.by = "all"){
    if(group.by == "all"){
        prct = unlist(lapply(genes,calc_helper, object=object))
        result = data.frame(Markers = genes, Cell_proportion = prct)
        return(result)
    }

    else{        
        list = SplitObject(object, group.by)
        factors = names(list)

        results = lapply(list, PrctCellExpringGene, genes=genes)
        for(i in 1:length(factors)){
        results[[i]]$Feature = factors[i]
        }
        combined = do.call("rbind", results)
        return(combined)
    }
}

calc_helper <- function(object,genes){
    counts = object[['RNA']]@counts
    ncells = ncol(counts)
    if(genes %in% row.names(counts)){
    sum(counts[genes,]>0)/ncells
    }else{return(NA)}
}
rcalandrelli commented 5 years ago

You can retrieve the number of cells expressing a gene my_gene from a Seurat object my_object in this way:

sum(GetAssayData(object = my_object, slot = "data")[my_gene,]>0)

The percentage of cells expressing that gene would be:

sum(GetAssayData(object = my_object, slot = "data")[my_gene,]>0)/nrow(my_object@meta.data)

The object my_object@meta.data can be used to eventually extract cells (rownames) with a specific Identity or belonging to a specific cluster if you performed that analysis before, etc. After you have obtained a subset of cells of interest (my_cells) you can obtain the same as above by passing them as colnames:

sum(GetAssayData(object = my_object, slot = "data")[my_gene,my_cells]>0)
sum(GetAssayData(object = my_object, slot = "data")[my_gene,my_cells]>0)/length(my_cells)
dkcoxie commented 5 years ago

One way to get counts of cells with values > 0 for each cluster for a gene of interest is to use the data generated by calling VlnPlot.

For example:

library(dplyr)
p <- VlnPlot(object = my_object, features =c("my_gene"))
p$data %>% group_by(ident) %>% summarize(counts = sum(my_gene, na.rm = TRUE))

While not an elegant solution, does the trick for at least one gene of interest at time across clusters.

FWIW, the function posted by @Ryan-Zhu worked great for me to get total number of cells per "orig.ident" condition for my gene of interest.

Tushar-87 commented 4 years ago

Hi, I've been trying to find a way to get the percentage (number) of cells expressing specific genes. Should I retrieve it from the features list? Is there a native function to get these numbers in Seurat v3.0? Thanks!

I was using their FindMarkers function with a loose threshold to calculate pct. The only issue is it may not give you all the genes you want

@brianpenghe This is quite a clever trick. I am wondering why you think that it may not give all the genes. I tried it with entirely permissible thresholds: min.pct = 0, logfc.threshold = 0 and got data for 16k genes. However, total number of genes in the two groups which were compared is 20k. Note: The find marker task was performed on a sub cluster which may not have expression values for all the genes present in the original un-clustered data.

Yale73 commented 4 years ago

@Ryan-Zhu I like the function you wrote. It will be much better if we can group.by="seurat_clusters" Thanks! Yale

Ryan-Zhu commented 4 years ago

@Ryan-Zhu I like the function you wrote. It will be much better if we can group.by="seurat_clusters" Thanks! Yale

group.by works with any columns that are present in your metadata. So if your YourObject$seurat_clusters is not NULL you can definitely do group.by="seurat_clusters".

Yale73 commented 4 years ago

@Ryan-Zhu Thanks for your quick reply. When we run the function using "seurat_clusters" we receive the following error. Even if we run Idents(Object) <- Object$seurat_clusters prior to running:

`Error in object[[group.by]] == x : comparison of these types is not implemented In addition: Warning message: In cells %||% colnames(x = object) :

Error in object[[group.by]] == x : comparison of these types is not implemented`

As verification we have "seurat_clusters" as a meta data column:

`> table(Object$seurat_clusters)

1 2 3 4 5 6 7 8 9 10 11 12 13 7928 7055 5367 5189 5007 4371 3921 1507 1322 709 672 571 511 `

Any suggestion?

Thanks so much!

Yale

Ryan-Zhu commented 4 years ago

@happar73 The error basically says object[[group.by]] and x are not the same datatypes and are not comparable. Please try my updated version.

Yale73 commented 4 years ago

@Ryan-Zhu The updated one works. Thanks for your quick reply and thanks for your great help. I really appreciate it.

Many thanks, Yale

cpcook1 commented 4 years ago

@Ryan-Zhu Thanks so much for this I work with Yale. This has saved us a great deal of work and it is a great function.

Is it possible to create a version that will report 2 meta data tables? For instance we have metadata column for "stim" (VEH, DRUG1, DRUG2, DRUG3, DRUG4). Does the function have the ability to report a table of cluster + stim?

Ryan-Zhu commented 4 years ago

@cpcook1 You can first split your object by "stim" and then run the function on the sub-objects individually.

stim.list <- SplitObject(YourObject, "stim")
results <- lapply(stim.list, PrctCellExpringGene, genes=genes, group.by="seurat_clusters")
Eli1360 commented 4 years ago

@Ryan-Zhu Hi, How can I get the percent of cells expressing two genes (double positive cells for example KRT18+/APOE+)?

raveancic commented 3 years ago

@brianpenghe @brunetc I wrote a small script based on functions in Seurat V3.0. It calculates percentage of total cells expressing a gene (raw counts > 0) by metadata groups. Let me know if you have suggestions. example:

PrctCellExpringGene(seurat_object ,genes =c("geneA","geneB"), group.by = "sample.ID") Markers Cell_proportion MS1.1 geneA 0.022727273 MS1.2 geneB 0.045337995 MS2.1 geneA 0.000000000 MS2.2 geneB 0.030033951 MS3.1 geneA 0.001821125 MS3.2 geneB 0.035410765

# updated 1/31/2020 to accommodate V3.1
# updated 2/4/2020 to output "NA" for genes not detected in certain subgroups

PrctCellExpringGene <- function(object, genes, group.by = "all"){
    if(group.by == "all"){
        prct = unlist(lapply(genes,calc_helper, object=object))
        result = data.frame(Markers = genes, Cell_proportion = prct)
        return(result)
    }

    else{        
        list = SplitObject(object, group.by)
        factors = names(list)

        results = lapply(list, PrctCellExpringGene, genes=genes)
        for(i in 1:length(factors)){
        results[[i]]$Feature = factors[i]
        }
        combined = do.call("rbind", results)
        return(combined)
    }
}

calc_helper <- function(object,genes){
    counts = object[['RNA']]@counts
    ncells = ncol(counts)
    if(genes %in% row.names(counts)){
    sum(counts[genes,]>0)/ncells
    }else{return(NA)}
}

Dear Ryan-Zhu thank you for sharing this, you should include this function in a paper so you will get cited. I am sure already many papers have exploited it ;).

All the best, Ale

Dragonmasterx87 commented 3 years ago

@Ryan-Zhu God bless you, bro, what a great function! My question echos that of @Eli1360 how can we find the proportion of cells expressing two genes? Say INS+/PDX1+?

Edit: I thought id share with you all what I'm doing. Maybe it will help someone 😄 This is what I'm doing right now (disclaimer I do things the hard way because I suck at writing code, but it works so who cares):

# Problem: Find cells coexpressing a certain gene

# Solution:
# Here we assume the two genes are INS and PDX1, I want to know the percentage of Beta cells that are INS+/PDX1+
# Subset your cluster of interest for as an example I am subsetting a cluster called 'beta'
# The following creates a seurat object of only the cluster 'beta'
betacells <- subset(pancreas.integrated, idents = c("Beta"))

# Point your new cluster towards the object you will use to perform calculations.
# I like doing this because otherwise, you have to write lengths of redundant code
# Also I'm like really lazy with code
ThisWayIsTotallyMentalButItWorks <- betacells
GOI1 <- 'PDX1' #you will have to name your first gene here, im choosing PDX1 as an example
GOI2 <- 'INS' #you will have to name your second gene here, im choosing INS as an example
GOI1.cutoff <- 1 #Assumption: gene count cutoff is 1, assuming atleast 1 count is REAL
GOI2.cutoff <- 1 #Assumption: gene count cutoff is 1, assuming atleast 1 count is REAL

# Time to party
GOI1.cells <- length(which(FetchData(ThisWayIsTotallyMentalButItWorks, vars = GOI1) > GOI1.cutoff))
GOI2.cells <- length(which(FetchData(ThisWayIsTotallyMentalButItWorks, vars = GOI2) > GOI2.cutoff))
GOI1_GOI2.cells <- length(which(FetchData(ThisWayIsTotallyMentalButItWorks, vars = GOI2) > GOI2.cutoff & FetchData(ThisWayIsTotallyMentalButItWorks, vars = GOI1) > GOI1.cutoff))
all.cells.incluster <- table(ThisWayIsTotallyMentalButItWorks@active.ident)
GOI1.cells/all.cells.incluster*100 # Percentage of cells in Beta that express GOI1
GOI2.cells/all.cells.incluster*100 #Percentage of cells in Beta that express GOI2
GOI1_GOI2.cells/all.cells.incluster*100 #Percentage of cells in Beta that co-express GOI1 + GOI2

Second edit: pulling data from two metadata slots, simultaneously

# Problem: Find cells coexpressing a certain gene in males

# Solution:
# Here we assume the two genes are INS and PDX1, I want to know the percentage of Beta cells that are INS+/PDX1+
# Subset your cluster of interest for as an example I am subsetting a cluster called 'beta'
# The following creates a seurat object of only the cluster 'beta' pulling data from only 'male' donors
# Remember both of these metadata already exit in the seurat object, so be careful when creating your metadata substructure
betacells <- subset(pancreas.integrated, subset = (CellType == "Beta") & (sex == "male"))

# Point your new cluster towards the object you will use to perform calculations.
# I like doing this because otherwise, you have to write lengths of redundant code
# Also I'm like really lazy with code
ThisWayIsTotallyMentalButItWorks <- betacells
GOI1 <- 'PDX1' #you will have to name your first gene here, im choosing PDX1 as an example
GOI2 <- 'INS' #you will have to name your second gene here, im choosing INS as an example
GOI1.cutoff <- 1 #Assumption: gene count cutoff is 1, assuming atleast 1 count is REAL
GOI2.cutoff <- 1 #Assumption: gene count cutoff is 1, assuming atleast 1 count is REAL

# Time to party
GOI1.cells <- length(which(FetchData(ThisWayIsTotallyMentalButItWorks, vars = GOI1) > GOI1.cutoff))
GOI2.cells <- length(which(FetchData(ThisWayIsTotallyMentalButItWorks, vars = GOI2) > GOI2.cutoff))
GOI1_GOI2.cells <- length(which(FetchData(ThisWayIsTotallyMentalButItWorks, vars = GOI2) > GOI2.cutoff & FetchData(ThisWayIsTotallyMentalButItWorks, vars = GOI1) > GOI1.cutoff))
all.cells.incluster <- table(ThisWayIsTotallyMentalButItWorks@active.ident)
GOI1.cells/all.cells.incluster*100 # Percentage of cells in Beta that express GOI1
GOI2.cells/all.cells.incluster*100 #Percentage of cells in Beta that express GOI2
GOI1_GOI2.cells/all.cells.incluster*100 #Percentage of cells in Beta that co-express GOI1 + GOI2

Enjoy!

🐲

prullens commented 3 years ago

The code below also does the trick: apply(object@assays$RNA@counts > 0, MARGIN = 1, FUN = 'sum') / ncol(object)

RajneeshSrivastava commented 2 years ago

One way to get counts of cells with values > 0 for each cluster for a gene of interest is to use the data generated by calling VlnPlot.

For example:

library(dplyr)
p <- VlnPlot(object = my_object, features =c("my_gene"))
p$data %>% group_by(ident) %>% summarize(counts = sum(my_gene, na.rm = TRUE))

While not an elegant solution, does the trick for at least one gene of interest at time across clusters.

FWIW, the function posted by @Ryan-Zhu worked great for me to get total number of cells per "orig.ident" condition for my gene of interest.

Probably the easiest way that worked out for me in combination with "table" function

p <- VlnPlot(object = my_object, features =c("my_gene"), split.by="group")
table(p$data$my_gene>0.1,p$data$split)  # 0.1 here is the threshold for expression
joaquinhenriquez commented 2 years ago

What about the following:

length(WhichCells(seurat.object, expression = PITX2 > 0))

shaoxunwang88 commented 2 years ago

@Ryan-Zhu God bless you, bro, what a great function! My question echos that of @Eli1360 how can we find the proportion of cells expressing two genes? Say INS+/PDX1+?

Edit: I thought id share with you all what I'm doing. Maybe it will help someone 😄 This is what I'm doing right now (disclaimer I do things the hard way because I suck at writing code, but it works so who cares):

# Problem: Find cells coexpressing a certain gene

# Solution:
# Here we assume the two genes are INS and PDX1, I want to know the percentage of Beta cells that are INS+/PDX1+
# Subset your cluster of interest for as an example I am subsetting a cluster called 'beta'
# The following creates a seurat object of only the cluster 'beta'
betacells <- subset(pancreas.integrated, idents = c("Beta"))

# Point your new cluster towards the object you will use to perform calculations.
# I like doing this because otherwise, you have to write lengths of redundant code
# Also I'm like really lazy with code
ThisWayIsTotallyMentalButItWorks <- betacells
GOI1 <- 'PDX1' #you will have to name your first gene here, im choosing PDX1 as an example
GOI2 <- 'INS' #you will have to name your second gene here, im choosing INS as an example
GOI1.cutoff <- 1 #Assumption: gene count cutoff is 1, assuming atleast 1 count is REAL
GOI2.cutoff <- 1 #Assumption: gene count cutoff is 1, assuming atleast 1 count is REAL

# Time to party
GOI1.cells <- length(which(FetchData(ThisWayIsTotallyMentalButItWorks, vars = GOI1) > GOI1.cutoff))
GOI2.cells <- length(which(FetchData(ThisWayIsTotallyMentalButItWorks, vars = GOI2) > GOI2.cutoff))
GOI1_GOI2.cells <- length(which(FetchData(ThisWayIsTotallyMentalButItWorks, vars = GOI2) > GOI2.cutoff & FetchData(ThisWayIsTotallyMentalButItWorks, vars = GOI1) > GOI1.cutoff))
all.cells.incluster <- table(ThisWayIsTotallyMentalButItWorks@active.ident)
GOI1.cells/all.cells.incluster*100 # Percentage of cells in Beta that express GOI1
GOI2.cells/all.cells.incluster*100 #Percentage of cells in Beta that express GOI2
GOI1_GOI2.cells/all.cells.incluster*100 #Percentage of cells in Beta that co-express GOI1 + GOI2

Second edit: pulling data from two metadata slots, simultaneously

# Problem: Find cells coexpressing a certain gene in males

# Solution:
# Here we assume the two genes are INS and PDX1, I want to know the percentage of Beta cells that are INS+/PDX1+
# Subset your cluster of interest for as an example I am subsetting a cluster called 'beta'
# The following creates a seurat object of only the cluster 'beta' pulling data from only 'male' donors
# Remember both of these metadata already exit in the seurat object, so be careful when creating your metadata substructure
betacells <- subset(pancreas.integrated, subset = (CellType == "Beta") & (sex == "male"))

# Point your new cluster towards the object you will use to perform calculations.
# I like doing this because otherwise, you have to write lengths of redundant code
# Also I'm like really lazy with code
ThisWayIsTotallyMentalButItWorks <- betacells
GOI1 <- 'PDX1' #you will have to name your first gene here, im choosing PDX1 as an example
GOI2 <- 'INS' #you will have to name your second gene here, im choosing INS as an example
GOI1.cutoff <- 1 #Assumption: gene count cutoff is 1, assuming atleast 1 count is REAL
GOI2.cutoff <- 1 #Assumption: gene count cutoff is 1, assuming atleast 1 count is REAL

# Time to party
GOI1.cells <- length(which(FetchData(ThisWayIsTotallyMentalButItWorks, vars = GOI1) > GOI1.cutoff))
GOI2.cells <- length(which(FetchData(ThisWayIsTotallyMentalButItWorks, vars = GOI2) > GOI2.cutoff))
GOI1_GOI2.cells <- length(which(FetchData(ThisWayIsTotallyMentalButItWorks, vars = GOI2) > GOI2.cutoff & FetchData(ThisWayIsTotallyMentalButItWorks, vars = GOI1) > GOI1.cutoff))
all.cells.incluster <- table(ThisWayIsTotallyMentalButItWorks@active.ident)
GOI1.cells/all.cells.incluster*100 # Percentage of cells in Beta that express GOI1
GOI2.cells/all.cells.incluster*100 #Percentage of cells in Beta that express GOI2
GOI1_GOI2.cells/all.cells.incluster*100 #Percentage of cells in Beta that co-express GOI1 + GOI2

Enjoy!

🐲 Thanks a lot for sharing the code. I have a question after using this code to calculate the percentage of cells co-expressing two genes, I got a percentage > 100% for some of my values. Does someone know what may be the reason?

Thanks!

bluedominion commented 1 year ago

It's a bit of bringing something back from the dead, but I wanted to share some expansion I did on @Ryan-Zhu 's super useful code. I took what he wrote, added the ability to check other assays, data slots (datatype), set a threshold, and return a prettier table when checking multiple genes. Likely it is a hacky way to do it, but it gets the job done. Feel free to suggest improvements! I still end up using @Dragonmasterx87 's method for co-expression.

PrctCellExpringGene <- function(object, genes, group.by = "all", assay = "RNA", datatype = "counts", threshold = 0){
    if(group.by == "all"){
        prct = unlist(lapply(genes,calc_helper, object=object, assay = assay, datatype=datatype, threshold=threshold))
        result = data.frame(Markers = genes, Cell_proportion = prct)
        return(result)
    }

    else{        
        list = SplitObject(object, group.by)
        factors = names(list)
        results = lapply(list, PrctCellExpringGene, genes=genes, assay = assay, datatype=datatype, threshold=threshold)
        results %>% reduce(full_join, by="Markers") %>% select(any_of("Markers")) -> genelist
        results %>% reduce(full_join, by="Markers") %>% select(!any_of("Markers")) %>% "colnames<-"(names(results)) -> percentages
        combined <- cbind(genelist,percentages)
        return(combined)
    }
}

calc_helper <- function(object,genes,assay,datatype,threshold){
    counts = slot(object[[assay]],datatype)
    ncells = ncol(counts)
    if(genes %in% row.names(counts)){
    round((sum(counts[genes,]>threshold)/ncells)*100,1)
    }else{return(NA)}
}

So this

PrctCellExpringGene(data, genes = c("TOP2A","MKI67","B2M","GAPDH"), 
                    group.by = "SCT_snn_res.0.1", assay = "RNA", datatype = "counts", threshold = 0)

Will give image

Enjoy! Maybe the Seurat folks will drag in our script(s) some day!

Elo-mars commented 6 months ago

@Ryan-Zhu Hi, How can I get the percent of cells expressing two genes (double positive cells for example KRT18+/APOE+)?

did you ever received a reply on that? I'm interested :)

bluedominion commented 6 months ago

I actually cooked up some code for that at some point! Warning: I wrote it for Seurat-v4, I don't know how nicely it will play with v5.

PrctCellCoExprGene <- function(object, gene1, gene2, group.by = "all", assay = "RNA", datatype = "counts", threshold = 0){
    if(group.by == "all"){
        prct = unlist(coex_helper(object=object,gene1=gene1,gene2=gene2,assay=assay,datatype=datatype,threshold=threshold))
        result = data.frame(Status = c(paste0(gene1,"-only.pos"),paste0(gene2,"-only.pos"),"Double.pos","Double.neg"), Cell_proportion = prct)
        return(result)
    }

    else{        
        list = SplitObject(object, group.by)
        factors = names(list)
        results = lapply(list, PrctCellCoExprGene, gene1=gene1, gene2=gene2, assay=assay, datatype=datatype, threshold=threshold)
        results %>% reduce(full_join, by="Status") %>% select(any_of("Status")) -> statuslist
        results %>% reduce(full_join, by="Status") %>% select(!any_of("Status")) %>% "colnames<-"(names(results)) -> percentages
        combined <- cbind(statuslist,percentages)
        return(combined)
    }
}

coex_helper <- function(object,gene1,gene2,assay,datatype,threshold){
    counts = slot(object[[assay]],datatype)
    ncells = ncol(counts)
    if(gene1 %in% row.names(counts)){
        if(gene2 %in% row.names(counts)){
        gene1.pos <- round((sum(counts[gene1,] > threshold & counts[gene2,] <= threshold)/ncells)*100,1)
        gene2.pos <- round((sum(counts[gene2,] > threshold & counts[gene1,] <= threshold)/ncells)*100,1)
        double.pos <- round((sum(counts[gene1,] > threshold & counts[gene2,] > threshold)/ncells)*100,1)
        double.neg <- round((sum(counts[gene1,] <= threshold & counts[gene2,] <= threshold)/ncells)*100,1)
            c(gene1.pos,gene2.pos,double.pos,double.neg)
        }else{return(NA)}
        }else{return(NA)}
    }

So usage would be: PrctCellCoExprGene(object, gene1, gene2, group.by = "all", assay = "RNA", datatype = "counts", threshold = 0)

I hope it helps you!

Elo-mars commented 6 months ago

@bluedominion , sorry for the late reply. In the meantime, I tried something else, but I will definitely keep your code for the next time someone asks me that!!! that is much more straightforward to what I did. Thanks!