Closed ncborcherding closed 4 years ago
Hi, Nick:
Is it possible to just pull the mouse gene sets from http://bioinf.wehi.edu.au/MSigDB/ (e.g. http://bioinf.wehi.edu.au/MSigDB/v7.1/), so no gene name conversion is needed? Thanks for developing the escape package.
Actually this should be easier now - I implemented a change that pulls the correct gene names (no ortholog conversion needed). However, the genes are still returned fully capitalized - WNT5A as opposed to Wnt5a. The new fix just entails converting to the correct gene nomenclature capitalization scheme - the following loop should work and is substantially faster:
## Lower-case every gene ID, then re-capitalize the first letter
for (i in seq_along(GS)) {
    GS[[i]]@geneIds <- tolower(GS[[i]]@geneIds)
    for (j in seq_len(length(GS[[i]]@geneIds))) {
        GS[[i]]@geneIds[j] <- paste(toupper(substring(GS[[i]]@geneIds[j], 1, 1)),
                                    substring(GS[[i]]@geneIds[j], 2),
                                    sep = "", collapse = " ")
    }
}
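For larger collections, the same capitalization fix can be vectorized to avoid the inner loop. A minimal sketch on a plain character vector (`to_mouse_case` is a hypothetical helper name, not part of escape):

```r
## Vectorized sketch of the capitalization fix: lower-case everything,
## then upper-case only the first letter of each gene ID.
to_mouse_case <- function(x) {
    x <- tolower(x)
    paste0(toupper(substring(x, 1, 1)), substring(x, 2))
}

to_mouse_case(c("WNT5A", "TP53"))  # "Wnt5a" "Tp53"
```

Applied per set, `GS[[i]]@geneIds <- to_mouse_case(GS[[i]]@geneIds)` replaces the inner j loop.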
Based on the Bioconductor submission guidelines, I unfortunately cannot pull the data from links or store the gene sets in the package (too big).
Hi, Nick:
I first pull the mouse gene symbols from msigdbr, then use your code to make the GS. I found this works faster for me:
m_df <- msigdbr(species = "Mus musculus", category = "H")
gs <- unique(m_df$gs_name)
ls <- list()
for (i in seq_along(gs)) {
    tmp <- m_df[m_df$gs_name == gs[i], ]
    tmp <- unique(tmp$gene_symbol)
    tmp <- GeneSet(tmp, setName = paste(gs[i]))
    ls[[i]] <- tmp
}
gs_mouse_H <- GeneSetCollection(ls)
Any suggestions for working with a large GS? Even when I downsampled the Seurat object, it still took a long time to generate the heatmap. Is there a way to filter the most significant gene sets to plot? Is performPCA() meant to do that? I got this error when I ran performPCA():
PCA <- performPCA(enriched = ES2, groups = c("Type", "cluster"))
Error in prcomp.default(input, scale. = TRUE): cannot rescale a constant/zero column to unit variance
Glad you found something that works and thanks for providing code - hopefully other users will benefit from that.
In terms of large GS/enrichment objects - you are correct, the greater the number of gene sets or single cells, the more computational time it will take. One thing you can do is summarise the enrichment values - like the mean per cluster - and plot those instead of every cell.
In terms of the PCA error, this is not an issue with the performPCA() function - rather, you have one or more gene sets with zero variance, so filtering for variance or significance might be a good idea.
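A minimal sketch of that variance filter, assuming ES2 is a cells-by-gene-sets table of enrichment scores (the toy column names and values here are made up for illustration):

```r
## Toy enrichment table: one informative column, one constant column
ES2 <- data.frame(HALLMARK_A = c(0.1, 0.5, 0.9),
                  HALLMARK_B = c(0.2, 0.2, 0.2))

## Drop gene sets with zero variance before running PCA
vars <- vapply(ES2, var, numeric(1))
ES2.filtered <- ES2[, vars > 0, drop = FALSE]
colnames(ES2.filtered)  # "HALLMARK_A"
```

The filtered object can then be passed to performPCA() without triggering the prcomp() rescaling error.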
Thanks, Nick. Will you please elaborate on the first point? Do you have an example of computing the mean values for each cluster? And which plot do you use - still a heatmap? Thanks for your time.
No problem at all:
So I will combine the output of enrichIt() with my meta data using AddMetaData() and then:
library(dplyr)
meta <- seurat.object[[]] # Pulls meta data
## Use dplyr to summarise across your grouping vectors
output <- meta %>%
    group_by(Type, cluster) %>%
    summarise(across(where(is.numeric), median)) # or across() your columns of interest
You can feed the results into a ggplot with geom_tile(), or make a matrix out of it and use a heatmap R package - like pheatmap or ComplexHeatmap.
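A base-R sketch of the matrix step, using a made-up summarised table (grouping columns Type and cluster, followed by median enrichment scores; the gene-set names are placeholders):

```r
## Toy summarised output: grouping columns plus two gene-set score columns
output <- data.frame(Type       = c("A", "A", "B"),
                     cluster    = c(1, 2, 1),
                     HALLMARK_X = c(0.3, 0.6, 0.1),
                     HALLMARK_Y = c(0.8, 0.2, 0.5))

## Build a groups-by-gene-sets matrix for pheatmap/ComplexHeatmap
mat <- as.matrix(output[, setdiff(names(output), c("Type", "cluster"))])
rownames(mat) <- paste(output$Type, output$cluster, sep = "_")
# pheatmap::pheatmap(mat)  # or ComplexHeatmap::Heatmap(mat)
```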
Some of you might notice the difference in pulling, say, the hallmark library for human (50 gene sets) versus mouse (11 gene sets). This is a product of the annotation available in the msigdbr R package. A fix would be to use the human library and then convert to the mouse nomenclature:
Although this is feasible for smaller libraries, the code is a little slow for larger gene set collections. Credit for the conversion goes to radiaj.