Issue Getting Mouse Gene Sets

ncborcherding commented 4 years ago

Some of you might notice the difference in pulling say the hallmark library for human (50 gene sets) versus mouse (11 gene sets). This is a product of the annotation available in the msigdb R package. A fix would be to use the human library and then convert to the mouse nomenclature:


##Load Escape
library(escape)

#Load Human Gene Sets
GS <- getGeneSets(library = "H")

##Get ensembl annotation
library("biomaRt")
human = useMart("ensembl", dataset = "hsapiens_gene_ensembl")
mouse = useMart("ensembl", dataset = "mmusculus_gene_ensembl")

##Define a function for converting human to mouse genes
convertHumanGeneList <- function(x){

genesV2 = getLDS(attributes = c("hgnc_symbol"), filters = "hgnc_symbol", values = x , mart = human, attributesL = c("mgi_symbol"), martL = mouse, uniqueRows=T)

humanx <- unique(genesV2[, 2])

return(humanx)
}

##loop the function from above
for (i in seq_along(GS)) {
    GS[[i]]@geneIds <- convertHumanGeneList(GS[[i]]@geneIds)
}

Although this is feasible for smaller libraries - the code is a little slow for larger gene set collections. Credit for the conversion goes to radiaj.

gt7901b commented 4 years ago

Hi, Nick:

Is it possible to just pull mouse gene sets from http://bioinf.wehi.edu.au/MSigDB/? http://bioinf.wehi.edu.au/MSigDB/v7.1/ so no need to do gene name conversion. thanks for developing the Escape package.

ncborcherding commented 4 years ago

Actually this should be easier now - implemented a change that should pull the correct gene names (no ortholog conversion needed). However, the gene returns still as capitalized - so WNT5A as opposed to Wnt5a. The new fix just entails converting to the correct gene nomenclature capitalizations scheme - the following loop should work substantially faster:

for (i in seq_along(GS)) {
    GS[[i]]@geneIds <- tolower(GS[[i]]@geneIds)
    for (j in seq_len(length(GS[[i]]@geneIds))) {
        GS[[i]]@geneIds[j] <- paste(toupper(substring(GS[[i]]@geneIds[j], 1,1)), substring(GS[[i]]@geneIds[j], 2),
      sep="", collapse=" ")
    }
}

Just based on the submission guidelines for bioconductor, I unfortunately cannot use links for pulling the data or have the gene sets stored in the data (too big).

gt7901b commented 4 years ago

HI, Nick:

I first pull out mouse gene symbols from msigdbr , then use your code to make GS. I found this works faster for me. m_df = msigdbr(species = "Mus musculus", category = "H") gs <- unique(m_df$gs_name) ls <- list() for (i in seq_along(gs)) { tmp <- m_df[m_df$gs_name == gs[i],] tmp <- tmp$gene_symbol tmp <- unique(tmp) tmp <- GeneSet(tmp, setName=paste(gs[i])) ls[[i]] <- tmp } gs_mouse_H <- GeneSetCollection(ls)

Any suggestion for working with large GS? even when I downsampled seurat object, it still took long time to generate the heatmap. Is there a way to filter the most significant ones to plot? Is performPCA meant to do that? but I got this error when I do performPCA

PCA <- performPCA(enriched = ES2, groups = c("Type", "cluster"))

Error in prcomp.default(input, scale. = TRUE): cannot rescale a constant/zero column to unit variance Traceback:

performPCA(enriched = ES2, groups = c("celltype", "cluster"))
prcomp(input, scale. = TRUE)
prcomp.default(input, scale. = TRUE)
stop("cannot rescale a constant/zero column to unit variance")

ncborcherding commented 4 years ago

Glad you found something that works and thanks for providing code - hopefully other users will benefit from that.

In terms of large GS/enrichment objects - you are correct, the greater the number of gene sets or single-cells the more comoputational time it will take.

One idea is to look at the median values for you cell type and cluster - so after attaching to the meta data, use dplyr to group by the categories of interest and then summarise(across()). This will substantially reduce the graphing time for heatmaps. (This is how I have been looking at the data first).
Another idea, is to get significant gene sets and subset for significance or interest or both.

In terms of the PCA error, this is not an issue with the performPCA() function, but rather you have one or more gene sets with zero variance - so filtering for variance or significance might be a good idea.

gt7901b commented 4 years ago

thanks Nick. Will you please elaborate on the first point? Do you have an example to do the mean values for each cluster? and which plot do you use, still heatmap? thanks for your time

ncborcherding commented 4 years ago

No problem at all:

So I will combine the the output of enrichIt() with my meta data using AddMetaData() and then:

meta <- seurat.object[[]] #Pulls meta data

##Use dplyr to summarise across using your grouping vectors
output <- meta %>%
  group_by("Type", "cluster")) %>%
  summarise(across(#Columns of interest), median))

You can feed the results into a ggplot with geom_tile or make a matrix out of it and use a heatmap R package - like pheatmap or complex heatmap.

ncborcherding / escape

Issue Getting Mouse Gene Sets #4