Closed: SimoniMD closed this issue 2 years ago
Hey Simoni,
So actually enrichIt() has been having issues with the parallelization via BiocParallel (which I am working on, and this might be a cause of the issue). The approach UCell uses is similar to the ranking underlying the ssGSEA method, but with several assumptions adapted to single-cell sequencing. For instance, it only ranks the top 1500 genes instead of all genes (which contributes to its speed). It is also a faster implementation of AUCell, with the code cleaned up for performance. The pct differences you noted may stem from these assumptions, but the results should be relatively similar to standard ssGSEA.
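If you want to check that agreement on your own data, a quick sketch along these lines should work. This is not part of escape itself; it assumes the seurat_small object and gene sets (here called GS) used elsewhere in this thread, and that the default enrichIt() method can be requested explicitly as "ssGSEA".

library(escape)

# score the same object with both methods
es.ssgsea <- enrichIt(obj = seurat_small, gene.sets = GS, method = "ssGSEA", groups = 1000, cores = 2)
es.ucell  <- enrichIt(obj = seurat_small, gene.sets = GS, method = "UCell",  groups = 1000, cores = 2)

# align cells, then compute a per-gene-set Spearman correlation between the two sets of scores
es.ucell <- es.ucell[rownames(es.ssgsea), ]
shared   <- intersect(colnames(es.ssgsea), colnames(es.ucell))
cors     <- sapply(shared, function(gs) cor(es.ssgsea[, gs], es.ucell[, gs], method = "spearman"))
summary(cors)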
I think in the end, the differences between the methods are probably not that important. In my experience, ssGSEA and UCell have a relatively high correlation. In fact, with the recent BiocParallel issues, I have been running UCell exclusively in my pipelines. The differences in your samples might impact your results, but that is also true for differential gene expression. I think the fact that you have biological replicates is far more important for the robustness of either gene set enrichment or differential expression.
Not knowing how many cells you have in total, subsampling might be a good way to go. Another option is to bootstrap: randomly take 1000 cells (maybe ~30 times), run the statistical test of your choice each time, and cross-reference the iterations for the commonly altered gene sets (see the sketch below). But to your first point: as long as the subsampling is random, you should be fine.
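As a rough illustration of that bootstrap idea (not part of escape), something along these lines should work. "seurat_obj", "Pathway_X", and "condition" are placeholders for your Seurat object, an enrichment score already added to its meta.data, and your two-level grouping variable.

library(Seurat)

set.seed(42)
n.iter  <- 30
n.cells <- 1000
pvals   <- numeric(n.iter)

for (i in seq_len(n.iter)) {
  # draw a random subset of cells and subset the object
  cells.i <- sample(colnames(seurat_obj), n.cells)
  sub     <- subset(seurat_obj, cells = cells.i)

  # statistical test of choice on the enrichment score, e.g. a Wilcoxon test
  pvals[i] <- wilcox.test(Pathway_X ~ condition, data = sub@meta.data)$p.value
}

# cross-reference the iterations: fraction of subsamples calling this gene set significant
mean(pvals < 0.05)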
I am going to close this issue as it does not pertain to issues with the package itself. But feel free to continue to comment and ask questions.
Nick
Great, thanks. I'll probably just continue to use UCell then. It seems to be blazing along.
One error I encountered with it though for one of the pathways:
ES.seurat.c5bp <- enrichIt(obj = seurat_small, method = "UCell", gene.sets = GS.c5bp, groups = 1000, cores = 2)
Error in calculate_Uscore(m, features = features, maxRank = maxRank, chunk.size = chunk.size, :
One or more signatures contain more genes than maxRank parameter. Increase maxRank parameter or make shorter signatures
I don't see a place to increase the parameter in question?
This is an error stemming from UCell; it looks like you have a gene set larger than the default maxRank (i.e., the number of genes the function will rank, which defaults to 1500). I have not set up passing arguments through to the UCell parameters yet (still experimental).
So it sounds like I won't be able to analyze this gene set on this object then?
Not as of right now. You can either remove the large gene sets with > 1500 genes or run UCell directly and adjust maxRank (see the sketch below).
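For reference, a minimal sketch of running UCell directly, assuming UCell's ScoreSignatures_UCell() function and the object/gene-set names from this thread; the maxRank value of 2000 is just an example.

library(UCell)
library(Seurat)

# convert the GeneSetCollection from getGeneSets() into a plain named list of gene vectors
GS.list <- list()
for (x in seq_along(GS.c5bp)) {
  GS.list[[GS.c5bp[[x]]@setName]] <- GS.c5bp[[x]]@geneIds
}

# pull the expression matrix; UCell is rank-based, so raw or normalized counts both work
expr.mat <- GetAssayData(seurat_small, slot = "counts")

# score directly with UCell, raising maxRank above the default 1500
scores <- ScoreSignatures_UCell(expr.mat, features = GS.list, maxRank = 2000)

# rows are cells, columns are signatures; attach back to the object if desired
seurat_small <- AddMetaData(seurat_small, as.data.frame(scores))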
Is there a parameter to limit the size of the gene sets somewhere, or a way to remove the large one(s) manually? I don't see it in the getGeneSets() function.
Assuming your gene sets are "GS", the following loop should be able to filter the gene sets with > 1500 genes.
# collect the size of each gene set
size <- NULL
for (x in seq_along(GS)) {
  size <- c(size, length(GS[[x]]@geneIds))
}

# drop any gene set at or above the 1500-gene default maxRank
remove <- unname(which(size >= 1500))
if (length(remove) != 0) {
  GS <- GS[-remove]
}
Yup, that worked. You're a genius, sir.
Hello again! Two questions, roughly related... regarding method differences with enrichIt().
So just like you said, UCell is much faster. I'm having a hard time getting through my script with my Seurat objects without R crashing when using the regular method, though the regular method can work if I do one object at a time and have patience.
I've been comparing the default method vs. the UCell method, using the pct difference between my conditions, and I've gotten different results. Just wondering what your thoughts are on this, and whether my samples may have an impact on it. (To clarify, the top 30 are roughly the same, just always in a different order.)
For background, I have 7 human samples of 2 different conditions (4 MSE, 3 FT). NONE are biological replicates (the unfortunate aspect of human samples). All 7 are different, but taken at certain timepoints so they should be roughly similar.
Just wondering if you could shed some light on the differences, and if the point that my samples are all different makes an impact?
Follow-up question: if you prefer the default statistical method, I'm imagining that subsetting 1k cells from the original Seurat object will make the function run faster, but then it kind of defeats the purpose of assigning a score to every cell (although I guess it would be a random 1k, so it should be representative)? Thoughts?