Issue with scmap-cluster function output/ Variable inconsistent assignment

finjen commented 3 years ago

Dear Dr. Kiselev, Dr. Hemberg,

I have been encountering an issue when running the scmap-cluster pipeline and would like to ask you for some input on this matter. It seems that independent runs result in variable and inconsistent assignments. There seem to be two main sets of results, one of which seems fair based on prior knowledge about the composition of the query dataset. The second set of results comprises assignments that seem to reflect an imperfect match rather than a complete mess. While it might still be to some degree possible that the assignment that I think is problematic is actually correct, the issue remains of the variability in the mapping results. I should stress that the "wrong" mapping occurs much more frequently than the one I feel would be correct, if I run scmap from scratch repeatedly, suggesting it may be right after all. Still, the variability troubles me.

I may have narrowed down the steps that produce the variable outcomes to the reference normalization (see script below). I tried increasing the number of selected features, thinking this might mitigate the impact of the previous steps, but that didn't seem to be the case.

sf2 <- 2^rnorm(ncol(sce2))
sf2 <- sf2/mean(sf2)
normcounts(sce2) <- t(t(counts(sce2))/sf2)

counts(sce2) <- normcounts(sce2)
logcounts(sce2) <- log2(normcounts(sce2)+1)
rowData(sce2)$feature_symbol <- rownames(sce2)

I would be happy to receive some suggestions from your side.
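A quick way to check whether the normalization script above contributes to the run-to-run variability: `rnorm()` draws new random numbers on every call, so the size factors `sf2` change between sessions unless the random seed is fixed. The sketch below uses base R only, with a small made-up matrix standing in for `counts(sce2)`. The helper name `normalize_random_sf`, the toy values and the seed `42` are assumptions for illustration, not part of the original script.

```r
# Toy stand-in for counts(sce2): 5 genes x 4 cells (made-up values).
toy_counts <- matrix(1:20, nrow = 5,
                     dimnames = list(paste0("gene", 1:5), paste0("cell", 1:4)))

# Same steps as the script above, wrapped in a hypothetical helper.
normalize_random_sf <- function(counts_mat, seed = NULL) {
  if (!is.null(seed)) set.seed(seed)  # fix the RNG state when a seed is given
  sf <- 2^rnorm(ncol(counts_mat))     # random size factors, one per cell
  sf <- sf / mean(sf)                 # rescale so the factors average to 1
  norm <- t(t(counts_mat) / sf)       # divide each column (cell) by its factor
  log2(norm + 1)                      # log-transform with a pseudocount of 1
}

run_a <- normalize_random_sf(toy_counts)             # unseeded
run_b <- normalize_random_sf(toy_counts)             # unseeded, fresh draw
run_c <- normalize_random_sf(toy_counts, seed = 42)  # seeded
run_d <- normalize_random_sf(toy_counts, seed = 42)  # seeded, same draw

identical(run_a, run_b)  # unseeded runs draw different size factors
identical(run_c, run_d)  # seeded runs reproduce each other
```

If the two seeded runs agree while the unseeded ones do not, calling `set.seed()` once before the normalization should at least make this step reproducible. Whether it also stabilizes the downstream mapping would still need to be checked.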

wikiselev commented 3 years ago

Hi, not sure that the steps in your script are stochastic... Also they do not include any scmap functions.

Regarding scmap-cluster function, if I remember correctly it is stochastic and therefore it will indeed give you different results for each run. One way to get a stable result is to average different cell assignments by taking the most frequent one after multiple runs.
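The "most frequent assignment" idea can be sketched in base R as a simple majority vote. The example assumes the labels from each run have already been collected into a character matrix with one row per cell and one column per run. The matrix contents and the helper name `most_frequent` are made up for illustration, and how the labels are extracted from the scmap output is left out here.

```r
# Hypothetical input: one row per cell, one column per run; entries are
# the labels assigned in that run (made-up values).
assignments <- matrix(
  c("T cell",     "T cell",  "B cell",
    "B cell",     "B cell",  "B cell",
    "unassigned", "NK cell", "NK cell"),
  nrow = 3, byrow = TRUE,
  dimnames = list(paste0("cell", 1:3), paste0("run", 1:3))
)

# Most frequent label in a character vector. Ties go to the label that
# sorts first, because table() orders its names and which.max() returns
# the first maximum.
most_frequent <- function(labels) {
  counts <- table(labels)
  names(counts)[which.max(counts)]
}

# Apply the vote row by row to get one consensus label per cell.
consensus <- apply(assignments, 1, most_frequent)
consensus
```

A possible refinement is to also record the fraction of runs that agree with the consensus label, so that cells with unstable assignments can be flagged rather than silently resolved.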

Hope this helps!

mhemberg commented 3 years ago

Agree with what Vlad was saying, although if I understand correctly, the issue is that it most often converges to the "incorrect" solution, and the question is how to make it converge to the "correct" solution more frequently. Changing the parameters (not just the number of features) could help, and another option is to use scmap-cell instead. Although it is slower, it could potentially yield better results.

hemberg-lab / scmap #27