Can you just double check that the colnames and ncol of `counts(eset)` match the names and length of `eset$Zone`?
Sure thing:

```r
> length(eset$Zone)
[1] 1200
> length(eset$Zone) == ncol(counts(eset))
[1] TRUE
> sum(names(eset$Zone) == colnames(counts(eset)))
[1] 1200
```
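As an even stricter sanity check, `identical()` catches missing names or a length mismatch in one shot (a small sketch using the same objects):

```r
# TRUE only if both name vectors exist and agree element-by-element, in order
identical(names(eset$Zone), colnames(counts(eset)))
```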
I am working on distilling the dataset down to a minimal set that reproduces the problem.
I figured out the problem: I was running out of memory. The particular node I was running on had 28 cores and 256 GB of RAM, which is not enough to copy the dataset for each subprocess on 28 cores. As a result, the fitting on some cell pairs was failing, and that was corrupting the dimensions of the resulting dataset. Since these were subprocesses spawned by mclapply, STDERR and STDOUT disappeared into cyberspace, so I was not aware of the out-of-memory failures until my sys admin brought it to my attention that a lot of memory errors were showing up in dmesg.
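A quick back-of-the-envelope estimate can flag this before launching a run. This is just a sketch using the `eset` from above; it assumes the worst case where each forked worker ends up with its own full copy of the expression matrix once copy-on-write pages are touched:

```r
# Rough upper bound on memory needed by mclapply with forked workers
mat_bytes <- as.numeric(object.size(counts(eset)))
n_workers <- 28
total_gb  <- mat_bytes * n_workers / 1024^3
total_gb  # compare against the node's available RAM (e.g. 256 GB)
```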
I reran `scde.error.models` with `n.cores=10`. Then a different error came up, along the lines of `Error: long vectors not supported yet`. It appears that with only 10 cores and 1200 cells, the resulting data chunks are too large for `mclapply` to send back (see the Stack Overflow discussion). So I made one change to the `papply` function in the scde package:
```diff
--- functions.R.old 2017-01-26 14:54:31.256338963 -0500
+++ functions.R 2017-01-26 06:26:14.296675165 -0500
@@ -6051,7 +6051,7 @@
     if(n.cores>1) {
         # bplapply implementation
         if(is.element("parallel", installed.packages()[,1])) {
-            mclapply(...,mc.cores=n.cores)
+            mclapply(...,mc.cores=n.cores,mc.preschedule=F)
         } else {
             # last resort
             bplapply(... , BPPARAM = MulticoreParam(workers = n.cores))
```
Then the fitting completed without any errors. Turning off prescheduling makes things quite a bit slower, but not as slow as not being able to complete without errors at all =)
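As I understand it, with the default `mc.preschedule = TRUE` the 1200 jobs are split into `n.cores` chunks up front, and each worker serializes one big combined result back to the parent, which is what overflows; with `mc.preschedule = FALSE` each job is forked and collected individually, so every returned payload stays small. A minimal sketch of the two modes (`slow_fit` is a hypothetical stand-in for the per-cell fitting work):

```r
library(parallel)

# Hypothetical per-cell fitting function standing in for scde's crossfit work
slow_fit <- function(i) rnorm(1000)

# Default prescheduling: jobs are grouped into n.cores chunks, and each
# worker sends back one large serialized object per chunk
res1 <- mclapply(seq_len(1200), slow_fit, mc.cores = 10)

# Without prescheduling: one fork per job, so results come back one at a
# time and each serialized payload stays small
res2 <- mclapply(seq_len(1200), slow_fit, mc.cores = 10,
                 mc.preschedule = FALSE)
```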
I suppose since single-cell datasets are going to keep getting larger and larger, you could consider adding an `mc.preschedule` option to the top-level functions at some point?
Ah thanks for the excellent investigative work 👍
We are actually currently in the process of revamping all the error modeling in order to handle these larger and larger single cell datasets so hopefully these bugs will be replaced shortly!
I am attempting to process a dataset of about 1200 cells with scde. This works fine with a small subset of the data, but with the full dataset I get an error, which I presume is related to the cross fit failing on a relatively small number of pairs. Below is the call and resulting error, as well as the output of `traceback()` and `sessionInfo()`. Any pointers on addressing this? I could work on matching the `names` attribute to exclude failing pairs, but I am not clear on what the downstream consequences of the failed pairs are, or whether failing pairs suggests a deeper problem with the data and the need to remove those cells.
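For reference, the call is roughly of the following shape (a sketch only; the argument values here are illustrative placeholders, not my exact invocation):

```r
library(scde)

# Illustrative only: groups, plotting flags, and n.cores may differ
# from the real call that produced the error
err.models <- scde.error.models(counts = counts(eset),
                                groups = eset$Zone,
                                n.cores = 28,
                                save.model.plots = FALSE,
                                verbose = 1)
```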