hms-dbmi / scde

R package for analyzing single-cell RNA-seq data
http://pklab.med.harvard.edu/scde

Error in names(rl) #36

Closed ghost closed 7 years ago

ghost commented 7 years ago

I am attempting to process a dataset of about 1200 cells with scde. This works fine with a small subset of the data, but with the full dataset I get an error, which I presume is related to the cross-fitting failing on a relatively small number of pairs. Below is the call and resulting error, as well as the output of traceback() and sessionInfo():

> o.ifm <- scde.error.models(counts=counts(eset),
+     groups=eset$Zone,
+     n.cores=40,
+     min.nonfailed = 30,
+     threshold.segmentation=TRUE,
+     save.crossfit.plots=FALSE,
+     save.model.plots=FALSE,
+     verbose=1)

cross-fitting cells.
number of pairs:  12403
reducing to a random sample of  5000  pairs
number of pairs:  9180
reducing to a random sample of  5000  pairs
number of pairs:  9591
reducing to a random sample of  5000  pairs
number of pairs:  8778
reducing to a random sample of  5000  pairs
number of pairs:  7260
reducing to a random sample of  5000  pairs
number of pairs:  10153
reducing to a random sample of  5000  pairs
number of pairs:  7626
reducing to a random sample of  5000  pairs
number of pairs:  9591
reducing to a random sample of  5000  pairs
number of pairs:  5671
reducing to a random sample of  5000  pairs
total number of pairs:  49956
cross-fitting 49956 pairs:
Error in names(rl) <- apply(cl, 2, paste, collapse = ".vs.") :
  'names' attribute [49956] must be the same length as the vector [49920]

>traceback()
2: calculate.crossfit.models(counts, groups, n.cores = n.cores,
       threshold.segmentation = threshold.segmentation, min.count.threshold = min.count.threshold,
       zero.lambda = zero.lambda, max.pairs = max.pairs, save.plots = save.crossfit.plots,
       min.pairs.per.cell = min.pairs.per.cell, verbose = verbose)
1: scde.error.models(counts = counts(eset), groups = eset$Zone,
       n.cores = 40, min.nonfailed = 30, threshold.segmentation = TRUE,
       save.crossfit.plots = FALSE, save.model.plots = FALSE, verbose = 1)

>sessionInfo()
R version 3.3.2 (2016-10-31)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: CentOS Linux 7 (Core)

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C
 [9] LC_ADDRESS=C               LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] parallel  stats     graphics  grDevices utils     datasets  methods
[8] base

other attached packages:
[1] scater_1.2.0        ggplot2_2.2.1       Biobase_2.34.0
[4] BiocGenerics_0.20.0 scde_1.99.4         flexmix_2.3-13
[7] lattice_0.20-34

loaded via a namespace (and not attached):
 [1] viridis_0.3.4             edgeR_3.16.5
 [3] splines_3.3.2             shiny_1.0.0
 [5] assertthat_0.1            stats4_3.3.2
 [7] vipor_0.4.4               RSQLite_1.1-2
 [9] quantreg_5.29             limma_3.30.8
[11] digest_0.6.11             RColorBrewer_1.1-2
[13] minqa_1.2.4               colorspace_1.3-2
[15] htmltools_0.3.5           httpuv_1.3.3
[17] Matrix_1.2-7.1            plyr_1.8.4
[19] XML_3.98-1.5              biomaRt_2.30.0
[21] SparseM_1.74              zlibbioc_1.20.0
[23] xtable_1.8-2              scales_0.4.1
[25] brew_1.0-6                BiocParallel_1.8.1
[27] lme4_1.1-12               MatrixModels_0.4-1
[29] tibble_1.2                mgcv_1.8-16
[31] IRanges_2.8.1             car_2.1-4
[33] Lmoments_1.2-3            RMTstat_0.3
[35] nnet_7.3-12               lazyeval_0.2.0
[37] pbkrtest_0.4-6            magrittr_1.5
[39] distillery_1.0-2          mime_0.5
[41] memoise_1.0.0             nlme_3.1-128
[43] MASS_7.3-45               RcppArmadillo_0.7.600.1.0
[45] beeswarm_0.2.3            shinydashboard_0.5.3
[47] Cairo_1.5-9               Rook_1.1-1
[49] tools_3.3.2               data.table_1.10.0
[51] extRemes_2.0-8            matrixStats_0.51.0
[53] stringr_1.1.0             S4Vectors_0.12.1
[55] munsell_0.4.3             locfit_1.5-9.1
[57] AnnotationDbi_1.36.1      pcaMethods_1.66.0
[59] rhdf5_2.18.0              grid_3.3.2
[61] RCurl_1.95-4.8            nloptr_1.0.4
[63] tximport_1.2.0            rjson_0.2.15
[65] bitops_1.0-6              gtable_0.2.0
[67] DBI_0.5-1                 reshape2_1.4.2
[69] R6_2.2.0                  gridExtra_2.2.1
[71] dplyr_0.5.0               modeltools_0.2-21
[73] stringi_1.1.2             ggbeeswarm_0.5.3
[75] Rcpp_0.12.9

Any pointers on addressing this? I could work on matching the names attribute to exclude failing pairs, but I am not clear on what the downstream consequences of the failed pairs are, or whether failing pairs suggest a deeper problem with the data and a need to remove those cells.

JEFworks commented 7 years ago

Can you just double-check that the colnames and ncol of counts(eset) match the names and length of eset$Zone?

ghost commented 7 years ago

Sure thing:

> length(eset$Zone)
[1] 1200
> length(eset$Zone) == ncol(counts(eset))
[1] TRUE
>sum(names(eset$Zone) == colnames(counts(eset)))
[1] 1200

I am working on distilling the dataset down to a minimal set that reproduces the problem.

ghost commented 7 years ago

I figured out the problem. I was running out of memory. The particular node I was running on had 28 cores and 256GB of RAM, which is not enough to copy the dataset into each subprocess across 28 cores. Thus, the fitting on some cell pairs was failing, and that was corrupting the dimensions of the resulting dataset. Since these were sub-processes spawned by mclapply, STDERR and STDOUT disappeared into cyberspace, so I was not aware of the out-of-memory failures until my sysadmin brought to my attention that there were a lot of memory errors showing up in dmesg.
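For anyone who hits the same wall: errors inside children forked by mclapply do not stop the parent, so failures like this stay silent. A minimal sketch (not part of scde, just base R parallel) of how to check a result list for chunks that died, assuming res came back from mclapply:

library(parallel)
res <- mclapply(1:8, function(i) sqrt(i), mc.cores = 4)
# A child killed by the OOM killer typically comes back as NULL; an error raised
# inside a child comes back as a "try-error" object instead of stopping the parent.
bad <- vapply(res, function(x) is.null(x) || inherits(x, "try-error"), logical(1))
if (any(bad)) warning(sum(bad), " of ", length(res), " chunks failed silently")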

I reran scde.error.models with n.cores=10. Then a different error came up, along the lines of "Error: long vectors not supported yet". It appears that with only 10 cores and 1200 cells, the resulting data chunks are too large for mclapply to send back (see the Stack Overflow discussion). So I made one change to the papply function in the scde package:

--- functions.R.old     2017-01-26 14:54:31.256338963 -0500
+++ functions.R 2017-01-26 06:26:14.296675165 -0500
@@ -6051,7 +6051,7 @@
   if(n.cores>1) {
     # bplapply implementation
     if(is.element("parallel", installed.packages()[,1])) {
-      mclapply(...,mc.cores=n.cores)
+      mclapply(...,mc.cores=n.cores,mc.preschedule=F)
     } else {
       # last resort
       bplapply(... , BPPARAM = MulticoreParam(workers = n.cores))

Then the fitting completed without any errors. Turning off prescheduling makes things quite a bit slower, but not as slow as not being able to complete at all =)

I suppose that since single-cell datasets are only going to get larger, you could consider adding an mc.preschedule option to the top-level functions at some point?
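For what it's worth, here is a rough sketch of what that might look like, based on the papply branch shown in the diff above (the surrounding structure and the single-core fallback are my guess, not the actual scde code):

papply <- function(..., n.cores = parallel::detectCores(), mc.preschedule = TRUE) {
  if (n.cores > 1) {
    if (is.element("parallel", installed.packages()[, 1])) {
      # pass the caller's choice through instead of hard-coding it
      parallel::mclapply(..., mc.cores = n.cores, mc.preschedule = mc.preschedule)
    } else {
      # last resort
      BiocParallel::bplapply(..., BPPARAM = BiocParallel::MulticoreParam(workers = n.cores))
    }
  } else {
    lapply(...)
  }
}

Defaulting mc.preschedule = TRUE would keep the current behaviour for small datasets while letting large runs opt out.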

JEFworks commented 7 years ago

Ah thanks for the excellent investigative work 👍

We are currently in the process of revamping all of the error modeling to handle these larger and larger single-cell datasets, so hopefully these bugs will be resolved shortly!