statOmics / tradeSeq

TRAjectory-based Differential Expression analysis for SEQuencing data
Other
238 stars 27 forks source link

Error parallelizing fitGAM #43

Closed trebbiano closed 4 years ago

trebbiano commented 4 years ago

Hi, I'm running v1.1.16 in R3.6.1 in a CentOS environment (more info below). I get the following error running fitGAM in parallel mode:

sce <- fitGAM(counts = counts, sds = crv1, nknots = 15, verbose = TRUE, BPPARAM = BPPARAM, parallel = TRUE)
Adding 23989 jobs ...
Submitting 23989 jobs in 12 chunks using cluster functions 'Multicore' ...
Error in .reduceResultsList(ids, fun, ..., missing.val = missing.val,  :
  All jobs must be have been successfully computed
In addition: Warning messages:
1: In .findKnots(nknots, pseudotime, wSamp) :
  Impossible to place a knot at all endpoints.Increase the number of knots to avoid this issue.
2: In parallel::mccollect(jobs, wait = FALSE, timeout = timeout) :
  2 parallel jobs did not deliver results

The same command runs fine without parallel = TRUE but the cpu time is prohibitively long. The error seems to come from batchtools but could be related to how tradeSeq prepares the jobs. Thanks in advance for looking into this!

Relevant upstream code:

register(BatchtoolsParam(workers=12))
BPPARAM <- BiocParallel::bpparam()
BPPARAM
class: BatchtoolsParam
  bpisup: FALSE; bpnworkers: 12; bptasks: 0; bpjobname: BPJOB
  bplog: FALSE; bpthreshold: INFO; bpstopOnError: TRUE
  bpRNGseed: NA; bptimeout: 2592000; bpprogressbar: FALSE
  bpexportglobals: TRUE
  bplogdir: NA
  bpresultdir: NA
  cluster type: multicore
  template: NA
  registryargs:
    file.dir: /gpfs/home/trebbiano/file67874bb6408
    work.dir: getwd()
    packages: character(0)
    namespaces: character(0)
    source: character(0)
    load: character(0)
    make.default: FALSE
  saveregistry: FALSE
  resources:
> sessionInfo()
R version 3.6.1 (2019-07-05)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: CentOS release 6.7 (Final)

Matrix products: default
BLAS:   /opt/applications/R/3.6.1/gnu/lib64/R/lib/libRblas.so
LAPACK: /opt/applications/R/3.6.1/gnu/lib64/R/lib/libRlapack.so

locale:
 [1] LC_CTYPE=en_US.utf8       LC_NUMERIC=C
 [3] LC_TIME=en_US.utf8        LC_COLLATE=C
 [5] LC_MONETARY=en_US.utf8    LC_MESSAGES=en_US.utf8
 [7] LC_PAPER=en_US.utf8       LC_NAME=C
 [9] LC_ADDRESS=C              LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.utf8 LC_IDENTIFICATION=C

attached base packages:
[1] parallel  stats     graphics  grDevices utils     datasets  methods
[8] base

other attached packages:
[1] BiocParallel_1.20.1 tradeSeq_1.1.16     bigmemory_4.5.36
[4] Biobase_2.46.0      BiocGenerics_0.32.0 slingshot_1.4.0
[7] princurve_2.1.4     Seurat_3.1.4

loaded via a namespace (and not attached):
  [1] reticulate_1.14             tidyselect_1.0.0
  [3] RSQLite_2.2.0               AnnotationDbi_1.48.0
  [5] htmlwidgets_1.5.1           grid_3.6.1
  [7] combinat_0.0-8              docopt_0.6.1
  [9] Rtsne_0.15                  RNeXML_2.4.3
 [11] base64url_1.4               munsell_0.5.0
 [13] codetools_0.2-16            mutoss_0.1-12
 [15] ica_1.0-2                   future_1.16.0
 [17] withr_2.1.2                 colorspace_1.4-1
 [19] fastICA_1.2-2               uuid_0.1-4
 [21] zinbwave_1.8.0              stats4_3.6.1
 [23] SingleCellExperiment_1.8.0  ROCR_1.0-7
 [25] gbRd_0.4-11                 listenv_0.8.0
 [27] NMF_0.22.0                  Rdpack_0.11-1
 [29] slam_0.1-47                 GenomeInfoDbData_1.2.2
 [31] mnormt_1.5-6                bit64_0.9-7
 [33] pheatmap_1.0.12             rhdf5_2.30.1
 [35] batchtools_0.9.12           vctrs_0.2.4
 [37] TH.data_1.0-10              R6_2.4.1
 [39] doParallel_1.0.15           GenomeInfoDb_1.22.0
 [41] rsvd_1.0.3                  VGAM_1.1-2
 [43] locfit_1.5-9.1              bitops_1.0-6
 [45] DelayedArray_0.12.2         assertthat_0.2.1
 [47] scales_1.1.0                multcomp_1.4-12
 [49] gtable_0.3.0                phylobase_0.8.10
 [51] npsurv_0.4-0                globals_0.12.5
 [53] sandwich_2.5-1              rlang_0.4.5
 [55] genefilter_1.68.0           splines_3.6.1
 [57] lazyeval_0.2.2              brew_1.0-6
 [59] checkmate_2.0.0             reshape2_1.4.3
 [61] backports_1.1.5             tools_3.6.1
[63] gridBase_0.4-7              ggplot2_3.3.0
 [65] gplots_3.0.3                RColorBrewer_1.1-2
 [67] ggridges_0.5.2              TFisher_0.2.0
 [69] Rcpp_1.0.4                  plyr_1.8.6
 [71] progress_1.2.2              zlibbioc_1.32.0
 [73] purrr_0.3.3                 RCurl_1.98-1.1
 [75] densityClust_0.3            prettyunits_1.1.1
 [77] pbapply_1.4-2               viridis_0.5.1
 [79] cowplot_1.0.0               S4Vectors_0.24.3
 [81] zoo_1.8-7                   SummarizedExperiment_1.16.1
 [83] ggrepel_0.8.2               cluster_2.1.0
 [85] fs_1.3.2                    magrittr_1.5
 [87] data.table_1.12.8           RSpectra_0.16-0
 [89] lmtest_0.9-37               RANN_2.6.1
 [91] mvtnorm_1.1-0               fitdistrplus_1.0-14
 [93] matrixStats_0.56.0          hms_0.5.3
 [95] patchwork_1.0.0             lsei_1.2-0
 [97] xtable_1.8-4                XML_3.99-0.3
 [99] sparsesvd_0.2               IRanges_2.20.2
[101] gridExtra_2.3               HSMMSingleCell_1.6.0
[103] compiler_3.6.1              tibble_2.1.3
[105] KernSmooth_2.23-16          crayon_1.3.4
[107] htmltools_0.4.0             mgcv_1.8-31
[109] tidyr_1.0.2                 howmany_0.3-1
[111] DBI_1.1.0                   MASS_7.3-51.5
[113] rappdirs_0.3.1              Matrix_1.2-18
[115] ade4_1.7-15                 gdata_2.18.0
[117] metap_1.3                   igraph_1.2.4.2
[119] GenomicRanges_1.38.0        pkgconfig_2.0.3
[121] bigmemory.sri_0.1.3         rncl_0.8.4
[123] sn_1.5-5                    registry_0.5-1
[125] numDeriv_2016.8-1.1         locfdr_1.1-8
[127] plotly_4.9.2                xml2_1.2.5
[129] foreach_1.4.8               annotate_1.64.0
[131] rngtools_1.5                pkgmaker_0.31
[133] multtest_2.40.0             XVector_0.26.0
[135] bibtex_0.4.2.2              stringr_1.4.0
[137] digest_0.6.25               sctransform_0.2.1
[139] RcppAnnoy_0.0.16            tsne_0.1-3
[141] softImpute_1.4              DDRTree_0.1.5
[143] leiden_0.3.3                uwot_0.1.8
[145] edgeR_3.28.1                kernlab_0.9-29
[147] gtools_3.8.1                lifecycle_0.2.0
[149] monocle_2.14.0              nlme_3.1-145
[151] jsonlite_1.6.1              Rhdf5lib_1.8.0
[153] clusterExperiment_2.6.1     viridisLite_0.3.0
[155] limma_3.42.2                pillar_1.4.3
[157] lattice_0.20-40             httr_1.4.1
[159] plotrix_3.7-7               survival_3.1-11
[161] glue_1.3.2                  qlcMatrix_0.9.7
[163] FNN_1.1.3                   png_0.1-7
[165] iterators_1.0.12            bit_1.1-15.2
[167] HDF5Array_1.14.3            stringi_1.4.6
[169] blob_1.2.1                  memoise_1.1.0
[171] caTools_1.18.0              dplyr_0.8.5
[173] irlba_2.3.3                 future.apply_1.4.0
[175] ape_5.3
koenvandenberge commented 4 years ago

Hi @trebbiano

I'm not sure how much we can help with this; tradeSeq is relying on BiocParallel to do the parallelization and it seems the error is there rather than in tradeSeq.

It does seem that fitGAM may error for some of your jobs given the 2 parallel jobs did not deliver results message. It might be worth trying to parallelize the fitting of only a small number of genes (say 50) with only a couple of knots (say 6) to see if that runs successfully. You may want to install the latest version from GitHub since we've implemented a new way to catch the fitGAM errors, maybe that helps.

Let us know if that works for you.

trebbiano commented 4 years ago

Thanks for the reply. I have updated to the latest tradeSeq, version 1.3.0. This time I see a different error:

> sce5k <- fitGAM(counts = countsOrdered, sds = crv1, nknots=15, verbose=TRUE, genes=1:5000, parallel=TRUE, BPPARAM=BPPARAM)
Adding 2265 jobs ...
Submitting 2265 jobs in 12 chunks using cluster functions 'Multicore' ...
Clearing registry ...
Error in names(res) <- nms :
  'names' attribute [5000] must be the same length as the vector [2265]

2265 is the number of cells in this dataset and is equivalent to the 23989 above, otherwise the datasets are equivalent.

I also tried the sub-sampling at lower knots number as you suggested, which produces a yet different error:

> sce100knots6 <- fitGAM(counts = countsOrdered, sds = crv1, nknots=6, verbose=TRUE, genes=1:100, BPPARAM=BPPARAM, parallel=TRUE)
Adding 2265 jobs ...                                                                                                                   
Submitting 2265 jobs in 12 chunks using cluster functions 'Multicore' ...                                                              
Clearing registry ...                                                                                                                  
Error: BiocParallel errors                                                                                                             
  element index: 1, 2, 3, 4, 5, 6, ...                                                                                                 
  first error: unused arguments (B2M = 120, TMSB4X = 48, MT.CO1 = 59, RPL41 = 48, EEF1A1 = 57, RPL13 = 55, RPS12 = 46, RPL10 = 41, MT.A
TP6 = 52, MT.CO2 = 49, RPLP1 = 63, TPT1 = 35, RPS27 = 51, RPS18 = 55, MT.CO3 = 31, RPL28 = 39, RPS19 = 17, RPS15A = 32, MT.CYB = 31, RP
S27A = 33, RPS2 = 24, RPL34 = 25, RPL30 = 46, RPS14 = 34, TMSB10 = 25, RPL32 = 31, RPS3 = 26, RPS23 = 26, RPS24 = 29, RPS8 = 28, RPL19 
= 24, GNLY = 0, RPS3A = 22, RPL11 = 26, RPL3 = 20, ACTB = 22, RPL39 = 22, CCL5 = 8, RPL37 = 19, PTMA = 30, RPL26 = 24, RPL21 = 28, RPS2
9 = 19, MT.ND3 = 27, RPL13A = 38, RPS6 = 23, RPL18A = 26, RPS28 = 25, RPLP2 = 18, MT.ND4 = 28, RPL12 = 15, RPS15 = 21, RPL7A = 11, AC00
4086.1 = 26, RPL23A = 23, RPS7 = 16, RPS25 = 16, RPS4X = 15, NKG7 = 1, RPL35A = 17, RPL14 = 10, RPL18 = 22, RPL27A = 27, HLA.A = 13, FA
U = 11, RPL9 = 21, RPL15 = 17, HLA.C = 12, RPL29 = 10, RPS16 = 25, HLA.B = 12, RPS21 = 20, RPL37A = 16, RPL5 = 16, RPS9 = 1            

I never encountered these issues with the parallel mode switched off.

koenvandenberge commented 4 years ago

Hi @trebbiano

Thank you for checking that. The error may be a bug in subsetting the genes, I'l llook into that and keep you updated. Considering parallelization, could you please check if the following runs, just to check whether the error there isn't coming from the subsetting using the genes argument, too:

sce5k <- fitGAM(counts = countsOrdered[1:200,], sds = crv1, nknots=6, verbose=TRUE, parallel=TRUE, BPPARAM=BPPARAM)
trebbiano commented 4 years ago

For 200 genes, this direct subsetting gives the same error message as using genes=1:200:

> sce200Sub <- fitGAM(counts = countsOrdered[1:200,], sds = crv1, nknots=6, verbose=TRUE, parallel = TRUE, BPPARAM=BPPARAM) 
Adding 2265 jobs ...
Submitting 2265 jobs in 12 chunks using cluster functions 'Multicore' ...
Clearing registry ...
Error: BiocParallel errors
  element index: 1, 2, 3, 4, 5, 6, ...
  first error: unused arguments (B2M = 120, TMSB4X = 48, MT.CO1 = 59, RPL41 = 48, EEF1A1 = 57, RPL13 = 55, RPS12 = 46, RPL10 = 41, MT.ATP6 = 52, MT.CO2 = 49, RPLP1 = 63, TPT1 = 35, RPS27 = 51, RPS18 = 55, MT.CO3 = 31, RPL28 = 39, RPS19 = 17, RPS15A = 32, MT.CYB = 31, RPS27A = 33, RPS2 = 24, RPL34 = 25, RPL30 = 46, RPS14 = 34, TMSB10 = 25, RPL32 = 31, RPS3 = 26, RPS23 = 26, RPS24 = 29, RPS8 = 28, RPL19 = 24, GNLY = 0, RPS3A = 22, RPL11 = 26, RPL3 = 20, ACTB = 22, RPL39 = 22, CCL5 = 8, RPL37 = 19, PTMA = 30, RPL26 = 24, RPL21 = 28, RPS29 = 19, MT.ND3 = 27, RPL13A = 38, RPS6 = 23, RPL18A = 26, RPS28 = 25, RPLP2 = 18, MT.ND4 = 28, RPL12 = 15, RPS15 = 21, RPL7A = 11, AC004086.1 = 26, RPL23A = 23, RPS7 = 16, RPS25 = 16, RPS4X = 15, NKG7 = 1, RPL35A = 17, RPL14 = 10, RPL18 = 22, RPL27A = 27, HLA.A = 13, FAU = 11, RPL9 = 21, RPL15 = 17, HLA.C = 12, RPL29 = 10, RPS16 = 25, HLA.B = 12, RPS21 = 20, RPL37A = 16, RPL5 = 16, RPS9 = 1

For completeness I also tried the direct subsetting for the larger gene set and higher knots which produced the same error as genes=1:5000:

> sce5kSub <- fitGAM(counts = countsOrdered[1:5000,], sds = crv1, nknots=15, verbose=TRUE, parallel = TRUE, BPPARAM=BPPARAM) Adding 2265 jobs ...
Submitting 2265 jobs in 12 chunks using cluster functions 'Multicore' ...
Clearing registry ...
Error in names(res) <- nms :
  'names' attribute [5000] must be the same length as the vector [2265]

Thanks for looking into this!

koenvandenberge commented 4 years ago

Hi @trebbiano

I'm not getting any errors when parallelizing the fitting using your settings (but reduced to 2 workers), also not when subsetting genes. I tried this by borrowing code from #37 which I paste below. Could you check if this errors for you? It may be worth updating tradeSeq to the latest version.

library(splatter)
library(tradeSeq)
library(SingleCellExperiment)
library(slingshot)

nGenes <- 60179
numCells <- 100
current_seed <- 328585
params <- newSplatParams()

#Simulate no genes to be DE such that we would expect a small number of rejections
SplatterSimObject <- splatSimulate(params,
                                   method="paths",
                                   nGenes=nGenes,
                                   batchCells=numCells,
                                   seed=current_seed,
                                   lib.loc=11.49293, de.prob = 0, de.facLoc = log(2)*3,
                                   out.prob = 0)

countsT <- counts(SplatterSimObject)

filt_func <- function(x){
  ncells_high_exp <- sum(x >= 10)
  return(ncells_high_exp)
}
rows_to_keep <- apply(countsT, 1, filt_func)
counts <- countsT[rows_to_keep > 10,]

FQnorm <- function(counts){
  rk <- apply(counts,2,rank,ties.method='min')
  counts.sort <- apply(counts,2,sort)
  refdist <- apply(counts.sort,1,median)
  norm <- apply(rk,2,function(r){ refdist[r] })
  rownames(norm) <- rownames(counts)
  return(norm)
}
norm_counts <- FQnorm(counts)

SCEObj <- SingleCellExperiment(assays = List(counts = counts, norm_counts = norm_counts))

pca <- prcomp(t(log1p(assays(SCEObj)$norm_counts)), scale. = FALSE)

rd <- pca$x[,1:2]

reducedDims(SCEObj) <- SimpleList(PCA = rd)

cl <- kmeans(rd, centers = 4)$cluster
colData(SCEObj)$kMeans <- cl

slingshot_results <- slingshot(SCEObj, clusterLabels = 'kMeans', reducedDim = 'PCA')

lin <- getLineages(SCEObj, clusterLabels = colData(slingshot_results)$kMeans, reducedDim = 'PCA')

crv <- SlingshotDataSet(getCurves(lin))

register(BatchtoolsParam(workers=2))
BPPARAM <- BiocParallel::bpparam()
BPPARAM

sce <- fitGAM(counts = counts, sds = crv, nknots = 6, verbose = TRUE, BPPARAM = BPPARAM, parallel = TRUE)

sce <- fitGAM(counts = counts, sds = crv, nknots = 6, genes=1:200, verbose = TRUE, BPPARAM = BPPARAM, parallel = TRUE)
trebbiano commented 4 years ago

Thanks for working on this issue. I was able to run this code using up to date release versions (+tradeSeq to 1.3.0) of the packages but I encountered a similar error as before. If you are not seeing the error it could be something specific to the BPPARAM variable in my environment. Here is the error:

> sce <- fitGAM(counts = counts, sds = crv, nknots = 6, verbose = TRUE, BPPARAM = BPPARAM, parallel = TRUE)
Adding 100 jobs ...
Submitting 100 jobs in 2 chunks using cluster functions 'Multicore' ...
Clearing registry ...
Error in names(res) <- nms :
  'names' attribute [4921] must be the same length as the vector [100]
> sce <- fitGAM(counts = counts, sds = crv, nknots = 6, genes=1:200, verbose = TRUE, BPPARAM = BPPARAM, parallel = TRUE)
Adding 100 jobs ...
Submitting 100 jobs in 2 chunks using cluster functions 'Multicore' ...
Clearing registry ...
Error in names(res) <- nms :
  'names' attribute [200] must be the same length as the vector [100]

Traceback:

> traceback()
8: BiocParallel::bplapply(as.data.frame(t(as.matrix(counts)[id,
       ])), counts_to_Gam, BPPARAM = BPPARAM)
7: BiocParallel::bplapply(as.data.frame(t(as.matrix(counts)[id,
       ])), counts_to_Gam, BPPARAM = BPPARAM)
6: .fitGAM(counts = counts, U = U, pseudotime = pseudotime, cellWeights = cellWeights,
       genes = genes, weights = weights, offset = offset, nknots = nknots,
       verbose = verbose, parallel = parallel, BPPARAM = BPPARAM,
       control = control, sce = sce, family = family)
5: .local(counts, ...)
4: fitGAM(counts = counts, sds = crv, nknots = 6, genes = 1:200,
       verbose = TRUE, BPPARAM = BPPARAM, parallel = TRUE)
3: fitGAM(counts = counts, sds = crv, nknots = 6, genes = 1:200,
       verbose = TRUE, BPPARAM = BPPARAM, parallel = TRUE)
2: .traceback(x)
1: traceback(sce <- fitGAM(counts = counts, sds = crv, nknots = 6,
       genes = 1:200, verbose = TRUE, BPPARAM = BPPARAM, parallel = TRUE))

Here is my BPPARAM:

> BPPARAM
class: BatchtoolsParam
  bpisup: FALSE; bpnworkers: 2; bptasks: 40; bpjobname: BPJOB
  bplog: FALSE; bpthreshold: INFO; bpstopOnError: TRUE
  bpRNGseed: NA; bptimeout: 2592000; bpprogressbar: TRUE
  bpexportglobals: TRUE
  bplogdir: NA
  bpresultdir: NA
  cluster type: multicore
  template: NA
  registryargs:
    file.dir: /gpfs/home/trebbiano/file772e5b48e335
    work.dir: getwd()
    packages: character(0)
    namespaces: character(0)
    source: character(0)
    load: character(0)
    make.default: FALSE
  saveregistry: FALSE
  resources:

Does this differ from yours? I would guess a hardware configuration difference but then why would that produce a different size vector? Could BiocParallel implement different methods for different hardware configurations even if BPPARAM are the same?

koenvandenberge commented 4 years ago

Hi @trebbiano, since you are experiencing the error using the same code but I cannot reproduce it I assume it must be something about either the system or the BPPARAM settings. However, I am currently unsure which BPPARAM parameters might cause the issues as I do not know enough about how each parameter influences the parallelization to give a useful response to that. However, I did recently run fitGAM using parallelization on a Linux compute cluster...

Here is my output for BPPARAM:

> BPPARAM
class: MulticoreParam
  bpisup: FALSE; bpnworkers: 6; bptasks: 0; bpjobname: BPJOB
  bplog: FALSE; bpthreshold: INFO; bpstopOnError: TRUE
  bpRNGseed: ; bptimeout: 2592000; bpprogressbar: FALSE
  bpexportglobals: TRUE
  bplogdir: NA
  bpresultdir: NA
  cluster type: FORK
trebbiano commented 4 years ago

That looks quite different! Let me look into what the parameter settings mean and if I can adjust them to look similar to yours. If they are not enforced by the hardware configuration it may be possible.

koenvandenberge commented 4 years ago

Hi @trebbiano Yesterday I've experienced the same error you've experienced, however, I knew that I ran that code before on that system (it was a Linux cluster). So I just tried again and then it worked... So it seems like something goes wrong with the parallelization in BiocParallel. It's possibly better to try a bit fewer cores to lower the probability of errors.