stan-dev / rstan

RStan, the R interface to Stan
https://mc-stan.org

Issue running rstan with BiocParallel MulticoreParam #903

Open phauchamps opened 3 years ago

phauchamps commented 3 years ago

Summary:

When running the same Stan model in parallel on a large number of datasets across multiple cores, using the BiocParallel package with the MulticoreParam back-end, I get an error message related to the loading of shared objects.

Description:

In the context of a proteomics research project, I’d like to run the same model on many different datasets (1000+) in parallel on several cores (6 in my case). For this I am trying to use BiocParallel with its MulticoreParam back-end (on Linux). Note that I do not use Stan's own parallelization feature (cores = 1).

This works fine for a fairly reasonable number of models (100), but when I increase the number of models further while keeping the same number of cores, I systematically get an error message:

Error: BiocParallel errors
  element index: 271 (or other element index depending on run)
  unable to load shared object ‘tmp/Rtmp7TxrWl/file249d104be97c44.so’
  tmp/Rtmp7TxrWl/file249d104be97c44.so : file too short

and as soon as this happens the rest of the jobs all fail with the same type of error message.

I tried to run the batch in serial mode (SerialParam in BiocParallel) and this works fine, so it is unlikely to be due to the data specifics of one model in the series.

Since I suspected it might be related to a resource shortage issue (e.g. memory), I also tried to decrease the number of cores used in order to limit the number of jobs run simultaneously, but even with 2 cores the issue appears. I also tried to decrease the number of chain iterations to a very low number but again the issue is still there.
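For readers who want to repeat the serial versus multicore comparison described above, here is a minimal sketch of switching the BiocParallel back-end in one place. The helper name `makeBackend` and the toy task are assumptions made for illustration; they stand in for the real model fits and are not part of the original script.

```r
library(BiocParallel)

# Hypothetical helper: choose the back-end from the requested worker count,
# so the same call can run serially (SerialParam) or forked (MulticoreParam).
makeBackend <- function(nWorkers) {
  if (nWorkers <= 1) SerialParam() else MulticoreParam(workers = nWorkers)
}

# Toy task standing in for one model fit (assumption, not the real job).
toyTask <- function(index) index^2

# Same call, different back-ends: only the BPPARAM argument changes.
resSerial <- bplapply(1:10, toyTask, BPPARAM = makeBackend(1))
resForked <- bplapply(1:10, toyTask, BPPARAM = makeBackend(2))
```

Keeping the back-end choice in a single function makes it easy to check whether a failure depends on forking at all, which is the comparison reported in this post.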

Has anyone experienced the same kind of issue in the past and found a solution?

Reproducible Steps:

It is difficult to provide a reproducible example, since I understand that reproducing the error requires the same environment (OS etc.).

RStan Version:

2.21.2

R Version:

4.0.3

Operating System:

Manjaro Linux 20.2.1

phauchamps commented 3 years ago

I finally created a simpler case that allowed me to reproduce the error on a smaller scale. While experimenting with it, I noticed that when I first removed the precompiled model (.rds file) from disk and let Stan recompile the model before launching the tasks, the compiled model was shared with the different tasks without any error occurring. When I instead read the precompiled model from disk, the error described above occurred systematically.

I think the mistake probably lies in the following piece of code :

# 1. check that modelScript.stan exists
stanScriptFile <- paste0(modelScript, ".stan")
if (!file.exists(stanScriptFile)) stop(paste0(stanScriptFile, " does not exist!"))

# 2. check if modelScript.rds exists.
# 3. if not, compile it. Then save it as rds.
# 4. if modelScript.rds exists, make sure it is more recent than modelScript.stan.
#    if more recent, load it, otherwise compile as in step 3
stanModelFile <- paste0(modelScript, ".rds")
compile <- TRUE
if (file.exists(stanModelFile)) {
  fileTimes <- file.mtime(c(stanScriptFile, stanModelFile))
  if (fileTimes[2] > fileTimes[1]) compile <- FALSE
}

if (compile) {
  cat(paste0("Compiling Stan script : ", stanScriptFile, "\n"))
  stanc_ret <- stanc(file = stanScriptFile, verbose = TRUE)

  stan_mod <- stan_model(stanc_ret = stanc_ret, verbose = TRUE, auto_write = TRUE)
  cat("Model compilation successful! Writing model on disk...\n")
  saveRDS(object = stan_mod, file = stanModelFile)
  cat("Done!\n")
} else {
  cat(paste0("Found an updated Stan model : ", stanModelFile, "\n"))
  cat("Loading...")
  stan_mod <- readRDS(file = stanModelFile)
  cat("Done!\n")
}
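The caching rule in steps 1 to 4 can be isolated from the Stan calls so that it can be checked on its own. The following is a minimal sketch using only base R; the function name `needsCompile` is hypothetical and does not appear in the original script.

```r
# Hypothetical helper (not in the original script): decide whether the Stan
# script must be recompiled, based on the file modification times.
needsCompile <- function(stanScriptFile, stanModelFile) {
  if (!file.exists(stanScriptFile)) {
    stop(paste0(stanScriptFile, " does not exist!"))
  }
  # No cached model on disk yet: compile.
  if (!file.exists(stanModelFile)) return(TRUE)
  fileTimes <- file.mtime(c(stanScriptFile, stanModelFile))
  # Recompile unless the cached .rds is strictly newer than the .stan file.
  !(fileTimes[2] > fileTimes[1])
}
```

With such a helper, the script above would reduce to `compile <- needsCompile(stanScriptFile, stanModelFile)`. This only restructures the freshness check; it does not by itself change how the cached model is loaded or shared between workers.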

# then call BiocParallel

used_bp <- bpparam()

cat("Launching all simple STAN models in parallel...\n")
modIndex <- 1:nModels
res <- bpmapply(FUN = runOneModel,
                index = modIndex,
                MoreArgs = list(compiledSTANModel = stan_mod,
                                seed = seed,
                                stanSamplingArgs = stanSamplingArgs),
                SIMPLIFY = FALSE,
                BPPARAM = used_bp)