Closed MahmAbdelwahab closed 2 months ago
Hi @MahmAbdelwahab -
A couple of thoughts:
mod <- mread(..., soloc = "build-dir")
This makes sure to build the model locally rather than in the temporary directory specific to your (main) R session.
Rather than plan(multisession), can you try plan(callr) from the future.callr package? A bit of a long shot, but sometimes we have issues with multisession.
Could you try caching the compiled model and then reading it back for each chunk? I know this is a bit inelegant and probably what you are trying to avoid, but that was the motivation for implementing those features (cache, soloc, etc.).
So if you load and cache the model locally prior to starting the parallel job:
mod <- mread_cache(..., soloc = "build-dir")
Then for each chunk it's the same call, but when you do this on the chunk, it's a quick read with no compile:
mod <- mread_cache(..., soloc = "build-dir")
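As a rough sketch of how the cache-then-read pattern could fit together with plan(callr): the built-in "pk1" model from modlib(), the two workers, and the dose values below are placeholders chosen for illustration, not part of the original setup.

```r
library(mrgsolve)
library(future)
library(future.callr)

# Assumption: a build directory under the working directory that every worker can see
build_dir <- file.path(getwd(), "build-dir")
dir.create(build_dir, showWarnings = FALSE)

# Compile once and cache before starting the parallel job
mod <- mread_cache("pk1", modlib(), soloc = build_dir)

plan(callr, workers = 2)

doses <- c(100, 200)  # placeholder doses, one per chunk
futs <- lapply(doses, function(d) {
  future({
    # Same call on the worker: reads the cached model instead of recompiling
    mod_w <- mrgsolve::mread_cache("pk1", mrgsolve::modlib(), soloc = build_dir)
    out <- mrgsolve::mrgsim(mod_w, events = mrgsolve::ev(amt = d), end = 24)
    as.data.frame(out)
  })
})
res <- lapply(futs, value)  # one data frame per chunk

plan(sequential)
```

Loading the model inside each worker avoids shipping the model object from the main session as a global, which is where the multisession problem appeared to come from.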
Could you let me know what happens? If it's still not working, please email me at my GitHub email address and we can meet on Zoom to look at this.
Kyle
Hello @kylebaron
Setting plan(callr) has solved the issue! I have also tested loading the cached model in every chunk and saw no significant difference in computation time.
Many thanks for your help!
Best,
Mahmoud
Thanks for reporting back, @MahmAbdelwahab and glad it got resolved.
Wondering if you'd be willing to share the relevant parts of your setup? I did this a long time ago with future.batchtools on SGE, but it got unstable on our system. It sounds like your setup is working well apart from the multisession issue.
Kyle
Hello @kylebaron,
Here are the relevant parts of the setup; I will try to post a full example later if needed.
# packages used below (not shown in the original snippet)
library(future)
library(future.batchtools)
library(future.callr)
library(doFuture)
library(doRNG)
library(mrgsolve)
library(dplyr) # for %>%

# setting up slurm plan
slurm <- future::tweak(future.batchtools::batchtools_slurm,
  template = system.file("templates/slurm-simple.tmpl", package = "batchtools"),
  workers = 2,
  resources = list(
    partition = "general",
    walltime = 60 * 5,
    ncpus = 4
  )
)

nsims <- 1E6 # number of simulated patients/profiles

# chunking the nsims
# function taken from https://cran.r-project.org/web/packages/bhmbasket/bhmbasket.pdf (used internally)
# bhmbasket:::chunkVector
chunkVector <- function(x, n_chunks) {
  if (n_chunks <= 1) {
    chunk_list <- list(x)
  } else {
    chunk_list <- unname(split(x, cut(seq_along(x), n_chunks, labels = FALSE)))
  }
  return(chunk_list)
}

set.seed() # seed value omitted here; the seed needs to be set outside the foreach call

plan(list(slurm, callr))
# plan(list(slurm, multisession)) # ran into some issues with loading model object in the worker node
registerDoFuture()

# Ntasks is not defined in this snippet; presumably the number of tasks to split (e.g. nsims)
chunk_outer <- chunkVector(seq_len(Ntasks), getDoParWorkers())

sim_results <-
  foreach(k = chunk_outer, .combine = c) %dorng% { # uses slurm plan
    chunk_inner <- chunkVector(k, getDoParWorkers())
    foreach(j = chunk_inner, .combine = c) %dorng% { # uses multisession/callr plan
      lapply(j, function(x) {
        # dose and amt values were left blank in the original post
        sim_chunk <- expand.ev(
          ID = x,
          dose =
          amt =
          ii = ii,
        )
        mrgsim(mod, sim_chunk) %>% ..
      })
    }
  }
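To see what the chunking helper produces, here is a small base-R check. The function is repeated so the snippet runs on its own; the vector 1:10 and the chunk counts are arbitrary values picked for illustration.

```r
# Same helper as in the setup, repeated so this snippet is self-contained
chunkVector <- function(x, n_chunks) {
  if (n_chunks <= 1) {
    chunk_list <- list(x)
  } else {
    chunk_list <- unname(split(x, cut(seq_along(x), n_chunks, labels = FALSE)))
  }
  return(chunk_list)
}

ids <- 1:10

# Split into 3 roughly equal, contiguous chunks
chunks <- chunkVector(ids, 3)
print(lengths(chunks))

# With a single worker everything stays in one chunk
single <- chunkVector(ids, 1)

# Nested use: split one outer chunk again for the inner workers
inner <- chunkVector(chunks[[1]], 2)
```

Because the chunks are contiguous and cover the input exactly once, concatenating them recovers the original ID vector, so every simulated ID is assigned to exactly one worker.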
Additionally, you can wrap the whole foreach code block in future({}) or future_promise({}) and run the code without blocking the main R session. I think it is then possible to submit multiple independent nested foreach simulation setups, but I haven't fully tested that yet.
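A minimal sketch of the non-blocking idea, assuming a local callr plan with two workers; the body of the future is a stand-in for the nested foreach block, not the actual simulation code.

```r
library(future)
library(future.callr)

plan(callr, workers = 2)

# Placeholder for the long-running nested foreach block
f <- future({
  Sys.sleep(1)
  sum(seq_len(10))
})

# Control returns to the main session right away;
# resolved() checks on the future without blocking
done <- resolved(f)

# value() blocks until the result is ready
result <- value(f)

plan(sequential)
```

One thing to keep in mind: the wrapping future itself uses the first level of a nested plan, so with something like plan(list(slurm, callr)) the plan list would likely need an extra leading entry for the wrapper.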
Mahmoud
Hello everyone,
I am setting up a large simulation workflow and am using an HPC cluster to submit the jobs. The workflow is as follows:
What I noticed is that with the above steps/workflow I get the following error:
MultisessionFuture (doFuture2-1) failed to receive message results from cluster RichSOCKnode #1 (PID 14433 on localhost ‘localhost’). The reason reported was ‘error reading from connection’. Post-mortem diagnostic: No process exists with this PID, i.e. the localhost worker is no longer alive. The total size of the 4 globals exported is 396.99 KiB. The three largest globals are ‘modList’ (382.41 KiB of class ‘list’), ‘...future.x_ii’ (7.86 KiB of class ‘list’) and ‘makeEventDataset’ (6.18 KiB of class ‘function’) Calls: %dofuture% -> doFuture2
If I move the model code into the innermost foreach loop (compiling the model for each chunk), the workflow works fine. It also works when using future_mrgsim_d (with nchunk set to 1, but maybe that's not an issue).
Any idea what causes this behavior?
Best,
Mahmoud