Closed HeidiSeibold closed 7 years ago
What cluster functions do you use? Does batchtools:::getBatchIds(reg)
return anything useful?
reg$cluster.functions = makeClusterFunctionsMulticore(ncpus = parallel::detectCores() - 2)
and
> batchtools:::getBatchIds(reg)
Empty data.table (0 rows) of 2 cols: batch.id,status
Any output here? You are running the commands inside the VM, right?
reg$cluster.functions$listJobsRunning(reg)
reg$cluster.functions$listJobsQueued(reg)
reg$status[ijoin(findSubmitted(), findNotDone()), "batch.id"]
> reg$cluster.functions$listJobsRunning(reg)
Error: attempt to apply non-function
> reg$cluster.functions$listJobsQueued(reg)
Error: attempt to apply non-function
> reg$status[ijoin(findSubmitted(), findNotDone()), "batch.id"]
Syncing 209 files ...
Syncing 1 files ...
Empty data.table (0 rows) of 1 col: batch.id
Can you enable debug mode?
reg$cluster.functions$listJobsRunning(reg)
- Can you post the output of getStatus()?
- What does str(reg$cluster.functions) report?
Furthermore, can you please verify whether any batch.ids are stored in the database?
unique(reg$status$batch.id)
Can you post the output of getStatus()?
> getStatus()
Syncing 13 files ...
Status for 3900000 jobs:
Submitted : 0 ( 0.0%)
Queued : 0 ( 0.0%)
Started : 44502 ( 1.1%)
Running : 0 ( 0.0%)
Done : 44502 ( 1.1%)
Error : 0 ( 0.0%)
Expired : 0 ( 0.0%)
Did you start your jobs with CFMulticore, and are you now trying to monitor them with CFInteractive?
I don't know.
Maybe the problem arises because I started the R script that does the work with make (nohup make &). Could that be the case? The Makefile looks like this:
simulation_palmtree: simulation_all.R \
	basis/dgp.R \
	basis/methods.R \
	basis/evaluation.R
	Rscript -e 'library("knitr"); stitch("simulation_all.R")'
Now I am loading the registry from a fresh R session:
library("batchtools")
reg <- loadRegistry("bt_simulation_palmtree/")
setDefaultRegistry(reg)
- What does str(reg$cluster.functions) report?
> str(reg$cluster.functions)
List of 10
 $ name             : chr "Interactive"
 $ submitJob        :function (reg, jc)
 $ killJob          : NULL
 $ listJobsQueued   : NULL
 $ listJobsRunning  : NULL
 $ array.var        : chr NA
 $ store.job        : logi FALSE
 $ scheduler.latency: num 0
 $ fs.latency       : num NA
 $ hooks            : list()
 - attr(*, "class")= chr "ClusterFunctions"
But even if I set the cluster.functions to what it was, I get the same problem:
> reg$cluster.functions = makeClusterFunctionsMulticore(ncpus = parallel::detectCores() - 2)
> getStatus()
Syncing 2 files ...
Status for 3900000 jobs:
Submitted : 0 ( 0.0%)
Queued : 0 ( 0.0%)
Started : 45153 ( 1.2%)
Running : 0 ( 0.0%)
Done : 45153 ( 1.2%)
Error : 0 ( 0.0%)
Expired : 0 ( 0.0%)
The database is not consistent. Every started job should also be recorded as submitted.
I don't know what exactly caused this. This often occurs (a) if you move around the file.dir between systems while jobs are still running or (b) if you have mounted the file system and access the registry on other systems.
Things to consider:
- reg$cluster.functions. Also, detecting running jobs with a different backend is not possible.
- unique(reg$status$batch.id). Because they are recorded in submitJobs together with the submit times, I guess they are lost for some reason. The good news is that you do not need them anymore after your jobs have terminated (44502 in your case), only to detect what is currently running.
reg$status[is.na(submitted) & !is.na(started), submitted := started]
saveRegistry(reg)
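To illustrate what that update does, here is a minimal sketch on a toy table. It assumes only the data.table package; the table `status` and its values are made up for illustration and merely stand in for reg$status, whose real columns may differ.

```r
library(data.table)

# Hypothetical stand-in for reg$status: the timestamps are invented.
status <- data.table(
  job.id    = 1:4,
  submitted = c(100, NA, NA, NA),
  started   = c(110, 120, 130, NA)
)

# Jobs that were started but never recorded as submitted (the inconsistency).
inconsistent <- status[is.na(submitted) & !is.na(started), job.id]

# Same expression as the fix: copy the start time into the missing submit
# time. `:=` modifies the table by reference, so no reassignment is needed.
status[is.na(submitted) & !is.na(started), submitted := started]
```

Jobs that were never started keep their missing submit time, so only the inconsistent rows are touched. Checking the selection first, as `inconsistent` does here, is a cheap way to see how many rows would change before saving the registry.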
Ah ok, I think (b) caused my problem then.
Thanks for the help! :cake:
I'll write down some recommendations in the vignette for the next release... You were not the first one mounting the file.dir :smile:
I run a VM on our Linux cluster, and findRunning() does not seem to work: it finds no running jobs even though I can see that the cores are still active and R is running.
getJobStatus shows this:
I am not really sure how I can make a reproducible example for this problem. Let me know if I can help in any way.