mllg / batchtools

Tools for computation on batch systems
https://mllg.github.io/batchtools/
GNU Lesser General Public License v3.0

findRunning() returns empty data frame even though top shows that things are still running #130

Closed HeidiSeibold closed 7 years ago

HeidiSeibold commented 7 years ago

I run a VM on our Linux cluster and findRunning() does not seem to work since it does not find any running jobs even though I see that the cores are still active and R is running.

getJobStatus shows this:

> a <- getJobStatus()
> summary(a)
     job.id          submitted          started                   
 Min.   :      1   Min.   :NA        Min.   :2017-07-19 13:46:10  
 1st Qu.: 975001   1st Qu.:NA        1st Qu.:2017-07-19 14:15:02  
 Median :1950000   Median :NA        Median :2017-07-19 14:45:28  
 Mean   :1950000   Mean   :NA        Mean   :2017-07-19 18:53:30  
 3rd Qu.:2925000   3rd Qu.:NA        3rd Qu.:2017-07-19 15:15:57  
 Max.   :3900000   Max.   :NA        Max.   :2017-07-26 04:48:53  
                   NA's   :3900000   NA's   :3855722              
      done                        error               memory       
 Min.   :2017-07-19 13:46:10   Length:3900000     Min.   : NA      
 1st Qu.:2017-07-19 14:15:02   Class :character   1st Qu.: NA      
 Median :2017-07-19 14:45:28   Mode  :character   Median : NA      
 Mean   :2017-07-19 19:01:40                      Mean   :NaN      
 3rd Qu.:2017-07-19 15:15:58                      3rd Qu.: NA      
 Max.   :2017-07-26 13:33:15                      Max.   : NA      
 NA's   :3855722                                  NA's   :3900000  
   batch.id           log.file           job.hash              repl      
 Length:3900000     Length:3900000     Length:3900000     Min.   :  1.0  
 Class :character   Class :character   Class :character   1st Qu.: 38.0  
 Mode  :character   Mode  :character   Mode  :character   Median : 75.5  
                                                          Mean   : 75.5  
                                                          3rd Qu.:113.0  
                                                          Max.   :150.0  

 time.queued       time.running     
 Length:3900000    Length:3900000   
 Class :difftime   Class :difftime  
 Mode  :numeric    Mode  :numeric  

I am not really sure how I can make a reproducible example for this problem. Let me know if I can help in any way.

> sessionInfo()
R version 3.4.1 (2017-06-30)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 16.04.2 LTS

Matrix products: default
BLAS: /usr/lib/libblas/libblas.so.3.6.0
LAPACK: /usr/lib/lapack/liblapack.so.3.6.0

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=de_CH.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=de_CH.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=de_CH.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=de_CH.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] grid      stats     graphics  grDevices utils     datasets  methods  
[8] base     

other attached packages:
 [1] stima_1.1         rpart_4.1-11      DynTxRegime_2.1   modelObj_3.0     
 [5] palmtree_0.0-0    partykit_1.2-0    mvtnorm_1.0-6     libcoin_0.9-3    
 [9] batchtools_0.9.3  data.table_1.10.4

loaded via a namespace (and not attached):
 [1] Formula_1.2-2     magrittr_1.5      splines_3.4.1     progress_1.1.2   
 [5] rappdirs_0.3.1    lattice_0.20-35   R6_2.2.2          brew_1.0-6       
 [9] tools_3.4.1       checkmate_1.8.3   base64url_1.2     survival_2.41-3  
[13] assertthat_0.2.0  digest_0.6.12     Matrix_1.2-10     stringi_1.1.5    
[17] compiler_3.4.1    backports_1.1.0   prettyunits_1.0.2
mllg commented 7 years ago

What cluster functions do you use? Does batchtools:::getBatchIds(reg) return anything useful?

HeidiSeibold commented 7 years ago

reg$cluster.functions = makeClusterFunctionsMulticore(ncpus = parallel::detectCores() - 2)

and

> batchtools:::getBatchIds(reg)
Empty data.table (0 rows) of 2 cols: batch.id,status
mllg commented 7 years ago

Any output here? You are running the commands inside the VM, right?

reg$cluster.functions$listJobsRunning(reg)
reg$cluster.functions$listJobsQueued(reg)
reg$status[ijoin(findSubmitted(), findNotDone()), "batch.id"]
HeidiSeibold commented 7 years ago
> reg$cluster.functions$listJobsRunning(reg)
Error: attempt to apply non-function
> reg$cluster.functions$listJobsQueued(reg)
Error: attempt to apply non-function
> reg$status[ijoin(findSubmitted(), findNotDone()), "batch.id"]
Syncing 209 files ...
Syncing 1 files ...
Empty data.table (0 rows) of 1 col: batch.id
berndbischl commented 7 years ago

enable debug mode?

mllg commented 7 years ago

reg$cluster.functions$listJobsRunning(reg)

  1. This function should be defined if you've set up your cluster functions successfully.
  2. Can you post the output of getStatus()?
  3. Have you started your jobs with CFMulticore and are you now trying to monitor them with CFInteractive?
  4. What does str(reg$cluster.functions) report?

Furthermore, can you please verify whether there are any batch.ids stored in the database?

unique(reg$status$batch.id)
HeidiSeibold commented 7 years ago
  2. Can you post the output of getStatus()?

    getStatus()
    Syncing 13 files ...
    Status for 3900000 jobs:
    Submitted :       0 (  0.0%)
    Queued    :       0 (  0.0%)
    Started   :   44502 (  1.1%)
    Running   :       0 (  0.0%)
    Done      :   44502 (  1.1%)
    Error     :       0 (  0.0%)
    Expired   :       0 (  0.0%)
  3. Have you started your jobs with CFMulticore and are you now trying to monitor them with CFInteractive?

I don't know. Maybe the problem arises because I started the R script that does the work with make (nohup make &). Could that be the case? The Makefile looks like this:

simulation_palmtree: simulation_all.R \
basis/dgp.R \
basis/methods.R \
basis/evaluation.R
        Rscript -e 'library("knitr"); stitch("simulation_all.R")'

Now I am loading the registry from a fresh R session:

library("batchtools")
reg <- loadRegistry("bt_simulation_palmtree/")
setDefaultRegistry(reg)
  4. What does str(reg$cluster.functions) report?
    str(reg$cluster.functions)
    List of 10
    $ name             : chr "Interactive"
    $ submitJob        :function (reg, jc)  
    $ killJob          : NULL
    $ listJobsQueued   : NULL
    $ listJobsRunning  : NULL
    $ array.var        : chr NA
    $ store.job        : logi FALSE
    $ scheduler.latency: num 0
    $ fs.latency       : num NA
    $ hooks            : list()
    - attr(*, "class")= chr "ClusterFunctions"
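
The str() output above lines up with the earlier "attempt to apply non-function" errors: the freshly loaded registry carries the "Interactive" cluster functions, whose listJobsRunning and listJobsQueued slots are NULL. Below is a minimal sketch of checking for this before calling them. The lists are mock objects that mimic only the printed structure, and `can_list_running` is a hypothetical helper, not part of batchtools.

```r
# Mock of the structure printed by str(reg$cluster.functions) above.
# In a real session this object would come from the registry itself.
cf_interactive <- list(
  name            = "Interactive",
  submitJob       = function(reg, jc) NULL,
  killJob         = NULL,
  listJobsQueued  = NULL,
  listJobsRunning = NULL
)

# Hypothetical helper: TRUE only if the backend supplies a function
# for listing running jobs.
can_list_running <- function(cf) {
  is.function(cf$listJobsRunning)
}

# A second mock backend that does provide the listing function.
cf_with_listing <- cf_interactive
cf_with_listing$name <- "MockBackend"
cf_with_listing$listJobsRunning <- function(reg) character(0)

can_list_running(cf_interactive)   # FALSE: calling the slot would fail
can_list_running(cf_with_listing)  # TRUE
```

With a real registry, the same check would be `is.function(reg$cluster.functions$listJobsRunning)`.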
HeidiSeibold commented 7 years ago

But even if I set the cluster.functions to what it was, I get the same problem:

> reg$cluster.functions = makeClusterFunctionsMulticore(ncpus = parallel::detectCores() - 2)
> getStatus()
Syncing 2 files ...
Status for 3900000 jobs:
  Submitted :       0 (  0.0%)
  Queued    :       0 (  0.0%)
  Started   :   45153 (  1.2%)
  Running   :       0 (  0.0%)
  Done      :   45153 (  1.2%)
  Error     :       0 (  0.0%)
  Expired   :       0 (  0.0%)
mllg commented 7 years ago

getStatus()
Syncing 13 files ...
Status for 3900000 jobs:
  Submitted :       0 (  0.0%)
  Queued    :       0 (  0.0%)
  Started   :   44502 (  1.1%)
  Running   :       0 (  0.0%)
  Done      :   44502 (  1.1%)
  Error     :       0 (  0.0%)
  Expired   :       0 (  0.0%)

The database is inconsistent: every started job should also be recorded as submitted.

I don't know what exactly caused this. This often occurs (a) if you move around the file.dir between systems while jobs are still running or (b) if you have mounted the file system and access the registry on other systems.

Things to consider:

  1. Always use the same cluster function implementation per system. Mixing them may lead to inconsistencies. It is probably best to set them up in your config file and not touch them via reg$cluster.functions. Also, detecting running jobs with a different backend is not possible.
  2. Be careful while copying files, and do not copy back-and-forth. If possible, define experiments on the remote cluster system and periodically rsync to your local system for analysis. Do not mount the file system and load the registry locally.
  3. I'm still not sure whether your database holds batch ids; have a look at unique(reg$status$batch.id). Because they are recorded in submitJobs together with the submit times, I guess they were lost for some reason. The good news is that batch ids are only needed to detect what is currently running; you no longer need them for jobs that have already terminated (44502 in your case).
  4. To repair the inconsistency caused by the missing submit times, you can just do this:
    reg$status[is.na(submitted) & !is.na(started), submitted := started]
    saveRegistry(reg)
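
To see what the repair in step 4 does before touching a real registry, here is a small sketch on a toy table. The `status` table is a made-up stand-in for reg$status with only three columns, and storing the timestamps as numeric epoch seconds is an assumption made for illustration. Only the data.table package is needed.

```r
library(data.table)

# Toy stand-in for reg$status (hypothetical values).
# Jobs 1 and 2 were started but have no submit time, job 3 is
# consistent, and job 4 has not been started yet.
status <- data.table(
  job.id    = 1:4,
  submitted = c(NA, NA, 1500471960, NA),
  started   = c(1500471970, 1500473702, 1500471965, NA)
)

# Count the rows that are started but not recorded as submitted.
n_before <- status[is.na(submitted) & !is.na(started), .N]

# Same repair expression as in step 4: copy the start time into the
# missing submit time, by reference.
status[is.na(submitted) & !is.na(started), submitted := started]

# After the repair no started job lacks a submit time, while the
# unstarted job 4 is left untouched.
n_after <- status[is.na(submitted) & !is.na(started), .N]
print(status)
```

Running the count on the real reg$status first shows how many rows the repair would change, without modifying anything.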
HeidiSeibold commented 7 years ago

Ah ok, I think (b) caused my problem then.

Thanks for the help! :cake:

mllg commented 7 years ago

I'll write down some recommendations in the vignette for the next release... You were not the first one mounting the file.dir :smile: