mllg / batchtools

Tools for computation on batch systems
https://mllg.github.io/batchtools/
GNU Lesser General Public License v3.0

significance of default.sleep and its effect on interactive job #107

Closed smilesun closed 7 years ago

smilesun commented 7 years ago

Hi, I ran a project using batchtools on a single CPU, and for a day some simple jobs did not stop at all. Looking at getJobTable(), I found that those jobs ran for only 2 seconds and then produced an error. After looking at the documentation, I found the following:

default.sleep = function(i) {
  5 + 115 * pexp(i - 1, rate = 0.01)
  # pexp(q, rate = 1, lower.tail = TRUE, log.p = FALSE), where q is the vector of quantiles
  # 'pexp' gives the distribution function
}

My question is: is this default.sleep function causing this behavior? Could someone explain the advantage of doing so?
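As a rough illustration (not part of the original report), the quoted function can be evaluated in base R to see which values it produces. Since pexp(q, rate) is the exponential distribution function 1 - exp(-rate * q), the result starts at 5 for i = 1 and approaches, but never reaches, 120 as i grows. The values of i below are chosen arbitrarily for the sketch.

```r
# The function as quoted from the documentation.
default.sleep <- function(i) {
  5 + 115 * pexp(i - 1, rate = 0.01)
}

# Arbitrary iteration counts, picked only for illustration.
i <- c(1, 10, 100, 1000)
sleeps <- default.sleep(i)

# One row per iteration count; values grow monotonically towards 120.
print(data.frame(i = i, sleep = round(sleeps, 2)))
```

The first value is exactly 5 because pexp(0, rate = 0.01) is 0, so the function only describes a waiting time that grows with i, with 120 as its upper limit.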

berndbischl commented 7 years ago

please post getStatus output

smilesun commented 7 years ago
> getStatus()
Status for 10 jobs:
  Submitted : 10 (100.0%)
  Queued    :  0 (  0.0%)
  Started   : 10 (100.0%)
  Running   :  0 (  0.0%)
  Done      :  6 ( 60.0%)
  Error     :  4 ( 40.0%)
  Expired   :  0 (  0.0%)
berndbischl commented 7 years ago

and the error messages please, of the jobs

smilesun commented 7 years ago
Applying algorithm 'default_fda' on problem 'march.2017.task4' ...
Error in requirePackages(package, why = stri_paste("learner", id, sep = " "),  : 
  For learner fdaclassif.np please install the following packages: fda.usc

### [bt 2017-05-03 10:40:28]: Job terminated with an exception [batchtools job.id=7]
### [bt 2017-05-03 10:40:28]: Calculation finished!
smilesun commented 7 years ago

Actually, I must terminate the interactive R session to see those errors. Output of getJobTable():

 7:      1 secs       2 secs march.2017.task4 default_fda     fdaclassif.np
 8:      0 secs       2 secs march.2017.task4 default_fda    fdaclassif.glm
 9:      0 secs       2 secs march.2017.task4 default_fda    fdaclassif.knn
10:      0 secs       2 secs march.2017.task4 default_fda fdaclassif.kernel

Those jobs did not run; they just paused there.

berndbischl commented 7 years ago

well. why are you then claiming

some simple jobs did not stop at all

all of your jobs have stopped. some with state "done", some with state "error". and you can clearly see the error. you are missing a dependency package for some jobs.

where exactly is the problem? and hint: you should have run more calls of "testJob" locally, to see this package error sooner. it is a very common mistake.
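A minimal sketch of that hint, assuming the batchtools package is installed: a job is run with testJob() in the current session before anything is submitted, so a missing dependency surfaces immediately. The job function needs_pkg and the package name "somePkgNotInstalled" are hypothetical stand-ins, not taken from the project discussed here.

```r
# Sketch only: requires batchtools. The job function and the package
# name "somePkgNotInstalled" are hypothetical.
library(batchtools)

# Temporary registry (file.dir = NA), not registered as the default one.
reg <- makeRegistry(file.dir = NA, make.default = FALSE)

# A job that fails when a dependency is missing.
needs_pkg <- function(pkg) {
  if (!requireNamespace(pkg, quietly = TRUE)) {
    stop("please install the following packages: ", pkg)
  }
  TRUE
}
batchMap(needs_pkg, pkg = "somePkgNotInstalled", reg = reg)

# Run job 1 locally; capture the error message instead of aborting.
msg <- tryCatch(
  testJob(id = 1, external = FALSE, reg = reg),
  error = function(e) conditionMessage(e)
)
print(msg)
```

With external = TRUE the job would instead run in a fresh R process, which is closer to what happens on the cluster.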

smilesun commented 7 years ago

The point is that I have to terminate the interactive R session to see this error log. I am using the snickers server for batchtools. Even when I use testJob(), it does not stop, so the user never knows what happened there. If you kill the jobs too early, you may have cut the process short. I will try to reproduce the problem when I have time.

berndbischl commented 7 years ago

further info:

the default.sleep function should have nothing to do with what you are asking about. bt resubmits jobs in case of temporary cluster errors. but this is something very rare and also different to what you are experiencing here: in your case it is a normal R exception. in this case bt does not resubmit anything.

also, temporary cluster errors need to be defined in your cluster functions. this is something like "cluster is currently busy, please resubmit job again later". on a well operating cluster this should nearly never happen.

but all of this is just info on the side. what happens in your case is something else and much simpler.

berndbischl commented 7 years ago

The point is that I have to terminate the interactive R session to see this error log. I am using the snickers server for batchtools. Even when I use testJob(), it does not stop, so the user never knows what happened there. If you kill the jobs too early, you may have cut the process short. I will try to reproduce the problem when I have time.

ok, but this is something else and you did not write about this in your first post. please show this example to janek and me tomorrow. it will be something related just to the snickers server and your config (i guess)

berndbischl commented 7 years ago

and:

Even if I use testJob(), it did not stop

so what does happen? does testJob "block" the process? is there a difference between "external = TRUE / FALSE" for testJob?

(neither should block)

berndbischl commented 7 years ago

we resolved this