mschubert / clustermq

R package to send function calls as jobs on LSF, SGE, Slurm, PBS/Torque, or each via SSH
https://mschubert.github.io/clustermq/
Apache License 2.0

Repeatedly calling Q() in clustermq 0.9.0 leads to "job not found" messages #307

Closed by luwidmer 7 months ago

luwidmer commented 9 months ago

This minimal example leads to "job not found" messages:

library(clustermq)
fun <- function(x) {x}

fun(1)
Q(fun = fun, x = 1, n_jobs = 1)
Q(fun = fun, x = 1, n_jobs = 1)

The calls themselves complete without errors, but on LSF they produce a lot of additional output about jobs being terminated or not being found:

Submitting 1 worker jobs (ID: cmq6452) ...
Running 1 calculations (5 objs/34.2 Kb common; 1 calls/chunk) ...
Master: [1.2 secs 16.1% CPU]; Worker: [avg 49.9% CPU, max 228 Mb]
Submitting 1 worker jobs (ID: cmq7581) ...
Job <7420436[1]> is being terminated
Running 1 calculations (5 objs/34.2 Kb common; 1 calls/chunk) ...
Master: [2.1 secs 4.8% CPU]; Worker: [avg 61.0% CPU, max 228 Mb]
Job <cmq7581> is not found

Is it possible there is some issue with job cleanup?

mschubert commented 9 months ago

@luwidmer is this in any way changed after fixing https://github.com/mschubert/clustermq/issues/308?

luwidmer commented 9 months ago

@mschubert I ran the last tests while on vacation; for this one I should be able to get back to you on Oct 18th.

mschubert commented 9 months ago

Thanks @luwidmer! I have to submit the new version tomorrow to avoid getting archived on CRAN, so it would be great if we could include this.

luwidmer commented 9 months ago

Hi @mschubert, sorry, I had some automake issues, which I was able to fix. I still get the following messages with the example script below:

Running 100 calculations (5 objs/34.4 Kb common; 1 calls/chunk) ...
Master: [2.5 secs 5.7% CPU]; Worker: [avg 85.5% CPU, max 228 Mb]
Submitting 2 worker jobs (ID: cmq9087) ...
Job <7431990[1-2:1]>: Job has already finished
Running 100 calculations (5 objs/34.4 Kb common; 1 calls/chunk) ...
Master: [3.3 secs 4.6% CPU]; Worker: [avg 91.6% CPU, max 228 Mb]
Submitting 2 worker jobs (ID: cmq7734) ...
Job <7431991[1-2:1]>: Job has already finished

fun <- function(x) {
  if (packageVersion("clustermq") != "0.9.1") {
    stop("Wrong clustermq version")
  }
  return("All good")
}

library(clustermq)
fun <- function(x) {x}

fun(1)
for (i in 1:100) {
  Q(fun = fun, x = 1:100, n_jobs = 2)
}

However, this does not break anything, so go ahead and submit 👍

luwidmer commented 9 months ago

My guess is the message comes from this line: https://github.com/mschubert/clustermq/blob/a84388ea5279ef2cd26a1d06efa55ad51deba403/R/qsys_lsf.r#L46
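
To illustrate that guess, here is a minimal sketch of how a cleanup step that shells out to a kill command can leak the scheduler's reply onto the console, and how base R's `system()` can silence it. This is not clustermq's actual code: `cleanup_job`, `kill_cmd`, and `quiet` are hypothetical names, and `echo` stands in for an LSF kill command so the sketch runs on a Unix-like machine without a scheduler.

```r
# Hypothetical stand-in for a scheduler cleanup step; NOT clustermq's implementation.
# On LSF, `kill_cmd` would be a kill-by-name command; here `echo` mimics a
# scheduler that replies with a "not found"-style message (assumption).
cleanup_job <- function(job_id, kill_cmd = "echo Job not found:", quiet = TRUE) {
  cmd <- paste(kill_cmd, shQuote(job_id))
  # ignore.stdout / ignore.stderr discard whatever the external command prints,
  # so a reply about an already finished job never reaches the R console
  status <- system(cmd, ignore.stdout = quiet, ignore.stderr = quiet)
  invisible(status)
}

cleanup_job("cmq7581")                 # silent
cleanup_job("cmq7581", quiet = FALSE)  # the command's message is shown
```

If the cleanup runs after the workers have already shut down on their own, the kill command has nothing left to terminate, which would be consistent with the "is not found" and "has already finished" lines above.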

mschubert commented 7 months ago

I believe this is fixed in https://github.com/mschubert/clustermq/commit/f9f87eb3ca1789295bbbbcb63a763677fd4268a7, please reopen if not