mllg / batchtools

Tools for computation on batch systems
https://mllg.github.io/batchtools/
GNU Lesser General Public License v3.0

getStatus() running jobs in expired #169

Closed smilesun closed 5 years ago

smilesun commented 6 years ago

Hi, I do not quite understand the following code. Each time I submit jobs, using either Slurm or makeClusterFunctionsMulticore(), running jobs are always shown as expired.

getStatusTable = function(ids = NULL, batch.ids = getBatchIds(reg = reg), reg = getDefaultRegistry()) {
  submitted = started = done = error = status = NULL
  stats = merge(filter(reg$status, ids), batch.ids, by = "batch.id", all.x = TRUE, all.y = FALSE, sort = FALSE)[, list(
    defined   = .N,
    submitted = count(submitted),
    started   = sum(!is.na(started) | !is.na(status) & status == "running"),
    done      = count(done),
    error     = count(error),
    queued    = sum(status == "queued", na.rm = TRUE),
    running   = sum(status == "running", na.rm = TRUE),
    expired   = sum(!is.na(submitted) & is.na(done) & is.na(status))
  )]
  stats$done = stats$done - stats$error
  stats$system = stats$queued + stats$running
  return(stats)
}

If I debug this function and print reg$status, it shows:

 job.id def.id  submitted    started       done
1:      1      1 1517414293         NA         NA
2:      2      2 1517414293         NA         NA
3:      3      3 1517414293 1517414293 1517414296
4:      4      4 1517414293         NA         NA
                                                                             error
1:                                                                              NA
2:                                                                              NA
3: Error in doBasicGenDesignChecks(par.set) : \n  Finite box constraints required!
4:                                                                              NA
   mem.used resource.id batch.id log.file                            job.hash
1:       NA           1     8183       NA job12b4402373d2546b8fd1e4a1f03830b0
2:       NA           1     8184       NA job1508e2411e82eb19f412e00f88f1a1c6
3:    167.9           1     8185       NA job2f5b64838863b3effee39ba5fd4575ee
4:       NA           1     8186       NA job876192b6a3c3fb05a04c6cbf607a8357
   job.name repl
1:       NA    1
2:       NA    1
3:       NA    1
4:       NA    1

In the following line, I do not understand where the status in is.na(status) comes from:

expired   = sum(!is.na(submitted) & is.na(done) & is.na(status))

For the merged table res = merge(filter(reg$status, ids), batch.ids, by = "batch.id", all.x = TRUE, all.y = FALSE, sort = FALSE)

it shows:

res[1:4]
   batch.id job.id def.id  submitted    started       done
1:     8183      1      1 1517414293         NA         NA
2:     8184      2      2 1517414293         NA         NA
3:     8185      3      3 1517414293 1517414293 1517414296
4:     8186      4      4 1517414293         NA         NA
                                                                             error
1:                                                                              NA
2:                                                                              NA
3: Error in doBasicGenDesignChecks(par.set) : \n  Finite box constraints required!
4:                                                                              NA
   mem.used resource.id log.file                            job.hash job.name
1:       NA           1       NA job12b4402373d2546b8fd1e4a1f03830b0       NA
2:       NA           1       NA job1508e2411e82eb19f412e00f88f1a1c6       NA
3:    167.9           1       NA job2f5b64838863b3effee39ba5fd4575ee       NA
4:       NA           1       NA job876192b6a3c3fb05a04c6cbf607a8357       NA
   repl status
1:    1     NA
2:    1     NA
3:    1     NA
4:    1     NA

mllg commented 6 years ago

The column status is part of the data.table batch.ids, which is returned by getBatchIds(). getBatchIds() internally queries the scheduler for running jobs, and that query seems to be failing in your case. You could try debugging

reg$cluster.functions$listJobsRunning(reg)

which should return process IDs in multicore mode.

The jobs in the last table are detected as expired because:

  1. They have been submitted.
  2. They are not done (so they should be queued or running).
  3. Their status is NA, i.e. their batch.id was not returned when the scheduler was queried for queued and running jobs.
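
The three conditions above can be reproduced on a toy table. The following is a minimal base R sketch, not the batchtools implementation: the data frames status_tab and batch_ids are hypothetical stand-ins for reg$status and the result of getBatchIds(), and all values in them are made up for illustration.

```r
# Toy stand-in for reg$status (hypothetical values).
status_tab = data.frame(
  job.id    = 1:4,
  batch.id  = c("8183", "8184", "8185", "8186"),
  submitted = c(1517414293, 1517414293, 1517414293, NA),
  done      = c(NA, NA, 1517414296, NA)
)

# Toy stand-in for getBatchIds(): assume the scheduler only reports
# batch.id 8183, and reports it as running.
batch_ids = data.frame(batch.id = "8183", status = "running")

# Left join: jobs the scheduler does not know about get status = NA.
res = merge(status_tab, batch_ids, by = "batch.id", all.x = TRUE, all.y = FALSE)
res = res[order(res$job.id), ]

# Submitted, not done, and not reported by the scheduler.
res$expired = !is.na(res$submitted) & is.na(res$done) & is.na(res$status)

print(res[, c("job.id", "submitted", "done", "status", "expired")])
```

In this sketch only job 2 ends up flagged: job 1 is still known to the scheduler, job 3 is done, and job 4 was never submitted. If the scheduler query returns nothing at all, every submitted but unfinished job is flagged, which matches the behaviour described in the question.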

tdhock commented 5 years ago

For me the expired jobs seem to happen when there is not enough RAM allocated per job -- increasing the memory resources seems to fix the issue.

mllg commented 5 years ago

> For me the expired jobs seem to happen when there is not enough RAM allocated per job -- increasing the memory resources seems to fix the issue.

The reported status then seems correct to me. The jobs started, but did not communicate back any results because they were killed by the scheduler as they consumed too much memory.

A job counts as expired if and only if it has been submitted, has not terminated (terminated meaning that its results or errors were written to the file system), AND is no longer on the system.

Also note that jobs may appear expired only temporarily. On some systems under heavy load, it can take more than 60 seconds after a job terminates before the files it wrote become visible on the master. In these cases, you just need to be patient.
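
As a rough sketch of how one might inspect and resubmit expired jobs, the snippet below uses the exported batchtools functions getStatus(), findExpired() and submitJobs() on a temporary registry. It assumes that no batchtools configuration file is picked up, so the jobs simply run in the current session and nothing is expected to expire here. The resource name memory and the value 4096 are assumptions for illustration only; which resource names are valid depends on your scheduler template.

```r
library(batchtools)

# Temporary registry (file.dir = NA). Assuming no configuration file is
# found, the jobs run in the current R session.
reg = makeRegistry(file.dir = NA, seed = 1)
batchMap(function(x) x^2, x = 1:3, reg = reg)
submitJobs(reg = reg)
waitForJobs(reg = reg)

# Summary counts, including the "expired" column discussed above.
print(getStatus(reg = reg))

# Job IDs currently classified as expired.
expired = findExpired(reg = reg)

if (nrow(expired) > 0L) {
  # ASSUMPTION: `memory = 4096` is illustrative; use the resource names
  # that your own scheduler template understands.
  submitJobs(expired, resources = list(memory = 4096), reg = reg)
}
```

On a real cluster it may be worth waiting a little and calling findExpired() again before resubmitting, since jobs can look expired for a short time while their result files are still propagating.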

Hope this helps.

tdhock commented 5 years ago

Maybe it is a documentation issue. I did not understand what expired means, but your description helps a lot. I would suggest adding a definition of "expired" to https://mllg.github.io/batchtools/articles/batchtools.html, which currently does not explain it at all.