mschubert / clustermq

R package to send function calls as jobs on LSF, SGE, Slurm, PBS/Torque, or each via SSH
https://mschubert.github.io/clustermq/
Apache License 2.0
146 stars 27 forks

0.9.0+ crashing on AWS parallel cluster with slurm & several hundred cores #324

Closed kkmann closed 9 months ago

kkmann commented 9 months ago

Hi,

I want to report an issue with {clustermq} on an AWS parallel cluster when trying to submit several hundred jobs. The initiating R session just crashes.

Up to and including 0.8.95 everything works perfectly smoothly; starting with 0.9.0 I see this issue.

options(
  clustermq.scheduler = "slurm"
)

f <- function(x) {
  # pretend to do something cool
  Sys.sleep(5)
  return(x)
}

x <- seq(1, 1e5)
clustermq::Q(f, x = x, n_jobs = 3000, job_size = 1)

With the newer versions of {clustermq}, around 10 jobs still work fine. How would I best diagnose this further, and are there any pointers as to which changes in 0.9.0 might be the most likely cause of these issues?

mschubert commented 9 months ago

Can you please test the current versions on CRAN (0.9.2) and Github (0.9.2.9000) and see if the issue persists?

There were some problems with 0.9.0 and 0.9.1 (mentioned in NEWS) that are hopefully already resolved.

kkmann commented 9 months ago

It does. All jobs are running on the Slurm side (3k), but the R session gets stuck at far fewer (4xx/3000 wrk) and crashes. This is with clustermq straight from GitHub main as of today.

mschubert commented 9 months ago

I've tested this on our SGE and can confirm a problem with 100 or more parallel jobs, with a couple of possible error messages:

res = clustermq::Q(function(x) { Sys.sleep(5); x }, x=1:100, job_size=1)

# Error in m[, field] : incorrect number of dimensions
# In addition: Warning message:
# In (function (..., deparse.level = 1)  :
#   number of columns of result is not a multiple of vector length (arg 1)

# Error in (function (..., deparse.level = 1)  :
#   unimplemented type 'char' in 'eval'
# Error: no more error handlers available (recursive errors?); invoking 'abort' restart

# Error in (function (..., deparse.level = 1)  :
#   unimplemented type 'char' in 'eval'
#  *** caught segfault ***
# address (nil), cause 'unknown'
# Traceback:
#  1: (function (..., deparse.level = 1) .Internal(rbind(deparse.level, ...)))(c(user.self = 1.114, [...]
#  2: do.call(rbind, info$time)
#  3: self$info()
#  4: pool$cleanup()
#  5: master(pool = workers, iter = df, rettype [...]
#  6: Q_rows(fun = fun, df = df,  [...]
#  7: clustermq::Q(function(x) {    Sys.sleep(5)    x}, x = 1:100, job_size = 1)

This is likely caused by insufficient PROTECTion of SEXP objects created via the R C API.

I've pushed a partial fix and workaround to the issue-324 branch, which fixes the issues I observed on our system for up to 1000 parallel jobs (and likely more).

mschubert commented 9 months ago

This should now be fixed in the current Github master and will be included in the next CRAN release.

Please reopen if this is not solved on your end.

kkmann commented 9 months ago

Hi Michael, thanks for looking into this so quickly - much appreciated! I can confirm that the issue no longer occurs with the most current master and the example code from above.

mschubert commented 9 months ago

Great, thanks for confirming!