Closed: wlandau closed this issue 6 years ago
Can you try this with clustermq only and post the worker log file?
Q(..., template=list(log_file="..."))
options(
clustermq.scheduler = "sge",
clustermq.template = "sge_clustermq.tmpl"
)
library(clustermq)
f <- function(i) {
  # fan the work out over 4 forked processes inside the worker
  parallel::mclapply(1:4 + i, sqrt, mc.cores = 4)
}
Q(f, 1:8, n_jobs = 8, template = list(log_file = "log.txt"))
#> Submitting 8 worker jobs (ID: 6424) ...
#> Running 8 calculations (1 calls/chunk) ...
#> [===============================>--------------------] 62% (4/4 wrk) eta: 6s
At this point, the work hung, so I sent SIGINT with CTRL-C.
^CError in rzmq::poll.socket(list(private$socket), list("read"), timeout = msec) :
  The operation was interrupted by delivery of a signal before any events were available.
Calls: Q ... master -> <Anonymous> -> <Anonymous> -> <Anonymous>
^CExecution halted
Log file:
> clustermq:::worker("tcp://CLUSTER-LOGIN-NODE:6424")
Master: tcp://CLUSTER-LOGIN-NODE:6424
WORKER_UP to: tcp://CLUSTER-LOGIN-NODE:6424
> DO_SETUP (0.000s wait)
token from msg: ubust
> WORKER_STOP (0.000s wait)
shutting down worker
Total: 0 in 0.00s [user], 0.00s [system], 0.01s [elapsed]
Thank you, I could reproduce this now. It was caused by poll.socket() (from the mschubert:rzmq/signal branch) returning NULL on a non-critical interrupt, which the worker did not handle properly.
Fixed on my end. Thanks very much.
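For the record, here is a minimal sketch of the kind of handling this implies. It is only an illustration of the idea, not the actual clustermq worker code, and poll_until_ready() is a hypothetical helper:

# Illustration only: retry the poll when a non-critical signal interrupts it.
poll_until_ready <- function(socket, timeout = 500L) {
  repeat {
    events <- rzmq::poll.socket(list(socket), list("read"), timeout = timeout)
    if (!is.null(events)) {
      return(events) # a genuine poll result, possibly with no ready events
    }
    # NULL means the poll was interrupted by a non-critical signal,
    # so poll again instead of treating it as an incoming message.
  }
}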
I suspect this is related to #99, but it is an important use case, so I thought I should post something for the record. Feel free to close if you think R-devel already fixed it.
The following little drake pipeline sends jobs to an SGE cluster, and each job uses mclapply() to parallelize its own work. It hangs when mc.cores is greater than 1, and it completes normally (and very quickly) when mc.cores equals 1. I am using https://github.com/ropensci/drake/commit/c6395ee129112d1bdc71b45d1362d4eb5d13ca86 and https://github.com/mschubert/clustermq/commit/ecfdb9da9870b6434f6bf689da6a3cbb94f38a2f. Other session info is here. The template file makes sure each job gets 4 cores.
We get pretty far along in the workflow, but it hangs before starting x_8. qstat shows that some, but not all, of the workers are still running.