mschubert / clustermq

R package to send function calls as jobs on LSF, SGE, Slurm, PBS/Torque, or each via SSH
https://mschubert.github.io/clustermq/
Apache License 2.0

Error in w$poll() : Unexpected peer disconnect #310

Closed: raphaelbetschart closed this issue 9 months ago

raphaelbetschart commented 9 months ago

I'm having issues when I run the example code with multiple jobs (n_jobs = 10). I'm using the following code:

library(clustermq)

options(
  clustermq.scheduler = "slurm"
)

fx = function(x) x * 2

# queue the function call on your scheduler
Q(fx, 
  x = 1:1000, 
  n_jobs = 10, 
  log_worker = TRUE, 
  verbose = TRUE)

Whenever the function is close to finishing (~97%), RStudio crashes and 9 of the 10 worker logs report the following:

Error in w$poll() : Unexpected peer disconnect
Calls: <Anonymous> -> <Anonymous> -> .External
Execution halted

1 of the 10 workers reports:

2023-10-09 10:27:00.208494 | shutting down worker
2023-10-09 10:27:00.208596 |
Total: 241 in 7.29s [user], 0.09s [system], 7.73s [elapsed]

mschubert commented 9 months ago

Hi @raphaelbetschart, thanks for your report!

Does the issue persist if you use the current git version?

remotes::install_github("mschubert/clustermq@master")

raphaelbetschart commented 9 months ago

Hi @mschubert,

Unfortunately, the error persists after installing the latest git version.

mschubert commented 9 months ago

I fixed another bug in https://github.com/mschubert/clustermq/commit/5612364c52f17ba98b241a3f1f7e067c02bad3fe, which may be the cause of this crash as well. Can you confirm whether this now works? (Same git install command as above.)

raphaelbetschart commented 9 months ago

Perfect, it works now! Many thanks.
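
For anyone retracing this thread, here is a minimal sketch of how the fix could be checked after reinstalling from git. It reuses the reproducer from the first comment. It is not part of the original exchange, and it rests on some assumptions: the "multiprocess" scheduler is used only so the check can run locally (it may need the callr package installed), n_jobs = 2 is an arbitrary choice, and the `expected` vector is a hypothetical helper for comparing results. To test against the cluster, set the scheduler back to "slurm".

```r
# Sketch: re-run the reproducer from this issue after reinstalling from git.
# Assumption: "multiprocess" is used only for a local check; use "slurm"
# to test against the actual cluster.
options(clustermq.scheduler = "multiprocess")

fx = function(x) x * 2
expected = (1:1000) * 2  # hypothetical helper: the values Q() should return

if (requireNamespace("clustermq", quietly = TRUE)) {
  # Report which version is actually installed after the reinstall
  message("clustermq version: ",
          as.character(utils::packageVersion("clustermq")))

  res = clustermq::Q(fx, x = 1:1000, n_jobs = 2)

  # All 1000 calls should return, with no worker disconnect along the way
  stopifnot(isTRUE(all.equal(unlist(res, use.names = FALSE), expected)))
}
```

If the final `stopifnot()` passes without the session crashing, all calls completed. Running with log_worker = TRUE, as in the original report, additionally lets you check the worker logs for the "Unexpected peer disconnect" message.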