mschubert / clustermq

R package to send function calls as jobs on LSF, SGE, Slurm, PBS/Torque, or each via SSH
https://mschubert.github.io/clustermq/
Apache License 2.0

Firewall settings? #215

Closed HenrikBengtsson closed 3 years ago

HenrikBengtsson commented 3 years ago

Hi, I'm trying out clustermq on a Slurm cluster. Using a simple example, the job launches, but then nothing happens.

AFAIU, it is ZeroMQ that fails to set up working communication between the main R session and the worker(s) launched on the cluster's compute nodes. BTW, I've verified that ZeroMQ works on localhost using the rzmq main-worker example from https://cran.r-project.org/web/packages/rzmq/readme/README.html. As soon as I attempt to run the same example across two hosts, it blocks.

I suspect I need to request to have the ZeroMQ protocol opened up on the cluster for clustermq to be able to run on this cluster. I have near-zero experience with ZeroMQ - is this a matter of opening up TCP ports in the firewall? If so, is there a standard range that ZeroMQ uses?

mschubert commented 3 years ago

As far as network connections are concerned, ZeroMQ behaves just like regular sockets. The node from which you call Q needs to be able to accept incoming connections at the host name returned by Sys.info()["nodename"], on the port that matches the clustermq ID (printed when the job starts, between 6000 and 9999).
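A quick way to check this from a compute node is to try opening a plain TCP connection to the submitting node. The following is only a minimal sketch using base R; `can_connect` and `port_in_clustermq_range` are hypothetical helper names, not part of clustermq, and a plain socket test only shows whether the port is reachable while a master is listening on it.

```r
# Hypothetical helper (not part of clustermq): returns TRUE if a TCP
# connection to host:port can be opened from the machine running this code.
can_connect <- function(host, port, timeout = 5) {
  con <- tryCatch(
    suppressWarnings(
      socketConnection(host = host, port = port, server = FALSE,
                       blocking = TRUE, open = "r+", timeout = timeout)
    ),
    error = function(e) NULL  # refused, filtered, or timed out
  )
  if (is.null(con)) return(FALSE)
  close(con)
  TRUE
}

# Port range quoted in this thread for clustermq IDs
port_in_clustermq_range <- function(port) port >= 6000 & port <= 9999

# On the submitting node: the host name that workers will try to reach
master_host <- unname(Sys.info()["nodename"])
```

On a compute node, while the master is waiting, one could then run something like `can_connect("gl-login1.arc-ts.umich.edu", 6134)` with the host and port shown in the worker log. A result of `FALSE` would suggest that a firewall or the wrong network interface is in the way.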

I've seen cases where a particular network interface blocks these connections, so it may be possible to solve this by setting the option clustermq.host=<network interface>.
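As a rough illustration of the suggestion above, the option can be set with `options()` in the R session that will call Q, before any jobs are submitted. The interface name `"eth0"` is only a placeholder assumption; the right value depends on which interface the compute nodes can reach on your cluster.

```r
# Set in the main R session (or in ~/.Rprofile) before calling Q.
# "eth0" is a placeholder; substitute the interface or host name that
# the compute nodes can actually connect to.
options(
  clustermq.scheduler = "slurm",
  clustermq.host = "eth0"
)

# Confirm what the session will use
chosen_host <- getOption("clustermq.host")
print(chosen_host)

# Then submit as usual, for example:
# clustermq::Q(function(x) x * 2, x = 1:3, n_jobs = 1)
```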

mhesselbarth commented 3 years ago

Can you please explain this a bit further? I think I am seeing a similar problem, but setting the clustermq.host option on my local machine just results in an error.

(Worker log without setting clustermq.host)

> clustermq:::worker("tcp://gl-login1.arc-ts.umich.edu:6134")
2020-11-05 07:48:12.548437 | Master: tcp://gl-login1.arc-ts.umich.edu:6134
2020-11-05 07:48:12.561010 | WORKER_UP to: tcp://gl-login1.arc-ts.umich.edu:6134
Error in clustermq:::worker("tcp://gl-login1.arc-ts.umich.edu:6134") : 
  Timeout reached, terminating
Execution halted

and the console freezes at: Running 1 calculations (0 objs/0 Mb common; 1 calls/chunk) ...

mschubert commented 3 years ago

I assume this resolves the initial question, @HenrikBengtsson, so I'm closing this.

@mhesselbarth If your problem persists, please open a separate issue with a more detailed description of what you are trying to do and where it fails.