Closed ghost closed 3 years ago
Hi! I’m going through and cleaning up old/stale issues on this repo.
Sorry for not responding in a reasonable amount of time!
Feel free to open a new Issue if you are still having this trouble.
In general, the controller IP must be one that is connectable from the engines. It must be an IP address, not a hostname (hostnames are for connecting, not binding). If you use --ip=* for the controller, you may also want to set --location to a hostname you know is connectable, which is used for connections when the bind IP is ambiguous.
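As a rough illustration of the advice above, here is a minimal Python sketch that resolves a hostname to an IP address before passing it to --ip, and optionally adds --location. The helper name controller_args and the hostname in the usage note are hypothetical; only the ipcontroller command and the --ip and --location flags come from this thread.

```python
import socket


def controller_args(bind_host, location=None):
    """Build an ipcontroller argument list that binds to an IP, not a hostname.

    Hypothetical helper for illustration only; it is not part of ipyparallel.
    """
    if bind_host == "*":
        # Bind on all interfaces; the bind address is then ambiguous,
        # so passing a connectable --location is advisable.
        bind_ip = "*"
    else:
        # Resolve a hostname to an IPv4 address (an IP literal is returned as-is).
        bind_ip = socket.gethostbyname(bind_host)
    args = ["ipcontroller", f"--ip={bind_ip}"]
    if location is not None:
        args.append(f"--location={location}")
    return args
```

For example, controller_args("*", location="head-node.example.org") yields the arguments for ipcontroller --ip=* --location=head-node.example.org, where head-node.example.org stands in for whatever name the engines can actually reach.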
environment
OS: RHEL6.7
conda list: see the appendix
env: see the appendix
slurm: 15.08.11
reproduce the error
It will show
So I tried to start ipcontroller manually
analysis
ref: https://stackoverflow.com/questions/29437565/zmq-error-zmqerror-no-such-device
Maybe this is because the ip should be an IP address rather than a hostname. Then I tried this:
However, when I came back to the ipcluster command, ipengine could not find the json files:
So I used inotifywait to monitor the json files:
According to detailed timing and testing, I found that when ipyparallel.controller is loaded, the json files are created. When it is shut down, the json files are deleted.
Then I manually started the ipcontroller and tried to start the ipengines manually:
I do not understand those messages. I am very sure the json files exist and have been kept unmodified since creation, according to the inotifywait records. But the MPI processes cannot find them. It is very interesting.
So I tried to use srun of slurm rather than mpiexec from intel MPI:
I think there is something wrong with the MPI launcher implementation of ipengine.
appendix