ipython / ipyparallel

IPython Parallel: Interactive Parallel Computing in Python
https://ipyparallel.readthedocs.io/

Something wrong with MPI launcher of `ipengine`. #344

Closed: ghost closed this issue 3 years ago

ghost commented 6 years ago

environment

OS: RHEL 6.7
conda list: see the appendix
env: see the appendix
slurm: 15.08.11

reproduce the error

$ salloc -p general -N2 -n4 # for testing purpose
$ ipcluster start -n $SLURM_NTASKS --engines=MPIEngineSetLauncher --profile=my_cluster --ip=$(hostname -s) --debug -- --sqlitedb

It shows:

2018-10-27 20:37:23.071 [IPClusterStart] Starting LocalControllerLauncher: ['/home/dic17007/miniconda3/bin/python', '-m', 'ipyparallel.controller', '--profile-dir', '/home/dic17007/.ipython/profile_my_cluster', '--cluster-id', '', '--log-level=20', '--ip=cn01', '--sqlitedb']
2018-10-27 20:37:23.076 [IPClusterStart] Process '/home/dic17007/miniconda3/bin/python' started: 19969
2018-10-27 20:37:24.051 [IPClusterStart] 2018-10-27 20:37:24.051 [IPControllerApp] Using existing profile dir: '/home/dic17007/.ipython/profile_my_cluster'
2018-10-27 20:37:24.069 [IPClusterStart] 2018-10-27 20:37:24.069 [IPControllerApp] ERROR | Couldn't construct the Controller

So I tried to start ipcontroller manually:

$ /home/dic17007/miniconda3/bin/python -m ipyparallel.controller --profile-dir /home/dic17007/.ipython/profile_my_cluster --cluster-id --log-level=20 --ip=cn01 --sqlitedb
# __main__.py: error: argument --cluster-id: expected one argument
# so I tried
$ /home/dic17007/miniconda3/bin/python -m ipyparallel.controller --profile-dir /home/dic17007/.ipython/profile_my_cluster --log-level=20 --ip=cn01 --sqlitedb
2018-10-27 20:47:55.183 [IPControllerApp] Using existing profile dir: '/home/dic17007/.ipython/profile_my_cluster'
2018-10-27 20:47:55.216 [IPControllerApp] ERROR | Couldn't construct the Controller
Traceback (most recent call last):
  File "/home/dic17007/miniconda3/lib/python3.6/site-packages/ipyparallel/apps/ipcontrollerapp.py", line 301, in init_hub
    self.factory.init_hub()
  File "/home/dic17007/miniconda3/lib/python3.6/site-packages/ipyparallel/controller/hub.py", line 291, in init_hub
    q.bind(self.client_url('registration'))
  File "zmq/backend/cython/socket.pyx", line 547, in zmq.backend.cython.socket.Socket.bind
  File "zmq/backend/cython/checkrc.pxd", line 25, in zmq.backend.cython.checkrc._check_rc
zmq.error.ZMQError: No such device
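
A side note on the `--cluster-id: expected one argument` error above: it looks like a copy-paste artifact rather than a separate bug. The logged argument list contains an empty string right after `--cluster-id`, and that empty argument disappears when the list is retyped in a shell without quotes. A minimal sketch using the standard `shlex` module (assumes Python 3.8+ for `shlex.join`; the argument list is shortened from the log and the interpreter path is replaced by plain `python`):

```python
import shlex

# Argument list shortened from the IPClusterStart log line above.
argv = [
    "python", "-m", "ipyparallel.controller",
    "--cluster-id", "",
    "--log-level=20", "--ip=cn01", "--sqlitedb",
]

# shlex.join quotes each argument for a POSIX shell, so the empty
# cluster id survives as '' instead of vanishing.
command = shlex.join(argv)
print(command)
```

The printed command keeps `--cluster-id ''`, which is the form to paste into a shell when reproducing a logged launcher command by hand.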

analysis

Ref: https://stackoverflow.com/questions/29437565/zmq-error-zmqerror-no-such-device

Maybe this happens because --ip should be an IP address rather than a hostname. So I tried this:
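
A quick way to check that guess before launching anything is to test whether the value parses as an IP address. The sketch below uses the standard `ipaddress` module; `is_ip_literal` is a hypothetical helper name, not part of ipyparallel. Note that the wildcard `*` is not an IP literal either, so it has to be treated as a separate, special case.

```python
import ipaddress


def is_ip_literal(value):
    """Return True if value parses as an IPv4 or IPv6 address."""
    try:
        ipaddress.ip_address(value)
    except ValueError:
        return False
    return True


print(is_ip_literal("cn01"))           # hostname from the failing command → False
print(is_ip_literal("192.168.100.1"))  # a dotted IPv4 address → True
```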

$ /home/dic17007/miniconda3/bin/python -m ipyparallel.controller --profile-dir /home/dic17007/.ipython/profile_my_cluster --log-level=20 --ip=* --sqlitedb
2018-10-27 20:51:16.071 [IPControllerApp] Using existing profile dir: '/home/dic17007/.ipython/profile_my_cluster'
2018-10-27 20:51:16.093 [IPControllerApp] Hub listening on tcp://*:59958 for registration.
2018-10-27 20:51:16.095 [IPControllerApp] Hub using DB backend: 'SQLiteDB'
2018-10-27 20:51:16.369 [IPControllerApp] hub::created hub
2018-10-27 20:51:16.369 [IPControllerApp] writing connection info to /home/dic17007/.ipython/profile_my_cluster/security/ipcontroller-client.json
2018-10-27 20:51:16.371 [IPControllerApp] writing connection info to /home/dic17007/.ipython/profile_my_cluster/security/ipcontroller-engine.json
2018-10-27 20:51:16.372 [IPControllerApp] task::using Python leastload Task scheduler
2018-10-27 20:51:16.372 [IPControllerApp] Heartmonitor started
2018-10-27 20:51:16.382 [IPControllerApp] Creating pid file: /home/dic17007/.ipython/profile_my_cluster/pid/ipcontroller.pid
2018-10-27 20:51:16.407 [scheduler] Scheduler started [leastload]
2018-10-27 20:51:16.409 [IPControllerApp] client::client b'\x00k\x8bEg' requested 'connection_request'
2018-10-27 20:51:16.410 [IPControllerApp] client::client [b'\x00k\x8bEg'] connected
# it works!

However, when I went back to the ipcluster command, ipengine could not find the JSON connection files:

$ ipcluster start -n $SLURM_NTASKS --debug  --profile=my_cluster --engines=MPIEngineSetLauncher  --cluster-id='hello' --delay=10 --engines=MPIEngineSetLauncher  --ip=* --location=$(hostname -s) -- --sqlitedb
2018-10-27 21:02:48.374 [IPClusterStart] Starting 8 Engines with MPIEngineSetLauncher
2018-10-27 21:02:48.374 [IPClusterStart] Starting MPIEngineSetLauncher: ['mpiexec', '-n', '8', '/home/dic17007/miniconda3/bin/python', '-m', 'ipyparallel.engine', '--mpi', '--profile-dir', '/home/dic17007/.ipython/profile_my_cluster', '--cluster-id', 'hello', '--log-level=20']
2018-10-27 21:02:48.381 [IPClusterStart] Process 'mpiexec' started: 21129
2018-10-27 21:02:49.285 [IPClusterStart] 2018-10-27 21:02:49.285 [IPEngineApp] Using existing profile dir: '/home/dic17007/.ipython/profile_my_cluster'
2018-10-27 21:02:49.292 [IPClusterStart] 2018-10-27 21:02:49.291 [IPEngineApp] Loading url_file '/home/dic17007/.ipython/profile_my_cluster/security/ipcontroller-hello-engine.json'
2018-10-27 21:02:49.304 [IPClusterStart] 2018-10-27 21:02:49.302 [IPEngineApp] Registering with controller at tcp://192.168.100.1:56848
2018-10-27 21:02:49.306 [IPClusterStart] 2018-10-27 21:02:49.305 [IPEngineApp] Using existing profile dir: '/home/dic17007/.ipython/profile_my_cluster'
2018-10-27 21:02:49.306 [IPClusterStart] 2018-10-27 21:02:49.305 [IPEngineApp] Using existing profile dir: '/home/dic17007/.ipython/profile_my_cluster'
2018-10-27 21:02:49.306 [IPClusterStart] 2018-10-27 21:02:49.305 [IPEngineApp] Using existing profile dir: '/home/dic17007/.ipython/profile_my_cluster'
2018-10-27 21:02:49.306 [IPClusterStart] 2018-10-27 21:02:49.305 [IPEngineApp] Using existing profile dir: '/home/dic17007/.ipython/profile_my_cluster'
2018-10-27 21:02:49.307 [IPClusterStart] 2018-10-27 21:02:49.305 [IPEngineApp] Using existing profile dir: '/home/dic17007/.ipython/profile_my_cluster'
2018-10-27 21:02:49.311 [IPClusterStart] 2018-10-27 21:02:49.305 [IPEngineApp] Using existing profile dir: '/home/dic17007/.ipython/profile_my_cluster'
2018-10-27 21:02:49.311 [IPClusterStart] 2018-10-27 21:02:49.310 [IPEngineApp] Loading url_file '/home/dic17007/.ipython/profile_my_cluster/security/ipcontroller-hello-engine.json'
2018-10-27 21:02:49.312 [IPClusterStart] 2018-10-27 21:02:49.310 [IPEngineApp] Loading url_file '/home/dic17007/.ipython/profile_my_cluster/security/ipcontroller-hello-engine.json'
2018-10-27 21:02:49.312 [IPClusterStart] 2018-10-27 21:02:49.310 [IPEngineApp] Loading url_file '/home/dic17007/.ipython/profile_my_cluster/security/ipcontroller-hello-engine.json'
2018-10-27 21:02:49.312 [IPClusterStart] 2018-10-27 21:02:49.310 [IPEngineApp] Loading url_file '/home/dic17007/.ipython/profile_my_cluster/security/ipcontroller-hello-engine.json'
2018-10-27 21:02:49.312 [IPClusterStart] 2018-10-27 21:02:49.310 [IPEngineApp] Loading url_file '/home/dic17007/.ipython/profile_my_cluster/security/ipcontroller-hello-engine.json'
2018-10-27 21:02:49.312 [IPClusterStart] 2018-10-27 21:02:49.310 [IPEngineApp] Loading url_file '/home/dic17007/.ipython/profile_my_cluster/security/ipcontroller-hello-engine.json'
2018-10-27 21:02:49.312 [IPClusterStart] python: error: _get_addr: No such file or directory
2018-10-27 21:02:49.328 [IPClusterStart] python: error: _get_addr: No error
2018-10-27 21:02:49.328 [IPClusterStart] Process 'mpiexec' stopped: {'exit_code': 15, 'pid': 21129}
2018-10-27 21:02:49.328 [IPClusterStart] ERROR |
            Engines shutdown early, they probably failed to connect.

            Check the engine log files for output.

            If your controller and engines are not on the same machine, you probably
            have to instruct the controller to listen on an interface other than localhost.

            You can set this by adding "--ip='*'" to your ControllerLauncher.controller_args.

            Be sure to read our security docs before instructing your controller to listen on
            a public interface.

2018-10-27 21:02:49.328 [IPClusterStart] ERROR | IPython cluster: stopping
2018-10-27 21:02:49.329 [IPClusterStart] 2018-10-27 21:02:49.328 [IPControllerApp] CRITICAL | Received signal 2, shutting down
2018-10-27 21:02:49.329 [IPClusterStart] 2018-10-27 21:02:49.329 [IPControllerApp] CRITICAL | terminating children...
2018-10-27 21:02:49.408 [IPClusterStart] Process '/home/dic17007/miniconda3/bin/python' stopped: {'exit_code': 0, 'pid': 21094}
2018-10-27 21:02:52.330 [IPClusterStart] Removing pid file: /home/dic17007/.ipython/profile_my_cluster/pid/ipcluster-hello.pid

So I used inotifywait to monitor the JSON files:

$ inotifywait --monitor -r /home/dic17007/.ipython/profile_my_cluster/security/ -e create,delete > json_file_events.log 2>&1 &
$ ipcluster start -n $SLURM_NTASKS --debug  --profile=my_cluster --engines=MPIEngineSetLauncher  --cluster-id='hello' --delay=10 --engines=MPIEngineSetLauncher  --ip=* --location=$(hostname -s) -- --sqlitedb 
# ...
$ cat json_file_events.log
Setting up watches.  Beware: since -r was given, this may take a while!
Watches established.
/home/dic17007/.ipython/profile_my_cluster/security/ CREATE ipcontroller-hello-client.json
/home/dic17007/.ipython/profile_my_cluster/security/ CREATE ipcontroller-hello-engine.json
/home/dic17007/.ipython/profile_my_cluster/security/ DELETE ipcontroller-hello-client.json
/home/dic17007/.ipython/profile_my_cluster/security/ DELETE ipcontroller-hello-engine.json

From the timing and further testing, I found that the JSON files are created when ipyparallel.controller starts and deleted when it shuts down.
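
Because the files only exist while the controller is running, anything that starts engines by hand has to wait for them first. A minimal polling sketch, assuming only that the connection file is JSON (as its name suggests); `wait_for_connection_file` is a hypothetical helper, not an ipyparallel API, and it makes no assumption about which keys the file contains:

```python
import json
import time


def wait_for_connection_file(path, timeout=30.0, poll=0.5):
    """Poll until path exists and holds valid JSON, then return the parsed dict.

    Raises TimeoutError if nothing usable appears within timeout seconds.
    """
    deadline = time.monotonic() + timeout
    while True:
        try:
            with open(path) as f:
                return json.load(f)
        except (FileNotFoundError, json.JSONDecodeError):
            # Not written yet, or caught mid-write: retry until the deadline.
            if time.monotonic() >= deadline:
                raise TimeoutError(f"{path} did not appear within {timeout}s")
            time.sleep(poll)
```

For the runs above, `path` would be the `ipcontroller-hello-engine.json` file under the profile's `security/` directory.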

I then started ipcontroller manually and tried to start the engines manually as well:

$ mpiexec -n 8 /home/dic17007/miniconda3/bin/python -m ipyparallel.engine --mpi --profile-dir /home/dic17007/.ipython/profile_my_cluster --cluster-id hello --log-level=20
# ...
2018-10-27 22:01:02.210 [IPEngineApp] Registering with controller at tcp://192.168.100.1:56322
python: error: _get_addr: No such file or directory
python: error: _get_addr: No error
Fatal error in PMPI_Init_thread: Other MPI error, error stack:
MPIR_Init_thread(799).....: fail failed
MPID_Init(1769)...........: channel initialization failed
MPIDI_CH3_Init(126).......: fail failed
MPID_nem_init_ckpt(836)...: fail failed
MPIDI_CH3I_Seg_commit(422): PMI_Barrier returned -1
In: PMI_Abort(69777679, Fatal error in PMPI_Init_thread: Other MPI error, error stack:
MPIR_Init_thread(799).....: fail failed
MPID_Init(1769)...........: channel initialization failed
MPIDI_CH3_Init(126).......: fail failed
MPID_nem_init_ckpt(836)...: fail failed
MPIDI_CH3I_Seg_commit(422): PMI_Barrier returned -1)
# ...

I do not understand these messages. According to the inotifywait records, the JSON files exist and have not been modified since they were created, yet the MPI processes cannot find them. It is very interesting.

So I tried Slurm's srun rather than mpiexec from Intel MPI:

$ srun /home/dic17007/miniconda3/bin/python -m ipyparallel.engine --mpi --profile-dir /home/dic17007/.ipython/profile_my_cluster --cluster-id hello --log-level=20 &
# ... everything is OK now.

I think there is something wrong with the MPI launcher implementation of ipengine.

appendix

# packages in environment at /home/dic17007/miniconda3:
#
# Name                    Version                   Build  Channel
ampl-mp                   3.1.0                h26a2512_1    conda-forge
asn1crypto                0.23.0           py36h4639342_0    defaults
attrs                     17.4.0                   py36_0    anaconda
beautifulsoup4            4.6.0                    py36_0    conda-forge
blas                      1.1                    openblas    conda-forge
bleach                    2.1.3                    py36_0    defaults
boost                     1.67.0           py36h3e44d54_0    conda-forge
boost-cpp                 1.67.0               h3a22d5f_0    conda-forge
bzip2                     1.0.6                h470a237_2    conda-forge
ca-certificates           2018.03.07                    0    defaults
Cantera                   2.4.0a2                   <pip>
cerberus                  1.1                      py36_0    conda-forge
certifi                   2018.10.15               py36_0    defaults
cffi                      1.11.2           py36h2825082_0    defaults
chardet                   3.0.4            py36h0f667ec_1    defaults
cloudpickle               0.5.5                      py_0    conda-forge
conda                     4.5.11                   py36_0    defaults
conda-build               3.4.1                    py36_0    defaults
conda-env                 2.6.0                h36134e3_1    defaults
conda-verify              2.0.0            py36h98955d8_0    defaults
cryptography              2.1.4            py36hd09be54_0    defaults
cycler                    0.10.0           py36h93f1223_0    defaults
cython                    0.28.3           py36h14c3975_0    defaults
dbus                      1.13.2               h714fa37_1    defaults
decorator                 4.2.1                    py36_0    defaults
eigen                     3.3.5                hfc679d8_1    conda-forge
entrypoints               0.2.3            py36h1aec115_2    defaults
expat                     2.2.5                he0dffb1_0    defaults
filelock                  2.0.13           py36h646ffb5_0    defaults
flake8                    3.5.0                    py36_1    defaults
fontconfig                2.13.0               h9420a91_0    defaults
freetype                  2.9.1                h8a8886c_1    defaults
glib                      2.56.2               hd408876_0    defaults
glob2                     0.6              py36he249c77_0    defaults
gmp                       6.1.2                h6c8ec71_1    defaults
gst-plugins-base          1.14.0               hbbd80ab_1    defaults
gstreamer                 1.14.0               hb453b48_1    defaults
habanero                  0.6.0                      py_0    conda-forge
html5lib                  1.0.1            py36h2f9c1c0_0    defaults
icu                       58.2                 h9c2bf20_1    defaults
idna                      2.6              py36h82fb2a8_1    defaults
intel-openmp              2018.0.0             hc7b2577_8    defaults
ipopt                     3.12.10         blas_openblash1727795_0  [blas_openblas]  conda-forge
ipykernel                 4.8.2                    py36_0    defaults
ipyparallel               6.2.2                    py36_0    defaults
ipython                   6.2.1            py36h88c514a_1    defaults
ipython_genutils          0.2.0            py36hb52b0d5_0    defaults
ipywidgets                7.2.1                    py36_0    defaults
itchat                    1.3.10                    <pip>
jedi                      0.11.1                   py36_0    defaults
jinja2                    2.10             py36ha16c418_0    defaults
jpeg                      9b                   h024ee3a_2    defaults
jsonschema                2.6.0            py36h006f8b5_0    defaults
jupyter                   1.0.0                    py36_7    defaults
jupyter_client            5.2.3                    py36_0    defaults
jupyter_console           5.2.0            py36he59e554_1    defaults
jupyter_core              4.4.0            py36h7c827e3_0    defaults
kiwisolver                1.0.1            py36hf484d3e_0    defaults
libedit                   3.1                  heed3624_0    defaults
libffi                    3.2.1                hd88cf55_4    defaults
libgcc                    7.2.0                h69d50b8_2    defaults
libgcc-ng                 8.2.0                hdf63c60_1    defaults
libgfortran               3.0.0                         1    defaults
libgfortran-ng            7.2.0                hdf63c60_3    defaults
libopenblas               0.3.3                h5a2b251_3    defaults
libpng                    1.6.35               hbc83047_0    defaults
libsodium                 1.0.16               h1bed415_0    defaults
libstdcxx-ng              8.2.0                hdf63c60_1    defaults
libuuid                   1.0.3                h1bed415_2    defaults
libxcb                    1.13                 h1bed415_1    defaults
libxml2                   2.9.8                hf84eae3_0    defaults
mako                      1.0.7                    py36_0    defaults
markupsafe                1.0              py36hd9260cd_1    defaults
matplotlib                3.0.1            py36h5429711_0    defaults
mccabe                    0.6.1            py36h5ad9710_1    defaults
meson                     0.47.1                   py36_0    defaults
metis                     5.1.0                h470a237_3    conda-forge
mistune                   0.8.3            py36h14c3975_1    defaults
mkl                       2018.0.1             h19d6760_4    defaults
mpi4py                    3.0.0                     <pip>
mumps                     5.0.2           blas_openblash613969f_210  [blas_openblas]  conda-forge
nbconvert                 5.3.1            py36hb41ffb7_0    defaults
nbformat                  4.4.0            py36h31c9010_0    defaults
ncurses                   6.0                  h9df7e31_2    defaults
ninja                     1.8.2            py36h6bb024c_1    defaults
nlopt                     2.4.2                    py36_2    conda-forge
notebook                  5.4.1                    py36_0    defaults
numpy                     1.15.3           py36h99e49ec_0    defaults
numpy-base                1.15.3           py36h2f8d375_0    defaults
openblas                  0.2.20                        7    conda-forge
openssl                   1.0.2p               h14c3975_0    defaults
orcid                     0.7.0                      py_1    conda-forge
pagmo                     2.9                  hfc679d8_0    conda-forge
pandas                    0.23.4           py36h04863e7_0    defaults
pandoc                    1.19.2.1             hea2e7c5_1    defaults
pandocfilters             1.4.2            py36ha6701b7_1    defaults
parso                     0.1.1            py36h35f843b_0    defaults
patchelf                  0.9                  hf79760b_2    defaults
pcre                      8.42                 h439df22_0    defaults
pexpect                   4.3.1                    py36_0    defaults
pickleshare               0.7.4            py36h63277f8_0    defaults
Pillow                    5.1.0                     <pip>
pint                      0.8.1                    py36_0    conda-forge
pip                       9.0.1            py36h6c6f9ce_4    defaults
pkginfo                   1.4.1            py36h215d178_1    defaults
platypus-opt              1.0.3                      py_0    conda-forge
pluggy                    0.6.0            py36hb689045_0    anaconda
ply                       3.11                     py36_0    anaconda
prompt_toolkit            1.0.15           py36h17d85b1_0    defaults
ptyprocess                0.5.2            py36h69acd42_0    defaults
py                        1.5.2            py36h29bf505_0    anaconda
pycodestyle               2.3.1            py36hf609f19_0    defaults
pycosat                   0.6.3            py36h0a5515d_0    defaults
pycparser                 2.18             py36hf9f622e_1    defaults
pyflakes                  1.6.0            py36h7bd6a15_0    defaults
pygments                  2.2.0            py36h0d3125c_0    defaults
pygmo                     2.9              py36h5c5fb89_0    conda-forge
pyked                     0.3.0              pyha03cce0_0    pr-omethe-us
pyodbc                    4.0.22           py36hf484d3e_0    defaults
pyopenssl                 17.5.0           py36h20ba746_0    defaults
pyparsing                 2.2.0            py36hee85983_1    defaults
pypng                     0.0.18                    <pip>
PyQRCode                  1.2.1                     <pip>
pyqt                      5.9.2            py36h05f1152_2    defaults
pysocks                   1.6.7            py36hd97a5b1_1    defaults
pytest                    3.4.0                    py36_0    anaconda
python                    3.6.3                h6c0c0dc_5    defaults
python-constraint         1.3.1                    py36_0    conda-forge
python-dateutil           2.6.1                    py36_0    conda-forge
pytz                      2017.3                     py_2    conda-forge
pyyaml                    3.12                     py36_1    conda-forge
pyzmq                     17.0.0           py36h14c3975_0    defaults
qt                        5.9.6                h52aff34_0    defaults
qtconsole                 4.3.1            py36h8f73b5b_0    defaults
readline                  7.0                  ha6073c6_4    defaults
requests                  2.18.4           py36he2e5f8d_1    defaults
ruamel_yaml               0.11.14          py36ha2fb22d_2    defaults
scikit-learn              0.19.2          py36_blas_openblasha84fab4_201  [blas_openblas]  conda-forge
scipy                     1.1.0            py36he2b7bc3_1    defaults
scons                     3.0.1                    py36_1    defaults
scotch                    6.0.5                h9c0d707_1    conda-forge
send2trash                1.5.0                    py36_0    defaults
setuptools                36.5.0           py36he42e2e1_0    defaults
simplegeneric             0.8.1                    py36_2    defaults
simplejson                3.11.1                   py36_0    conda-forge
sip                       4.19.8           py36hf484d3e_0    defaults
six                       1.11.0           py36h372c433_1    defaults
sqlite                    3.23.1               he433501_0    defaults
stopit                    1.1.2                      py_0    conda-forge
terminado                 0.8.1                    py36_1    defaults
testpath                  0.3.1            py36h8cadb63_0    defaults
tk                        8.6.8                hbc83047_0    defaults
tornado                   4.5.3                    py36_0    defaults
traitlets                 4.3.2            py36h674d592_0    defaults
uncertainties             3.0.2                    py36_1    conda-forge
unixodbc                  2.3.4                h1bed415_2    defaults
urllib3                   1.22             py36hbe7ace6_0    defaults
wcwidth                   0.1.7            py36hdf4376a_0    defaults
webencodings              0.5.1            py36h800622e_1    defaults
wheel                     0.30.0           py36hfd4bba0_1    defaults
widgetsnbextension        3.2.1                    py36_0    defaults
xz                        5.2.3                h55aa19d_2    defaults
yaml                      0.1.7                had09818_2    defaults
zeromq                    4.2.5                h439df22_0    defaults
zlib                      1.2.11               ha838bed_2    defaults

minrk commented 3 years ago

Hi! I’m going through and cleaning up old/stale issues on this repo.

Sorry for not responding in a reasonable amount of time!

Feel free to open a new Issue if you are still having this trouble.

In general, the controller IP must be one that the engines can connect to. It must be an IP address, not a hostname (hostnames are for connecting, not binding). If you use --ip=* for the controller, you may also want to set --location to a hostname you know is connectable; it is used for connections when the bind IP is ambiguous.
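
As a rough illustration of that advice, the sketch below builds the two argument combinations described: either bind to the single address a hostname resolves to, or bind to `*` and advertise the hostname through `--location`. `controller_network_args` is a hypothetical helper, not part of ipyparallel; it assumes the hostname resolves to an IPv4 address reachable from the engines, and the resolver is injectable so the function can be tried without DNS.

```python
import socket


def controller_network_args(hostname, bind_all=False, resolve=socket.gethostbyname):
    """Build --ip/--location arguments for the controller.

    hostname must be reachable from the engines. With bind_all the
    controller binds every interface and advertises hostname via
    --location; otherwise it binds the single IPv4 address that
    hostname resolves to.
    """
    if bind_all:
        return ["--ip=*", f"--location={hostname}"]
    return [f"--ip={resolve(hostname)}"]
```

When the result is passed as an argument list (for example to `subprocess.run`), the `*` needs no quoting; when typed into a shell it should be written as `--ip='*'` so the shell does not expand it.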