globus / globus-compute

Globus Compute: High Performance Function Serving for Science
https://www.globus.org/compute
Apache License 2.0

ZMQ assertion error when using pyzmq==19.0.2 #393

Open yadudoc opened 3 years ago

yadudoc commented 3 years ago

I've started two endpoints on Cori, one with a LocalProvider and the other with the SlurmProvider to use the Cori batch compute nodes for function execution. In both cases, the endpoints appear to start without issues, but the one configured to use batch queues failed with the following message:

(funcx_v0.0.6a2_py3.8) yadunand@cori08:~/.funcx/cori_batch> cat interchange.std*
2021-03-09 15:02:53 funcx.sdk.client.FuncXClient:79 [INFO]  [instance:46912593216800] Creating client of type <class 'funcx.sdk.client.FuncXClient'> for service "funcX"
2021-03-09 15:02:54 funcx.sdk.client.FuncXClient:79 [INFO]  [instance:46912593283728] Creating client of type <class 'funcx.sdk.client.FuncXClient'> for service "funcX"
Assertion failed: nbytes == sizeof (dummy) (src/signaler.cpp:391)
Yadu : starting local interchange
Starting local interchange with endpoint id: d26372e5-ed4c-42e5-bb54-aaadcb917d8d
Yadu : started local interchange with ports: 54793. 54361

The same error popped up even after multiple restarts of the cori_batch endpoint while, oddly, the cori_local endpoint did not have any issues.

This is a ZMQ error, and updating the pyzmq package appears to have fixed it. Note that the latest version of pyzmq is 22.0.0, which our current requirement string pyzmq>=19.0.0,<20.0.0 does not allow. If we update the version string now and test against it, it would be good to use the latest pyzmq for the 0.0.6 release.
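To make the version constraint concrete, here is a minimal sketch of how the pin pyzmq>=19.0.0,<20.0.0 treats the two versions mentioned above. It uses only the standard library and assumes plain X.Y.Z release strings; the helper names parse_version and in_pinned_range are hypothetical and not part of funcX or pyzmq.

```python
# Sketch: check a pyzmq version string against the pin quoted above.
# parse_version and in_pinned_range are illustrative helpers, not funcX code.
# Assumes plain "X.Y.Z" release strings (no rc/dev suffixes).


def parse_version(text):
    """Turn a plain 'X.Y.Z' release string into a tuple of ints."""
    return tuple(int(part) for part in text.split("."))


def in_pinned_range(version, lower="19.0.0", upper="20.0.0"):
    """Return True if lower <= version < upper, mirroring pyzmq>=19.0.0,<20.0.0."""
    return parse_version(lower) <= parse_version(version) < parse_version(upper)


if __name__ == "__main__":
    # 19.0.2 is the version from the issue title; 22.0.0 is the latest release cited.
    for candidate in ("19.0.2", "22.0.0"):
        print(candidate, in_pinned_range(candidate))
```

Under these assumptions, 19.0.2 falls inside the pinned range and 22.0.0 falls outside it, which is why pip will not pick up the newer release until the requirement string is widened.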

With the updated ZMQ package, the endpoint passes all tests.

yadudoc commented 3 years ago

With gcc==7.5.0, recompiling pyzmq rather than using the wheel from PyPI seems to fix this issue:

pip install --no-binary :all: --force-reinstall pyzmq
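After reinstalling, it may help to confirm which pyzmq and libzmq versions the environment actually picks up. The following is a rough sketch, not a verified fix: zmq_versions is a hypothetical helper name, and it relies only on pyzmq's zmq.pyzmq_version() and zmq.zmq_version() functions.

```python
# Sketch: report the pyzmq and libzmq versions visible to this interpreter.
# zmq_versions is an illustrative helper name, not part of funcX.
import importlib.util


def zmq_versions():
    """Return pyzmq/libzmq version strings, or None if pyzmq is not installed."""
    if importlib.util.find_spec("zmq") is None:
        return None
    import zmq  # imported lazily so the helper also works without pyzmq

    return {
        "pyzmq": zmq.pyzmq_version(),  # version of the Python bindings
        "libzmq": zmq.zmq_version(),   # version of the underlying C++ library
    }


if __name__ == "__main__":
    print(zmq_versions())
```

Running this inside the endpoint's environment before and after the reinstall shows whether the bindings or the bundled libzmq actually changed.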

ZhuozhaoLi commented 3 years ago

I am seeing this error again on the river endpoint, even with the fixes above applied: gcc==7.4.0 and pip install --no-binary :all: --force-reinstall pyzmq

yadudoc commented 3 years ago

The hope is that a future pyzmq release will address this. Until then, we ought to keep this issue open, since the workaround above is somewhat iffy.