jupyter-server / kernel_gateway

Jupyter Kernel Gateway
http://jupyter-kernel-gateway.readthedocs.org/en/latest/

Kernel launch failure when address is in use #131

Closed parente closed 7 years ago

parente commented 8 years ago

Rarely, when running the jupyter/kernel_gateway test suite, the tests fail with the following exception, caused by a race condition in which two kernels try to bind the same port:

Traceback (most recent call last):
  File "/opt/python/3.3.5/lib/python3.3/runpy.py", line 160, in _run_module_as_main
    "__main__", fname, loader, pkg_name)
  File "/opt/python/3.3.5/lib/python3.3/runpy.py", line 73, in _run_code
    exec(code, run_globals)
  File "/home/travis/virtualenv/python3.3.5/lib/python3.3/site-packages/ipykernel/__main__.py", line 3, in <module>
    app.launch_new_instance()
  File "/home/travis/virtualenv/python3.3.5/lib/python3.3/site-packages/traitlets/config/application.py", line 595, in launch_instance
    app.initialize(argv)
  File "<decorator-gen-122>", line 2, in initialize
  File "/home/travis/virtualenv/python3.3.5/lib/python3.3/site-packages/traitlets/config/application.py", line 74, in catch_config_error
    return method(app, *args, **kwargs)
  File "/home/travis/virtualenv/python3.3.5/lib/python3.3/site-packages/ipykernel/kernelapp.py", line 412, in initialize
    self.init_sockets()
  File "/home/travis/virtualenv/python3.3.5/lib/python3.3/site-packages/ipykernel/kernelapp.py", line 245, in init_sockets
    self.init_iopub(context)
  File "/home/travis/virtualenv/python3.3.5/lib/python3.3/site-packages/ipykernel/kernelapp.py", line 250, in init_iopub
    self.iopub_port = self._bind_socket(self.iopub_socket, self.iopub_port)
  File "/home/travis/virtualenv/python3.3.5/lib/python3.3/site-packages/ipykernel/kernelapp.py", line 174, in _bind_socket
    s.bind("tcp://%s:%i" % (self.ip, port))
  File "zmq/backend/cython/socket.pyx", line 487, in zmq.backend.cython.socket.Socket.bind (zmq/backend/cython/socket.c:5156)
  File "zmq/backend/cython/checkrc.pxd", line 25, in zmq.backend.cython.checkrc._check_rc (zmq/backend/cython/socket.c:7535)
zmq.error.ZMQError: Address already in use

IIRC, the port assignment is requested by a client (e.g., via a connection file). Having the kernel retry on another port means the client won't be able to communicate with that kernel.

Is this behavior spec'ed somewhere? Is it the client's responsibility to retry (e.g., is this why I sometimes see the rare "kernel has died, restarting" flow when first opening a notebook)? Can the behavior be improved?

/cc @minrk
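For anyone following along: the ports are chosen by the manager before the kernel process even exists. A minimal sketch of that step using jupyter_client's public write_connection_file (the IP and the zeroed ports are just example inputs; the printed values are illustrative):

from jupyter_client.connect import write_connection_file

# Passing 0 for each port asks jupyter_client to find a free port by briefly
# binding a socket and closing it again, then recording the number.
fname, cfg = write_connection_file(
    ip="127.0.0.1",
    shell_port=0, iopub_port=0, stdin_port=0, hb_port=0, control_port=0,
)
print(fname)              # connection file later handed to the kernel process
print(cfg["iopub_port"])  # one of the ports the kernel will try to bind

# Nothing reserves these ports between now and the moment the spawned kernel
# binds them, which is exactly the window where "Address already in use" hits.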

parente commented 8 years ago

This captures the exception @willingc saw in #94, which I think is about to be closed.

parente commented 8 years ago

Turns out another one was captured in #85. Keeping this one since the trace is cleaner.

minrk commented 8 years ago

The sequence of events is:

1. The kernel manager picks a set of free ports and writes them to a connection file.
2. The kernel process is spawned, reads the connection file, and binds those ports.
3. Clients connect to the kernel using the same connection info.

You are right that the kernel can't select a new port, because the manager wouldn't know about it. If the kernel fails to start (for any reason), the manager can try creating a new one. The 'restarting' logic seen in the notebook does not restart from the beginning (creating new connection info); it restarts the kernel process. This is because it is meant for the "kernel died" situation, caused by segfaults and the like, and restarting should not break existing connections to the kernel. Handling failed-to-start-the-first-time could be dealt with differently, though.
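To make that distinction concrete, a small sketch against jupyter_client's KernelManager (public methods only; error handling omitted) showing that a restart reuses the existing connection info rather than generating new ports:

from jupyter_client import KernelManager

km = KernelManager(kernel_name="python3")
km.start_kernel()
before = km.get_connection_info()   # ports written at first start

km.restart_kernel(now=True)         # the "kernel died" style restart
after = km.get_connection_info()

# Same ports before and after: existing client connections keep working, but
# a port that is somehow still occupied remains a problem on restart.
assert before["shell_port"] == after["shell_port"]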

parente commented 8 years ago

Handling failed-to-start-the-first-time could be dealt with differently, though.

This is what I'm after here. I saw this happen today in the KG logs, outside of the test cases. It certainly kept trying to restart the kernel again and again, but always with the already-taken port.

Restarting the process from the top is probably the way to go here. There's always going to be a race condition between finding ports and using them in the kernel process, so the best we can do is retry with new ones.
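Something like the following is what I mean by restarting from the top (a rough sketch, not actual KG code: start_kernel, is_alive, shutdown_kernel, and cleanup_connection_file are real KernelManager methods, but the retry helper and the port-trait reset are assumptions about how to force fresh ports):

from jupyter_client import KernelManager

def start_with_fresh_ports(km: KernelManager, attempts: int = 3) -> None:
    # Hypothetical helper: retry a failed first launch with brand-new ports,
    # instead of re-launching against connection info that may be taken.
    for _ in range(attempts):
        km.start_kernel()
        if km.is_alive():          # in practice, poll briefly before checking
            return
        km.shutdown_kernel(now=True)
        km.cleanup_connection_file()
        # Reset the cached ports so the next connection file gets new ones.
        for name in ("shell_port", "iopub_port", "stdin_port", "hb_port", "control_port"):
            setattr(km, name, 0)
    raise RuntimeError("kernel failed to start after %d attempts" % attempts)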

poplav commented 8 years ago

From tracing the logic from starting to polling to restarting, I found https://github.com/jupyter/jupyter_client/blob/master/jupyter_client/manager.py#L335. In the call to shutdown_kernel, the restart parameter controls whether the connection file is cleaned up on invocation; it is not cleaned on a restart.
Is it a valid approach to attempt to override the behavior of restart_kernel with respect to the restart parameter? I am wary because this is 4 or 5 calls removed from KG and will require copied code to reach it.
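For reference, the call shape in question, paraphrased rather than copied from jupyter_client:

from jupyter_client import KernelManager

km = KernelManager(kernel_name="python3")
km.start_kernel()

# restart=True keeps the connection file (and therefore the ports) in place,
# which is what the normal restart path wants.
km.shutdown_kernel(now=True, restart=True)
km.start_kernel()

# restart=False (the default) cleans up the connection file on the way down,
# so a later start could write fresh connection info.
km.shutdown_kernel(now=True)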

minrk commented 8 years ago

What might be tricky is that we really have to distinguish failure to start the first time from normal restarts. Normal restarts should not change the port info; changing it should only even be considered when the kernel has failed to start at all.

parente commented 8 years ago

It would be nice if, by implementing our own subclass of IOLoopKernelRestarter and setting it as the _restarter trait in the KernelGatewayIOLoopKernelManager class, we could override the poll logic to distinguish "failed to launch at all" from "failed on a later restart". Not sure what to key off of for that.

I think @poplav tried this, but we saw exceptions before the restarter even had a chance to kick in. Calls made during the start_kernel logic were raising exceptions too. Seems like we have to guard against both cases?
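Roughly what I have in mind, for the record (a hypothetical subclass: the import path and the poll() hook come from jupyter_client, the _ever_alive flag is made up, and as noted above start_kernel itself would still need its own guard):

from jupyter_client.ioloop.restarter import IOLoopKernelRestarter

class FirstLaunchAwareRestarter(IOLoopKernelRestarter):
    """Tell 'never came up at all' apart from 'died after running fine'."""

    _ever_alive = False  # made-up flag, not a jupyter_client trait

    def poll(self):
        if self.kernel_manager.is_alive():
            self._ever_alive = True
            return super().poll()
        if not self._ever_alive:
            # Kernel failed to start in the first place (e.g. its ports were
            # already taken): tear it down and start over with fresh
            # connection info instead of a plain in-place restart.
            ...
        else:
            # Kernel died after having run: fall back to the normal restart,
            # which keeps the existing connection info.
            return super().poll()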

rolweber commented 7 years ago

Has anyone given thought to letting the kernels choose ports for themselves, and communicating them back to the kernel manager somehow? Clearly not a short-term solution, I know.

parente commented 7 years ago

We haven't seen the port conflict problem in quite some time now on travis, so I'm going to close this issue. If there's still motivation to change how kernels do port selection, I think an issue needs to be opened against jupyter/jupyter since it requires a change to a de facto standard about how clients like jupyter/notebook bootstrap kernel processes.

rolweber commented 7 years ago

We're seeing the port problems in internal regression tests of the Apache Spark service on Bluemix, but for notebook rather than kernel gateway. I agree that this issue should be addressed in the shared codebase.

lzkelley commented 7 years ago

I'm trying to run a cell in a Jupyter notebook running remotely (accessed via SSH tunneling). This has always worked fine in the past, but today I'm getting the error:

Traceback (most recent call last):
  File "/n/home00/lkelley/.conda/envs/py35/lib/python3.6/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/n/home00/lkelley/.conda/envs/py35/lib/python3.6/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/n/home00/lkelley/.conda/envs/py35/lib/python3.6/site-packages/ipykernel_launcher.py", line 16, in <module>
    app.launch_new_instance()
  File "/n/home00/lkelley/.conda/envs/py35/lib/python3.6/site-packages/traitlets/config/application.py", line 657, in launch_instance
    app.initialize(argv)
  File "<decorator-gen-123>", line 2, in initialize
  File "/n/home00/lkelley/.conda/envs/py35/lib/python3.6/site-packages/traitlets/config/application.py", line 87, in catch_config_error
    return method(app, *args, **kwargs)
  File "/n/home00/lkelley/.conda/envs/py35/lib/python3.6/site-packages/ipykernel/kernelapp.py", line 448, in initialize
    self.init_sockets()
  File "/n/home00/lkelley/.conda/envs/py35/lib/python3.6/site-packages/ipykernel/kernelapp.py", line 251, in init_sockets
    self.init_iopub(context)
  File "/n/home00/lkelley/.conda/envs/py35/lib/python3.6/site-packages/ipykernel/kernelapp.py", line 256, in init_iopub
    self.iopub_port = self._bind_socket(self.iopub_socket, self.iopub_port)
  File "/n/home00/lkelley/.conda/envs/py35/lib/python3.6/site-packages/ipykernel/kernelapp.py", line 180, in _bind_socket
    s.bind("tcp://%s:%i" % (self.ip, port))
  File "zmq/backend/cython/socket.pyx", line 495, in zmq.backend.cython.socket.Socket.bind (zmq/backend/cython/socket.c:5653)
  File "zmq/backend/cython/checkrc.pxd", line 25, in zmq.backend.cython.checkrc._check_rc (zmq/backend/cython/socket.c:10014)
zmq.error.ZMQError: Address already in use

jupyter 4.3.0


Update: starting a new notebook server and reconnecting fixed the issue (using the same tunnel and ports).
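If anyone hits this again, a quick stdlib-only way to check which of the kernel's ports is already taken (the connection-file path below is an example; use the actual kernel-*.json from jupyter --runtime-dir):

import json
import socket

# Example path only; substitute the real connection file for the kernel.
with open("kernel-example.json") as f:
    cfg = json.load(f)

for name in ("shell_port", "iopub_port", "stdin_port", "control_port", "hb_port"):
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        try:
            s.bind((cfg.get("ip", "127.0.0.1"), cfg[name]))
            status = "free"
        except OSError:
            status = "already in use"
    print(name, cfg[name], status)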

ericdill commented 5 years ago

@parente apologies in advance for commenting on a long-dead thread.

This ZMQ port selection issue reproducibly crops up when using the sparkmagic kernel to talk to Livy, if you restart the kernel while the Spark context is being initialized. I have not dug too deeply because I'm honestly not sure where to start looking. I believe what is going on with the sparkmagic kernel is that it sits in a while loop polling Livy until the Spark context is active, and I bet that this while loop is not being terminated quickly enough on kernel restart. Do you have any obvious threads I can start pulling on to attempt to resolve this restart-triggers-the-ZMQ-issue problem?

Stack trace in the Jupyter server logs:

[I 13:38:13.360 LabApp] Kernel restarted: a4a620ee-032a-4251-a1b5-b160a0e6f6e2
Traceback (most recent call last):
  File "/opt/continuum/anaconda/envs/anaconda50_hadoop/lib/python3.6/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/opt/continuum/anaconda/envs/anaconda50_hadoop/lib/python3.6/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/opt/continuum/anaconda/envs/anaconda50_hadoop/lib/python3.6/site-packages/sparkmagic/kernels/pysparkkernel/pysparkkernel.py", line 28, in <module>
    IPKernelApp.launch_instance(kernel_class=PySparkKernel)
  File "/opt/continuum/anaconda/envs/anaconda50_hadoop/lib/python3.6/site-packages/traitlets/config/application.py", line 657, in launch_instance
    app.initialize(argv)
  File "<decorator-gen-139>", line 2, in initialize
  File "/opt/continuum/anaconda/envs/anaconda50_hadoop/lib/python3.6/site-packages/traitlets/config/application.py", line 87, in catch_config_error
    return method(app, *args, **kwargs)
  File "/opt/continuum/anaconda/envs/anaconda50_hadoop/lib/python3.6/site-packages/ipykernel/kernelapp.py", line 448, in initialize
    self.init_sockets()
  File "/opt/continuum/anaconda/envs/anaconda50_hadoop/lib/python3.6/site-packages/ipykernel/kernelapp.py", line 238, in init_sockets
    self.shell_port = self._bind_socket(self.shell_socket, self.shell_port)
  File "/opt/continuum/anaconda/envs/anaconda50_hadoop/lib/python3.6/site-packages/ipykernel/kernelapp.py", line 180, in _bind_socket
    s.bind("tcp://%s:%i" % (self.ip, port))
  File "zmq/backend/cython/socket.pyx", line 495, in zmq.backend.cython.socket.Socket.bind (zmq/backend/cython/socket.c:5653)
  File "zmq/backend/cython/checkrc.pxd", line 25, in zmq.backend.cython.checkrc._check_rc (zmq/backend/cython/socket.c:10014)
zmq.error.ZMQError: Address already in use

Sparkmagic while loop on spark context initialization: https://github.com/jupyter-incubator/sparkmagic/blob/master/sparkmagic/sparkmagic/livyclientlib/livysession.py#L231
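A small psutil-based sketch for finding which process is still holding one of those ports when the restart fails (the port number is an example; a reasonably recent psutil is assumed, and on some platforms listing other processes' sockets needs elevated privileges):

import psutil

def who_owns_port(port: int):
    # Return (pid, name, cmdline) for every process bound to the given port.
    owners = []
    for conn in psutil.net_connections(kind="tcp"):
        if conn.laddr and conn.laddr.port == port and conn.pid:
            proc = psutil.Process(conn.pid)
            owners.append((proc.pid, proc.name(), " ".join(proc.cmdline())))
    return owners

print(who_owns_port(54321))  # example: use the shell_port from the kernel's connection file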

parente commented 5 years ago

@ericdill I believe @rolweber was the last to take a crack at patching a particular "address in use" issue we were discussing here, in https://github.com/jupyter/jupyter_client/pull/279 via https://github.com/jupyter/jupyter_client/issues/276. However, I believe the design Min outlined in https://github.com/jupyter/kernel_gateway/issues/131#issuecomment-197218444 is still largely intact in the jupyter_client package. That means it's still possible to hit the sequence where a kernel manager chooses a set of ports and writes them to a connection file; a kernel process spawns and reads them; the kernel is later restarted; and the new kernel process must reuse those same ports but finds them still in use by the old kernel process.