This captures the exception @willingc saw in #94, which I think is about to be closed.
Turns out another one was captured in #85. Keeping this one since the trace is cleaner.
The sequence of events is:

1. The kernel manager picks a set of currently free ports and writes them to the connection file.
2. The kernel process is launched and tries to bind those ports.
3. By then, another process (e.g., a second kernel starting in parallel in the test suite) has already bound one of the ports, so the bind fails with "Address already in use" and the kernel exits.
4. The restarter relaunches the kernel with the same connection file, so it keeps failing on the same port.
You are right that the kernel can't select a new port, because the manager wouldn't know about it. If the kernel fails to start (for any reason), the manager can try picking new ones. The 'restarting' logic seen in the notebook does not restart from the beginning (creating new connection info); it restarts the kernel process. That is because it is meant for the "kernel died" situation, caused by segfaults and the like, and restarting the kernel should not break existing connections. Handling failed-to-start-the-first-time could be dealt with differently, though.
> Handling failed-to-start-the-first-time could be dealt with differently, though.
This is what I'm after here. I saw this happen today outside of the test cases, in the KG logs. It certainly kept trying to restart the kernel again and again, but always using the already-taken port.
Restarting the process from the top is probably the way to go here. There's always going to be a race condition between finding free ports and the kernel process actually binding them, so the best we can do is retry with new ones.
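As a rough sketch of what "restarting from the top" could look like, assuming the bind failure surfaces as an exception from start_kernel (it often only shows up later in the kernel's own logs), a hypothetical helper could build a brand-new KernelManager per attempt so each try writes a fresh connection file with new ports. start_kernel_with_retries is made up; the KernelManager calls are jupyter_client's public API.

```python
# Hypothetical "retry from the top" helper (start_kernel_with_retries is not
# a real jupyter_client API): a brand-new KernelManager per attempt means a
# fresh connection file, and therefore fresh ports, on every retry.
from jupyter_client import KernelManager


def start_kernel_with_retries(kernel_name="python3", retries=3, **kw):
    last_error = None
    for attempt in range(retries):
        km = KernelManager(kernel_name=kernel_name)
        try:
            # Writes a new connection file with newly chosen ports,
            # then launches the kernel process.
            km.start_kernel(**kw)
            return km
        except Exception as err:  # assumes the failure surfaces here
            last_error = err
            km.cleanup_connection_file()
    raise last_error
```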
From tracing the logic from starting to polling to restarting, I found https://github.com/jupyter/jupyter_client/blob/master/jupyter_client/manager.py#L335. In the call to shutdown_kernel, the restart parameter controls whether the connection file is cleaned up on invocation; it is not cleaned up on a restart.
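To make that concrete, here is a paraphrased sketch (not the verbatim jupyter_client source, and details vary by version) of how restart_kernel and shutdown_kernel interact around the connection file; the method names are from the public KernelManager API.

```python
# Paraphrased sketch of the restart/shutdown flow described above; not the
# actual jupyter_client source.

def restart_kernel(self, now=False, **kw):
    # A restart shuts the old process down with restart=True ...
    self.shutdown_kernel(now=now, restart=True)
    # ... then relaunches with the same launch args, reusing the same
    # connection file and therefore the same ports.
    self.start_kernel(**self._launch_args)

def shutdown_kernel(self, now=False, restart=False):
    # ... stop the kernel process, then clean up ...
    if not restart:
        # Only a real shutdown removes the connection file; a restart keeps
        # it, which is why a kernel that failed to start keeps retrying on
        # the same already-taken port.
        self.cleanup_connection_file()
```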
Is it a valid approach to override the behavior of restart_kernel with respect to the restart parameter? I am wary because this is 4 or 5 calls out of KG and would require copying code to reach it.
What might be tricky is that we really have to distinguish failure to start the first time from normal restarts. Normal restarts should not change the port info; this should only be considered when the kernel has failed to start at all.
It would be nice if we implemented our own subclass of IOLoopKernelRestarter and set it as the _restarter trait on the KernelGatewayIOLoopKernelManager class, so we could override the poll logic to distinguish "failed to launch at all" from "failed on a later restart". I'm not sure what to key off of for that.
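For illustration only, here is a minimal sketch of the kind of restarter subclass being proposed; FirstStartAwareRestarter and its _ever_alive flag are made up, it omits the restart_limit bookkeeping of the stock restarter, and the port-reset step is an assumption about how to force new ports (newer jupyter_client exposes a cleanup_random_ports() helper for roughly this purpose).

```python
# Sketch of a restarter that distinguishes "never came up" from "died after
# running"; class and flag names are hypothetical, not jupyter_client APIs.
from jupyter_client.ioloop.restarter import IOLoopKernelRestarter


class FirstStartAwareRestarter(IOLoopKernelRestarter):
    _ever_alive = False  # set once we have seen the kernel process alive

    def poll(self):
        if self.kernel_manager.is_alive():
            self._ever_alive = True
            self._restarting = False
            return

        if not self._ever_alive:
            # The kernel never managed to start: forget the old ports so the
            # relaunch writes a fresh connection file with new ones.
            self.kernel_manager.cleanup_connection_file()
            for name in ("shell_port", "iopub_port", "stdin_port",
                         "hb_port", "control_port"):
                setattr(self.kernel_manager, name, 0)  # 0 = pick a free port

        # Either way, relaunch the kernel process (restart_limit bookkeeping
        # of the stock KernelRestarter is omitted here for brevity).
        self.kernel_manager.restart_kernel(now=True)
        self._restarting = True
```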
I think @poplav tried this, but we saw exceptions before the restarter even had a chance to kick in. Calls made during the start_kernel logic were raising exceptions too. Seems like we have to guard against both cases?
Has anyone given thought to letting the kernels choose ports for themselves, and communicating them back to the kernel manager somehow? Clearly not a short-term solution, I know.
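Purely to illustrate the idea (this is not how Jupyter kernels select ports today): pyzmq can already bind to an OS-chosen free port and report back which one it got, which is the building block such a scheme would rely on; how the port gets communicated back to the manager is the unsolved part.

```python
# Illustration only: a zmq socket can pick its own free port and tell us
# which one it chose; reporting it back to the manager is the missing piece.
import zmq

ctx = zmq.Context.instance()
shell = ctx.socket(zmq.ROUTER)

# Let zmq/the OS choose a free port in the ephemeral range.
shell_port = shell.bind_to_random_port("tcp://127.0.0.1")
print("kernel bound its shell channel on port", shell_port)
```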
We haven't seen the port conflict problem in quite some time now on travis, so I'm going to close this issue. If there's still motivation to change how kernels do port selection, I think an issue needs to be opened against jupyter/jupyter since it requires a change to a de facto standard about how clients like jupyter/notebook bootstrap kernel processes.
We're seeing the port problems in internal regression tests of the Apache Spark service on Bluemix, though with notebook rather than kernel gateway. I agree that this issue should be addressed in the shared codebase.
I'm trying to run a cell in a Jupyter notebook running remotely (over SSH tunneling), which has always worked fine in the past, but today I'm getting the error:
Traceback (most recent call last):
File "/n/home00/lkelley/.conda/envs/py35/lib/python3.6/runpy.py", line 193, in _run_module_as_main
"__main__", mod_spec)
File "/n/home00/lkelley/.conda/envs/py35/lib/python3.6/runpy.py", line 85, in _run_code
exec(code, run_globals)
File "/n/home00/lkelley/.conda/envs/py35/lib/python3.6/site-packages/ipykernel_launcher.py", line 16, in <module>
app.launch_new_instance()
File "/n/home00/lkelley/.conda/envs/py35/lib/python3.6/site-packages/traitlets/config/application.py", line 657, in launch_instance
app.initialize(argv)
File "<decorator-gen-123>", line 2, in initialize
File "/n/home00/lkelley/.conda/envs/py35/lib/python3.6/site-packages/traitlets/config/application.py", line 87, in catch_config_error
return method(app, *args, **kwargs)
File "/n/home00/lkelley/.conda/envs/py35/lib/python3.6/site-packages/ipykernel/kernelapp.py", line 448, in initialize
self.init_sockets()
File "/n/home00/lkelley/.conda/envs/py35/lib/python3.6/site-packages/ipykernel/kernelapp.py", line 251, in init_sockets
self.init_iopub(context)
File "/n/home00/lkelley/.conda/envs/py35/lib/python3.6/site-packages/ipykernel/kernelapp.py", line 256, in init_iopub
self.iopub_port = self._bind_socket(self.iopub_socket, self.iopub_port)
File "/n/home00/lkelley/.conda/envs/py35/lib/python3.6/site-packages/ipykernel/kernelapp.py", line 180, in _bind_socket
s.bind("tcp://%s:%i" % (self.ip, port))
File "zmq/backend/cython/socket.pyx", line 495, in zmq.backend.cython.socket.Socket.bind (zmq/backend/cython/socket.c:5653)
File "zmq/backend/cython/checkrc.pxd", line 25, in zmq.backend.cython.checkrc._check_rc (zmq/backend/cython/socket.c:10014)
zmq.error.ZMQError: Address already in use
jupyter 4.3.0
update: starting a new notebook server and reconnecting fixed the issue (using the same tunnel and ports).
@parente apologies in advance for commenting on a long-dead thread.
This ZMQ port-selection issue crops up repeatably when using the sparkmagic kernel to talk to Livy, under the condition that you restart the kernel while the Spark context is being initialized. I have not dug too deeply because I'm honestly not sure where to start looking. I believe the sparkmagic kernel sits in a while loop polling Livy until the Spark context is active, and I bet that loop is not terminated quickly enough on kernel restart, so the old kernel process is still holding its ports when the new one tries to bind (see the sketch after the stack trace below). Do you have any obvious threads I can start pulling on here to resolve this restarting-the-kernel-triggers-this-ZMQ-issue?
Stack trace in the Jupyter server logs:
[I 13:38:13.360 LabApp] Kernel restarted: a4a620ee-032a-4251-a1b5-b160a0e6f6e2
Traceback (most recent call last):
File "/opt/continuum/anaconda/envs/anaconda50_hadoop/lib/python3.6/runpy.py", line 193, in _run_module_as_main
"__main__", mod_spec)
File "/opt/continuum/anaconda/envs/anaconda50_hadoop/lib/python3.6/runpy.py", line 85, in _run_code
exec(code, run_globals)
File "/opt/continuum/anaconda/envs/anaconda50_hadoop/lib/python3.6/site-packages/sparkmagic/kernels/pysparkkernel/pysparkkernel.py", line 28, in <module>
IPKernelApp.launch_instance(kernel_class=PySparkKernel)
File "/opt/continuum/anaconda/envs/anaconda50_hadoop/lib/python3.6/site-packages/traitlets/config/application.py", line 657, in launch_instance
app.initialize(argv)
File "<decorator-gen-139>", line 2, in initialize
File "/opt/continuum/anaconda/envs/anaconda50_hadoop/lib/python3.6/site-packages/traitlets/config/application.py", line 87, in catch_config_error
return method(app, *args, **kwargs)
File "/opt/continuum/anaconda/envs/anaconda50_hadoop/lib/python3.6/site-packages/ipykernel/kernelapp.py", line 448, in initialize
self.init_sockets()
File "/opt/continuum/anaconda/envs/anaconda50_hadoop/lib/python3.6/site-packages/ipykernel/kernelapp.py", line 238, in init_sockets
self.shell_port = self._bind_socket(self.shell_socket, self.shell_port)
File "/opt/continuum/anaconda/envs/anaconda50_hadoop/lib/python3.6/site-packages/ipykernel/kernelapp.py", line 180, in _bind_socket
s.bind("tcp://%s:%i" % (self.ip, port))
File "zmq/backend/cython/socket.pyx", line 495, in zmq.backend.cython.socket.Socket.bind (zmq/backend/cython/socket.c:5653)
File "zmq/backend/cython/checkrc.pxd", line 25, in zmq.backend.cython.checkrc._check_rc (zmq/backend/cython/socket.c:10014)
zmq.error.ZMQError: Address already in use
Sparkmagic while loop on spark context initialization: https://github.com/jupyter-incubator/sparkmagic/blob/master/sparkmagic/sparkmagic/livyclientlib/livysession.py#L231
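As a generic illustration of "terminate the loop promptly" (this is not sparkmagic's actual code; session_ready and stop_event are hypothetical), a startup-polling loop can watch a shutdown flag so a restart doesn't leave the old process alive and holding its ZMQ ports:

```python
# Generic sketch of a restart-friendly polling loop; session_ready and
# stop_event are hypothetical stand-ins, not sparkmagic or Livy APIs.
import time


def wait_for_session(session_ready, stop_event, poll_seconds=5, timeout=600):
    """Poll until the session is ready, but bail out promptly on shutdown."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if stop_event.is_set():
            # The kernel is shutting down or restarting: stop polling right
            # away so the process can exit and release its ports.
            return False
        if session_ready():
            return True
        # Event.wait() sleeps but also wakes early if the flag gets set.
        stop_event.wait(poll_seconds)
    return False
```

Here stop_event would be a threading.Event set from the kernel's shutdown handler, so the loop exits within one poll interval of a restart request.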
@ericdill I believe @rolweber was the last to take a crack at patching a particular "address in use" issue we were discussing here, in https://github.com/jupyter/jupyter_client/pull/279 (for https://github.com/jupyter/jupyter_client/issues/276). However, I believe the design Min outlined in https://github.com/jupyter/kernel_gateway/issues/131#issuecomment-197218444 is still largely intact in the jupyter_client package, meaning it's still possible to hit the condition where a kernel manager chooses a set of ports and writes them to a connection file, a kernel process spawns and reads them, the kernel is later restarted, and the new kernel process must use those same ports but finds them still in use by the old kernel process.
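That condition is easy to reproduce outside of Jupyter entirely. The standalone snippet below triggers the same ZMQError seen in the tracebacks above by binding a second socket to a port the first socket still holds, which is exactly the role the old and new kernel processes play:

```python
# Minimal reproduction of "Address already in use": the first socket plays
# the old kernel that hasn't released its ports yet, the second plays the
# restarted kernel that must bind the same ports from the connection file.
import zmq

ctx = zmq.Context.instance()

old_kernel = ctx.socket(zmq.REP)
port = old_kernel.bind_to_random_port("tcp://127.0.0.1")

new_kernel = ctx.socket(zmq.REP)
try:
    new_kernel.bind("tcp://127.0.0.1:%i" % port)
except zmq.ZMQError as e:
    print(e)  # -> Address already in use
```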
Rarely, when running the jupyter/kernel_gateway test suite, the tests fail with the following exception because of a race condition between two kernels trying to use the same port:
IIRC, the port assignment is requested by a client (e.g., via a connection file). Having the kernel retry on another port means the client won't be able to communicate with that kernel.
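For reference, this is roughly what the manager/client side of that contract looks like using jupyter_client's public helper (exact signature and defaults may vary by version): passing 0 for a port means "pick a currently free one and record it", and the kernel must then bind exactly the recorded ports.

```python
# Sketch of manager-side port assignment; the kernel later reads this file
# (passed via the -f argument) and must bind these exact ports.
from jupyter_client.connect import write_connection_file

fname, cfg = write_connection_file(ip="127.0.0.1", transport="tcp")
print(fname)
print(cfg["shell_port"], cfg["iopub_port"], cfg["hb_port"])
```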
Is this behavior spec'd somewhere? Is it the client's responsibility to retry (is that why I sometimes see the rare "kernel has died, restarting" flow when first opening a notebook)? Can the behavior be improved?
/cc @minrk