jupyter / nb2kg

Culling kernels throws 500 error #31

Closed: Gsbreddy closed this issue 5 years ago

Gsbreddy commented 5 years ago

Hi, I am using the gateway with the setting --MappingKernelManager.cull_idle_timeout=60000 enabled. The kernels are getting culled, but in the notebook I get a popup saying "500 internal server error" from nb2kg once the kernel attached to it is culled. Is this expected behaviour?

kevin-bates commented 5 years ago

Hi Sai, No, this is not the expected behavior. Is the popup appearing right at the time of the culling (w/o user action) or on first attempted use after the culling?

I'm unable to reproduce this (using a timeout of 60s and an interval of 10s). Could you see if a shorter timeout still leads to the issue?

Please include the exception/traceback that is produced in the Notebook log and/or any unusual messages at that time in the gateway log.

One of the "problems" with culling is that the client really has no idea it's happened - so silence and a notebook that doesn't respond to cell invocations is the expected behavior. 😞 It would be nice if there were a mechanism by which the culling could be reported to (or detected by) the "client side" such that a "Culled" box appears.

Gsbreddy commented 5 years ago

It's like this:

  1. Open a notebook and launch a kernel in Spark Python YARN cluster mode.
  2. Let the kernel attach to the notebook and check it by running something in the cells.
  3. Then go to the YARN resource manager and kill the kernel (which would ideally be done by the culling config).
  4. Then go back to the notebook; it shows a 500 internal server error. In the Enterprise Gateway logs I can also find this stack trace:

Traceback (most recent call last):
  File "/opt/python/python35/lib/python3.5/site-packages/tornado/web.py", line 1699, in _execute
    result = await result
  File "/opt/python/python35/lib/python3.5/asyncio/futures.py", line 363, in __iter__
    return self.result()  # May raise too.
  File "/opt/python/python35/lib/python3.5/asyncio/futures.py", line 274, in result
    raise self._exception
  File "/opt/python/python35/lib/python3.5/site-packages/tornado/gen.py", line 736, in run
    yielded = self.gen.throw(*exc_info)  # type: ignore
  File "/opt/python/python35/lib/python3.5/site-packages/notebook/services/kernels/handlers.py", line 241, in get
    yield super(ZMQChannelsHandler, self).get(kernel_id=kernel_id)
  File "/opt/python/python35/lib/python3.5/site-packages/tornado/gen.py", line 729, in run
    value = future.result()
  File "/opt/python/python35/lib/python3.5/asyncio/futures.py", line 274, in result
    raise self._exception
  File "/opt/python/python35/lib/python3.5/site-packages/tornado/gen.py", line 736, in run
    yielded = self.gen.throw(*exc_info)  # type: ignore
  File "/opt/python/python35/lib/python3.5/site-packages/notebook/base/zmqhandlers.py", line 295, in get
    yield gen.maybe_future(res)
  File "/opt/python/python35/lib/python3.5/site-packages/tornado/gen.py", line 729, in run
    value = future.result()
  File "/opt/python/python35/lib/python3.5/asyncio/futures.py", line 274, in result
    raise self._exception
  File "/opt/python/python35/lib/python3.5/site-packages/tornado/gen.py", line 742, in run
    yielded = self.gen.send(value)
  File "/opt/python/python35/lib/python3.5/site-packages/notebook/services/kernels/handlers.py", line 223, in pre_get
    kernel = self.kernel_manager.get_kernel(self.kernel_id)
  File "/opt/python/python35/lib/python3.5/site-packages/jupyter_client/multikernelmanager.py", line 227, in get_kernel
    return self._kernels[kernel_id]
KeyError: '07a6e27e-d390-459d-a4d6-a61f9ecb4894'

500 GET /api/kernels/07a6e27e-d390-459d-a4d6-a61f9ecb4894/channels 5.49ms


Is there any chance we could handle this culling in the gateway, so that the client can be told and show the "Culled" box you mentioned?
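
The traceback above ends in a bare KeyError raised by the kernel lookup, and an uncaught exception in a request handler is what gets reported as a 500. The following is a minimal sketch of that failure mode only, using a plain dictionary as a stand-in for the kernel manager. The function names and the choice of 404 for the handled case are illustrative assumptions, not code from nb2kg, Notebook, or the gateway.

```python
# Hypothetical stand-in for the kernel manager's id -> kernel mapping.
kernels = {"live-kernel-id": object()}


def status_unhandled(kernel_id):
    """Mimic a handler that lets the KeyError escape: the result is a 500."""
    try:
        kernels[kernel_id]  # raises KeyError for an unknown (e.g. removed) id
        return 200
    except Exception:
        # Stand-in for a web framework's catch-all for uncaught exceptions.
        return 500


def status_handled(kernel_id):
    """Mimic a handler that checks first and reports a missing kernel as 404."""
    if kernel_id not in kernels:
        return 404
    return 200


if __name__ == "__main__":
    missing = "07a6e27e-d390-459d-a4d6-a61f9ecb4894"
    print(status_unhandled(missing))
    print(status_handled(missing))
```

A 404 would at least let a client tell "this kernel no longer exists" apart from a genuine server fault.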

kevin-bates commented 5 years ago

I'd like to see the gateway log, but first let's explore item 3. If you're manually using YARN to terminate the YARN application, this is not simulating culling. It terminates the kernel without Jupyter's knowledge and likely results in Jupyter attempting to restart the kernel (which would be visible in the gateway logs). Jupyter then has no idea what happened to its kernel, and yes, that would very likely result in error code 500.

The culling feature, on the other hand, stems from a monitor loop within the gateway server that checks the activity status of each kernel. Any kernels that meet the culling criteria are then gracefully shut down. The shutdown goes through YARN, and you should see an application status of FINISHED (rather than KILLED).
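
To make the monitor loop described above concrete, here is a rough sketch of a single pass: compare each kernel's last activity against an idle timeout and gracefully shut down the ones that exceed it. The names `kernels_to_cull`, `cull_pass`, and the `shutdown` callback are hypothetical, and the real gateway logic may apply further criteria (for example, whether a kernel is busy or still has connections).

```python
from datetime import datetime, timedelta


def kernels_to_cull(last_activity, now, idle_timeout):
    """Return ids of kernels idle for longer than idle_timeout seconds."""
    limit = timedelta(seconds=idle_timeout)
    return [kid for kid, seen in last_activity.items() if now - seen > limit]


def cull_pass(last_activity, now, idle_timeout, shutdown):
    """One iteration of the monitor loop: shut down each idle kernel."""
    culled = kernels_to_cull(last_activity, now, idle_timeout)
    for kid in culled:
        shutdown(kid)           # graceful shutdown, not a kill
        del last_activity[kid]  # the server no longer tracks this kernel id
    return culled


if __name__ == "__main__":
    now = datetime(2019, 1, 1, 12, 0, 0)
    activity = {
        "recent": now - timedelta(seconds=5),
        "idle": now - timedelta(seconds=120),
    }
    # With a 60 second timeout only the kernel idle for 120 seconds is culled.
    print(cull_pass(activity, now, idle_timeout=60, shutdown=lambda kid: None))
```

In a running server a pass like this would be repeated at a fixed interval, which is why culling happens without any client request being involved.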

Because the culling of kernels happens in a different "session" (it's a system-owned monitor loop), the client would need to detect that the kernel is no longer active on the server in order to present a "Culled" box. This hasn't been done yet, either when culling happens across a network via NB2KG/Gateway or in vanilla Notebook. It would likely be a change in Notebook that may need supporting code in a gateway scenario, depending on what is implemented.
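
One way a client could detect this, sketched here purely as an illustration, is to ask the server about the kernel and treat "not found" as a sign that it was culled or otherwise removed. The sketch assumes the server answers GET /api/kernels/<kernel_id> with 404 for an unknown kernel id. The helper names are hypothetical, authentication is omitted, and nothing like this exists in Notebook or NB2KG as far as this thread states.

```python
import urllib.error
import urllib.request


def fetch_status(url):
    """Return the HTTP status code for a GET request to url."""
    try:
        with urllib.request.urlopen(url) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        return err.code


def kernel_state(base_url, kernel_id, fetch=fetch_status):
    """Classify a kernel as 'alive', 'gone', or 'unknown' from the status code."""
    url = "{}/api/kernels/{}".format(base_url.rstrip("/"), kernel_id)
    status = fetch(url)
    if status == 200:
        return "alive"
    if status == 404:
        return "gone"  # assumption: the server no longer knows this kernel id
    return "unknown"


if __name__ == "__main__":
    # Fake fetcher so the sketch runs without a server.
    print(kernel_state("http://localhost:8888/", "some-id", fetch=lambda url: 404))
```

A front end could poll this periodically, or call it when a cell stops responding, and show a "Culled" notice on "gone".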