jupyter-server / enterprise_gateway

A lightweight, multi-tenant, scalable and secure gateway that enables Jupyter Notebooks to share resources across distributed clusters such as Apache Spark, Kubernetes and others.
https://jupyter-enterprise-gateway.readthedocs.io/en/latest/

Avoid too many retries on kernel shutdown #1261

Closed lresende closed 1 year ago

lresende commented 1 year ago

When we try to delete a kernel that does not exist, we should remove the kernel from the manager's bookkeeping to avoid infinite retries of the shutdown.

[E 2023-02-23 18:28:13.907 EnterpriseGatewayApp] Exception while shutting down kernel: '9a71f533-3998-4cb2-a1bf-2c2f05b33243': '9a71f533-3998-4cb2-a1bf-2c2f05b33243'
    Traceback (most recent call last):
      File "/opt/conda/lib/python3.9/site-packages/enterprise_gateway/services/kernels/remotemanager.py", line 253, in shutdown_kernel
        await super().shutdown_kernel(kernel_id, now, restart)
      File "/opt/conda/lib/python3.9/site-packages/jupyter_server/services/kernels/kernelmanager.py", line 671, in shutdown_kernel
        return await self.pinned_superclass.shutdown_kernel(
      File "/opt/conda/lib/python3.9/site-packages/jupyter_client/multikernelmanager.py", line 509, in shutdown_kernel
        self.remove_kernel(kernel_id)
      File "/opt/conda/lib/python3.9/site-packages/enterprise_gateway/services/kernels/remotemanager.py", line 328, in remove_kernel
        super().remove_kernel(kernel_id)
      File "/opt/conda/lib/python3.9/site-packages/jupyter_client/multikernelmanager.py", line 244, in remove_kernel
        return self._kernels.pop(kernel_id)
    KeyError: '9a71f533-3998-4cb2-a1bf-2c2f05b33243'
[E 2023-02-23 18:28:13.907 EnterpriseGatewayApp] The following exception was encountered while checking the idle duration of kernel 9a71f533-3998-4cb2-a1bf-2c2f05b33243: HTTP 404: Not Found (Kernel does not exist: 9a71f533-3998-4cb2-a1bf-2c2f05b33243)
    Traceback (most recent call last):
      File "/opt/conda/lib/python3.9/site-packages/jupyter_server/services/kernels/kernelmanager.py", line 591, in cull_kernels
        await self.cull_kernel_if_idle(kernel_id)
      File "/opt/conda/lib/python3.9/site-packages/jupyter_server/services/kernels/kernelmanager.py", line 639, in cull_kernel_if_idle
        await ensure_async(self.shutdown_kernel(kernel_id))
      File "/opt/conda/lib/python3.9/site-packages/jupyter_server/utils.py", line 182, in ensure_async
        result = await obj
      File "/opt/conda/lib/python3.9/site-packages/enterprise_gateway/services/kernels/remotemanager.py", line 256, in shutdown_kernel
        raise web.HTTPError(404, "Kernel does not exist: %s" % kernel_id) from None
    tornado.web.HTTPError: HTTP 404: Not Found (Kernel does not exist: 9a71f533-3998-4cb2-a1bf-2c2f05b33243)
[D 2023-02-23 18:28:13.907 EnterpriseGatewayApp] kernel_id=8a3b508a-41bb-4ef4-a9ff-8470a7ed1659, kernel_name=spark_32_python_kubernetes, last_activity=2023-02-23 17:21:05.848081+00:00
[W 2023-02-23 18:28:13.907 EnterpriseGatewayApp] Culling 'starting' kernel 'spark_32_python_kubernetes' (8a3b508a-41bb-4ef4-a9ff-8470a7ed1659) with 0 connections due to 4028 seconds of inactivity.
[D 2023-02-23 18:28:13.908 EnterpriseGatewayApp] Clearing buffer for 8a3b508a-41bb-4ef4-a9ff-8470a7ed1659
[I 2023-02-23 18:28:13.908 EnterpriseGatewayApp] Kernel shutdown: 8a3b508a-41bb-4ef4-a9ff-8470a7ed1659
[D 2023-02-23 18:28:13.908 EnterpriseGatewayApp] ERROR: ECONNREFUSED, no process listening, cannot send signal.
[D 2023-02-23 18:28:13.909 EnterpriseGatewayApp] OSError(ENOTCONN) raised on socket shutdown, listener has likely already exited. Cannot send '{'shutdown': 1}'
[E 2023-02-23 18:28:13.909 EnterpriseGatewayApp] Exception while shutting down kernel: '9a71f533-3998-4cb2-a1bf-2c2f05b33243': '9a71f533-3998-4cb2-a1bf-2c2f05b33243'
    Traceback (most recent call last):
      File "/opt/conda/lib/python3.9/site-packages/enterprise_gateway/services/kernels/remotemanager.py", line 253, in shutdown_kernel
        await super().shutdown_kernel(kernel_id, now, restart)
      File "/opt/conda/lib/python3.9/site-packages/jupyter_server/services/kernels/kernelmanager.py", line 671, in shutdown_kernel
        return await self.pinned_superclass.shutdown_kernel(
      File "/opt/conda/lib/python3.9/site-packages/jupyter_client/multikernelmanager.py", line 509, in shutdown_kernel
        self.remove_kernel(kernel_id)
      File "/opt/conda/lib/python3.9/site-packages/enterprise_gateway/services/kernels/remotemanager.py", line 328, in remove_kernel
        super().remove_kernel(kernel_id)
      File "/opt/conda/lib/python3.9/site-packages/jupyter_client/multikernelmanager.py", line 244, in remove_kernel
        return self._kernels.pop(kernel_id)
    KeyError: '9a71f533-3998-4cb2-a1bf-2c2f05b33243'
[E 2023-02-23 18:28:13.909 EnterpriseGatewayApp] The following exception was encountered while checking the idle duration of kernel 9a71f533-3998-4cb2-a1bf-2c2f05b33243: HTTP 404: Not Found (Kernel does not exist: 9a71f533-3998-4cb2-a1bf-2c2f05b33243)
    Traceback (most recent call last):
      File "/opt/conda/lib/python3.9/site-packages/jupyter_server/services/kernels/kernelmanager.py", line 591, in cull_kernels
        await self.cull_kernel_if_idle(kernel_id)
      File "/opt/conda/lib/python3.9/site-packages/jupyter_server/services/kernels/kernelmanager.py", line 639, in cull_kernel_if_idle
        await ensure_async(self.shutdown_kernel(kernel_id))
      File "/opt/conda/lib/python3.9/site-packages/jupyter_server/utils.py", line 182, in ensure_async
        result = await obj
      File "/opt/conda/lib/python3.9/site-packages/enterprise_gateway/services/kernels/remotemanager.py", line 256, in shutdown_kernel
        raise web.HTTPError(404, "Kernel does not exist: %s" % kernel_id) from None
    tornado.web.HTTPError: HTTP 404: Not Found (Kernel does not exist: 9a71f533-3998-4cb2-a1bf-2c2f05b33243)
kevin-bates commented 1 year ago

Hi @lresende. I suspect this is coming from the culling logic. Was there a set of lines preceding the first error log statement in the above output similar to these (which appear later in the output)?

[D 2023-02-23 18:28:13.907 EnterpriseGatewayApp] kernel_id=8a3b508a-41bb-4ef4-a9ff-8470a7ed1659, kernel_name=spark_32_python_kubernetes, last_activity=2023-02-23 17:21:05.848081+00:00
[W 2023-02-23 18:28:13.907 EnterpriseGatewayApp] Culling 'starting' kernel 'spark_32_python_kubernetes' (8a3b508a-41bb-4ef4-a9ff-8470a7ed1659) with 0 connections due to 4028 seconds of inactivity.
[D 2023-02-23 18:28:13.908 EnterpriseGatewayApp] Clearing buffer for 8a3b508a-41bb-4ef4-a9ff-8470a7ed1659
[I 2023-02-23 18:28:13.908 EnterpriseGatewayApp] Kernel shutdown: 8a3b508a-41bb-4ef4-a9ff-8470a7ed1659
[D 2023-02-23 18:28:13.908 EnterpriseGatewayApp] ERROR: ECONNREFUSED, no process listening, cannot send signal.
[D 2023-02-23 18:28:13.909 EnterpriseGatewayApp] OSError(ENOTCONN) raised on socket shutdown, listener has likely already exited. Cannot send '{'shutdown': 1}'

I agree this should be fixed; it's just a matter of where. If this needs to be addressed in the culling logic (which, I would say, should be sensitive to HTTPError 404 and swallow that exception with a log message), then the issue would need to be transferred to jupyter-server.
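To make the culling-side option concrete, here is a minimal sketch of a cull loop that treats a 404 as "kernel already gone" instead of an error. It is not the jupyter-server implementation: the `HTTPError` class is a stand-in for `tornado.web.HTTPError` so the snippet runs without Tornado, and `cull_kernels`, `cull_kernel_if_idle`, and `fake_cull` here are hypothetical simplifications of the methods named in the traceback.

```python
import asyncio
import logging

log = logging.getLogger("culler_sketch")


class HTTPError(Exception):
    """Stand-in for tornado.web.HTTPError (assumption: only status_code matters here)."""

    def __init__(self, status_code, log_message=""):
        super().__init__(f"HTTP {status_code}: {log_message}")
        self.status_code = status_code


async def cull_kernels(kernel_ids, cull_kernel_if_idle):
    """Attempt to cull each kernel; log and skip kernels that no longer exist."""
    already_gone = []
    for kernel_id in list(kernel_ids):
        try:
            await cull_kernel_if_idle(kernel_id)
        except HTTPError as err:
            if err.status_code != 404:
                raise  # anything other than "not found" is still a real error
            log.warning("Kernel %s no longer exists; skipping cull.", kernel_id)
            already_gone.append(kernel_id)
    return already_gone


async def fake_cull(kernel_id):
    """Hypothetical cull callback that fails for one missing kernel."""
    if kernel_id == "missing":
        raise HTTPError(404, f"Kernel does not exist: {kernel_id}")


gone = asyncio.run(cull_kernels(["alive", "missing"], fake_cull))
```

Returning the list of missing ids lets the caller drop them from whatever collection drives the next culling pass, which is what would stop the repeated attempts.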

We could also tighten up RemoteKernelManager.remove_kernel() so that its call to super().remove_kernel() tolerates a missing kernel (the KeyError in the traceback above, which is then surfaced as the 404).
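A rough sketch of that second option, assuming the base class removes kernels with a plain `dict.pop` as the traceback shows. `BaseManager` and `TolerantManager` are hypothetical stand-ins, not the real `MultiKernelManager` or `RemoteKernelManager` classes.

```python
import logging

log = logging.getLogger("remove_kernel_sketch")


class BaseManager:
    """Hypothetical stand-in for the parent kernel manager."""

    def __init__(self):
        self._kernels = {}

    def remove_kernel(self, kernel_id):
        # Mirrors the line in the traceback: raises KeyError if the id is unknown.
        return self._kernels.pop(kernel_id)


class TolerantManager(BaseManager):
    """Hypothetical subclass whose remove_kernel() ignores unknown kernel ids."""

    def remove_kernel(self, kernel_id):
        try:
            return super().remove_kernel(kernel_id)
        except KeyError:
            log.debug("Kernel %s already removed; nothing to do.", kernel_id)
            return None


manager = TolerantManager()
manager._kernels["k1"] = "kernel-object"
first = manager.remove_kernel("k1")   # removes and returns the stored entry
second = manager.remove_kernel("k1")  # id is gone; returns None instead of raising
```

With this shape, a second shutdown attempt on the same id becomes a no-op rather than an exception that the culler keeps retrying.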

Were you going to look into this?