jupyter-server / enterprise_gateway

A lightweight, multi-tenant, scalable and secure gateway that enables Jupyter Notebooks to share resources across distributed clusters such as Apache Spark, Kubernetes and others.
https://jupyter-enterprise-gateway.readthedocs.io/en/latest/

Interrupting or deleting a Kubernetes-based remote kernel crashes the Enterprise Gateway server #1136

Closed chiawchen closed 2 years ago

chiawchen commented 2 years ago

Description

Thanks for building this project. It's really helpful and easy to extend to fit our in-house Kubernetes infrastructure: we simply provide a customized launcher script and process proxy. Everything works out of the box (we can launch a pod on our in-house Kubernetes and send commands to the remote kernel running in another pod) except for one thing: interrupting or deleting the kernel.

We observed a crash on the Enterprise Gateway server side whenever an interrupt or delete request comes from the client. It only happens with Kubernetes-based remote kernels; a kernel launched locally within Enterprise Gateway (i.e. ipykernel_launcher) can be interrupted or deleted successfully without crashing.

Reproduce

I'm not sure whether this is reproducible on native Kubernetes, but I'll try to provide as many details as possible.

  1. Start jupyter enterprise-gateway
    root@enterprise-gateway-debug:~# jupyter enterprisegateway --log-level=0
    [D 2022-07-26 21:59:08.335 EnterpriseGatewayApp] Searching ['/root', '/root/.jupyter', '/root/.local/etc/jupyter', '/usr/etc/jupyter', '/usr/local/etc/jupyter', '/etc/jupyter'] for config files
    [D 2022-07-26 21:59:08.335 EnterpriseGatewayApp] Looking for jupyter_config in /etc/jupyter
    [D 2022-07-26 21:59:08.335 EnterpriseGatewayApp] Looking for jupyter_config in /usr/local/etc/jupyter
    [D 2022-07-26 21:59:08.335 EnterpriseGatewayApp] Looking for jupyter_config in /usr/etc/jupyter
    [D 2022-07-26 21:59:08.335 EnterpriseGatewayApp] Looking for jupyter_config in /root/.local/etc/jupyter
    [D 2022-07-26 21:59:08.335 EnterpriseGatewayApp] Looking for jupyter_config in /root/.jupyter
    [D 2022-07-26 21:59:08.335 EnterpriseGatewayApp] Looking for jupyter_config in /root
    [D 2022-07-26 21:59:08.336 EnterpriseGatewayApp] Looking for jupyter_enterprise_gateway_config in /etc/jupyter
    [D 2022-07-26 21:59:08.336 EnterpriseGatewayApp] Looking for jupyter_enterprise_gateway_config in /usr/local/etc/jupyter
    [D 2022-07-26 21:59:08.336 EnterpriseGatewayApp] Looking for jupyter_enterprise_gateway_config in /usr/etc/jupyter
    [D 2022-07-26 21:59:08.336 EnterpriseGatewayApp] Looking for jupyter_enterprise_gateway_config in /root/.local/etc/jupyter
    [D 2022-07-26 21:59:08.336 EnterpriseGatewayApp] Looking for jupyter_enterprise_gateway_config in /root/.jupyter
    [D 2022-07-26 21:59:08.337 EnterpriseGatewayApp] Loaded config file: /root/.jupyter/jupyter_enterprise_gateway_config.py
    [D 2022-07-26 21:59:08.337 EnterpriseGatewayApp] Looking for jupyter_enterprise_gateway_config in /root
    [D 220726 21:59:08 selector_events:53] Using selector: EpollSelector
    [I 2022-07-26 21:59:08.343 EnterpriseGatewayApp] Jupyter Enterprise Gateway 2.6.0 is available at http://0.0.0.0:6006

    and the corresponding environment variables:

        - name: "EG_IP"
          value: "0.0.0.0"
        - name: "EG_PORT"
          value: "6006"
        - name: "EG_RESPONSE_PORT"
          value: "8877"
        - name: "EG_NAMESPACE"
          value: "jupyter-eval"
        - name: "EG_CULL_IDLE_TIMEOUT"
          value: "3600"
        - name: "EG_LOG_LEVEL"
          value: "DEBUG"
        - name: "EG_UNAUTHORIZED_USERS"
          value: ""
        - name: "EG_KERNEL_LAUNCH_TIMEOUT"
          value: "600"
        - name: "EG_CONNECT_TIMEOUT"
          value: "600"
        - name: "EG_REQUEST_TIMEOUT"
          value: "600"
        - name: "EG_JUPYTER_GATEWAY_CONNECT_TIMEOUT"
          value: "600"
  2. Start up a remote kernel backed by kubernetes
  3. Once it's running, click interrupt or delete the session
  4. The server-side log shows a crash and the process is killed, but the remote kernel hosted in another Kubernetes pod keeps running and never goes through the process proxy's kill() properly

Expected behavior

The kernel should be deleted successfully, and Enterprise Gateway shouldn't crash when one of the remote kernels is asked to interrupt or delete. Failing that, the server should at least emit a traceback or debug logging for the interrupt / delete so we can debug further. The issue may be in our customized process proxy, but in this case we cannot see any log from it.

Context

Command Line Output
[I 220726 21:50:25 web:2275] 200 GET /api/kernels/9d78c60e-b6e0-4481-b98f-e84a1de75f7e (127.0.0.1) 0.60ms
[D 2022-07-26 21:50:25.830 EnterpriseGatewayApp] Clearing buffer for 9d78c60e-b6e0-4481-b98f-e84a1de75f7e
[I 2022-07-26 21:50:25.830 EnterpriseGatewayApp] Kernel shutdown: 9d78c60e-b6e0-4481-b98f-e84a1de75f7e
[I 2022-07-26 21:50:25.830 EnterpriseGatewayApp] Interrupted...
[I 2022-07-26 21:50:25.830 EnterpriseGatewayApp] Jupyter Enterprise Gateway is shutting down all running kernels
Traceback (most recent call last):
  File "/usr/local/bin/jupyter-enterprisegateway", line 8, in <module>
    sys.exit(launch_instance())
  File "/usr/local/lib/python3.7/dist-packages/jupyter_core/application.py", line 269, in launch_instance
    return super().launch_instance(argv=argv, **kwargs)
  File "/usr/local/lib/python3.7/dist-packages/traitlets/config/application.py", line 976, in launch_instance
    app.start()
  File "/usr/local/lib/python3.7/dist-packages/enterprise_gateway/enterprisegatewayapp.py", line 343, in start
    self.shutdown()
  File "/usr/local/lib/python3.7/dist-packages/enterprise_gateway/enterprisegatewayapp.py", line 352, in shutdown
    self.kernel_manager.shutdown_kernel(kid, now=True)
  File "/usr/lib/python3.7/asyncio/base_events.py", line 566, in run_until_complete
    self.run_forever()
  File "/usr/lib/python3.7/asyncio/base_events.py", line 534, in run_forever
    self._run_once()
  File "/usr/lib/python3.7/asyncio/base_events.py", line 1771, in _run_once
    handle._run()
  File "/usr/lib/python3.7/asyncio/events.py", line 88, in _run
    self._context.run(self._callback, *self._args)
  File "/usr/local/lib/python3.7/dist-packages/tornado/web.py", line 2361, in <lambda>
    fut.add_done_callback(lambda f: f.result())
  File "/usr/local/lib/python3.7/dist-packages/enterprise_gateway/enterprisegatewayapp.py", line 337, in start
    self.io_loop.start()
  File "/usr/local/lib/python3.7/dist-packages/tornado/platform/asyncio.py", line 215, in start
    self.asyncio_loop.run_forever()
  File "/usr/lib/python3.7/asyncio/base_events.py", line 534, in run_forever
    self._run_once()
  File "/usr/lib/python3.7/asyncio/base_events.py", line 1771, in _run_once
    handle._run()
  File "/usr/lib/python3.7/asyncio/events.py", line 88, in _run
    self._context.run(self._callback, *self._args)
  File "/usr/local/lib/python3.7/dist-packages/tornado/web.py", line 1713, in _execute
    result = await result
  File "/usr/local/lib/python3.7/dist-packages/jupyter_server/services/kernels/handlers.py", line 78, in delete
    await ensure_async(km.shutdown_kernel(kernel_id))
  File "/usr/local/lib/python3.7/dist-packages/jupyter_server/utils.py", line 182, in ensure_async
    result = await obj
  File "/usr/local/lib/python3.7/dist-packages/enterprise_gateway/services/kernels/remotemanager.py", line 217, in shutdown_kernel
    await super().shutdown_kernel(kernel_id, now, restart)
  File "/usr/local/lib/python3.7/dist-packages/jupyter_server/services/kernels/kernelmanager.py", line 669, in shutdown_kernel
    self, kernel_id, now=now, restart=restart
  File "/usr/local/lib/python3.7/dist-packages/jupyter_client/multikernelmanager.py", line 252, in _async_shutdown_kernel
    await ensure_async(km.shutdown_kernel(now, restart))
  File "/usr/local/lib/python3.7/dist-packages/jupyter_client/utils.py", line 33, in ensure_async
    return await obj
  File "/usr/local/lib/python3.7/dist-packages/jupyter_client/manager.py", line 469, in _async_shutdown_kernel
    await ensure_async(self.interrupt_kernel())
  File "/usr/local/lib/python3.7/dist-packages/jupyter_client/utils.py", line 33, in ensure_async
    return await obj
  File "/usr/local/lib/python3.7/dist-packages/jupyter_client/manager.py", line 640, in _async_interrupt_kernel
    await self._async_signal_kernel(signal.SIGINT)
  File "/usr/local/lib/python3.7/dist-packages/jupyter_client/manager.py", line 663, in _async_signal_kernel
    os.killpg(pgid, signum)  # type: ignore
KeyboardInterrupt

Browser Output
[E 2022-07-26 21:50:26.463 ServerApp] {
      "Host": "0.0.0.0:8080",
      "Accept": "*/*",
      "Referer": "http://0.0.0.0:8080/lab/tree/demo-notebook.ipynb",
      "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/103.0.0.0 Safari/537.36"
    }
[E 2022-07-26 21:50:26.463 ServerApp] 503 DELETE /api/sessions/4649c802-5999-4299-aebf-a503d89817b5?1658872225719 (172.16.65.108) 677.20ms referer=http://0.0.0.0:8080/lab/tree/demo-notebook.ipynb

kevin-bates commented 2 years ago

Hi @chiawchen - thank you for the kind words - we really appreciate hearing this! So you actually created your own process proxy? That is really great.

I'm pretty sure I know what's going on here, and I actually voiced this concern during the review of the changes (sigh). Because of the aliasing approach, subclasses that override local methods will not have those overrides called, because the renaming does not apply to calls made within the existing methods. This prevents the signal call from reaching the RemoteKernelManager, which forwards it to the process proxy. As a result, the process group associated with the EG instance is getting killed rather than that of the process managed by the process proxy.
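
To make the aliasing problem concrete, here is a minimal, self-contained sketch. The names BaseManager, RemoteManager and _signal_impl are hypothetical stand-ins, not the actual jupyter_client or EG code. It only illustrates how a base class that calls its private implementation directly bypasses a subclass override of the public alias.

```python
class BaseManager:
    """Hypothetical stand-in for the aliased base kernel manager."""

    def _signal_impl(self, signum):
        # In the traceback above, this is roughly where os.killpg() is reached.
        return f"base: signal {signum} sent to the local process group"

    # The public name is only an alias for the private implementation.
    signal_kernel = _signal_impl

    def interrupt_kernel(self):
        # Calls the private name, so overrides of signal_kernel are never consulted.
        return self._signal_impl(2)


class RemoteManager(BaseManager):
    """Hypothetical stand-in for a subclass that forwards to a process proxy."""

    def signal_kernel(self, signum):
        return f"remote: signal {signum} forwarded to the process proxy"


manager = RemoteManager()
direct = manager.signal_kernel(2)      # the override is used
via_base = manager.interrupt_kernel()  # the override is bypassed
print(direct)
print(via_base)
```

Calling the public method directly reaches the override, but the interrupt path inside the base class does not, which matches the symptom of the signal landing on the local process group.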

Given this, I'm hoping you can do two things prior to us moving forward...

  1. Confirm you're using jupyter_client >= 6.1.13. If so, continue to the next step.
  2. Try installing jupyter_client == 6.1.12 where this aliasing does not occur.
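
A quick way to do the first check from inside the EG environment might look like the sketch below. The helper names parse_version and uses_aliasing are made up for illustration, the 6.1.13 threshold is simply the one stated above, and importlib.metadata assumes Python 3.8 or newer (on 3.7, pip show jupyter_client gives the same information).

```python
def parse_version(text):
    """Turn '6.1.12' into (6, 1, 12); anything after the leading digits is ignored."""
    parts = []
    for piece in text.split("."):
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        if not digits:
            break
        parts.append(int(digits))
    return tuple(parts)


def uses_aliasing(version_text):
    # Threshold taken from this thread: releases >= 6.1.13 use the aliasing approach.
    return parse_version(version_text) >= (6, 1, 13)


if __name__ == "__main__":
    from importlib.metadata import PackageNotFoundError, version

    try:
        installed = version("jupyter_client")
        print(installed, "-> aliasing release:", uses_aliasing(installed))
    except PackageNotFoundError:
        print("jupyter_client is not installed in this environment")
```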

Looking at the elyra/enterprise-gateway:2.6.0 image, I see that it includes jupyter_client == 6.1.12. Do you happen to modify that image in any way - perhaps upgrade dependencies for example?

What I'm still wondering though is why we haven't heard about this yet and I can only surmise that folks happen to be running with 6.1.12.

On a side note: I'm not sure if you've taken a look at the kernel provisioners in jupyter_client 7.0 or not, but the integration is much better since provisioners (which originated from process-proxies) are first-class objects. Unfortunately, they are not compatible with process proxies so EG can't leverage them until its 4.0 release when we move to use them and the process proxies will go away. Converting a process proxy to a provisioner is fairly straightforward. If you are interested, some of the EG process proxies exist as provisioners here - since I needed to use them as proof of concept when implementing kernel provisioning. We plan to promote these to full-fledged provisioners once we have time.

chiawchen commented 2 years ago

Thanks @kevin-bates. I've verified that the EG runtime is using jupyter-client==6.2.0, and once I downgraded to 6.1.12 as you suggested, the problem went away like magic! Thanks so much for the help!

As for why we weren't using jupyter-client==6.1.12: we build our own Docker image with our business logic and simply install jupyter-enterprise-gateway in the Dockerfile. That gives us more freedom over customized source code and dependency management, since we are building our own process proxy and launcher to fit our in-house Kubernetes.

kevin-bates commented 2 years ago

@chiawchen - thank you for the quick update. jupyter_client==6.2.0 was yanked from PyPI, I believe for the same reasons I complained about 6.1.13 (but didn't realize that it too was yanked).

Since pip install "jupyter_client~=6.1" installs 6.1.12, and this is what our dependencies state, I'm not sure there's much we can do...

$ pip install "jupyter_client~=6.1"
Collecting jupyter_client~=6.1
  Using cached jupyter_client-6.1.12-py3-none-any.whl (112 kB)
Requirement already satisfied: pyzmq>=13 in /opt/miniconda3/envs/eg-dev/lib/python3.9/site-packages (from jupyter_client~=6.1) (23.2.0)
Requirement already satisfied: tornado>=4.1 in /opt/miniconda3/envs/eg-dev/lib/python3.9/site-packages (from jupyter_client~=6.1) (6.1)
Requirement already satisfied: python-dateutil>=2.1 in /opt/miniconda3/envs/eg-dev/lib/python3.9/site-packages (from jupyter_client~=6.1) (2.8.2)
Requirement already satisfied: traitlets in /opt/miniconda3/envs/eg-dev/lib/python3.9/site-packages (from jupyter_client~=6.1) (5.3.0)
Requirement already satisfied: jupyter-core>=4.6.0 in /opt/miniconda3/envs/eg-dev/lib/python3.9/site-packages (from jupyter_client~=6.1) (4.10.0)
Requirement already satisfied: six>=1.5 in /opt/miniconda3/envs/eg-dev/lib/python3.9/site-packages (from python-dateutil>=2.1->jupyter_client~=6.1) (1.16.0)
Installing collected packages: jupyter_client
Successfully installed jupyter_client-6.1.12

I suspect your jupyter_client is getting installed prior to EG. We could pin to literally <=6.1.12, although that would merely produce a warning but continue with the newer (yanked) version.
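
For anyone wondering why a pre-installed 6.2.0 is left in place: a compatible-release specifier such as ~=6.1 means ">= 6.1 and == 6.*", so 6.2.0 already satisfies it and pip has no reason to replace it. The sketch below is a hand-rolled approximation for plain numeric versions only; the function names are hypothetical and it is not how pip itself evaluates specifiers.

```python
def to_tuple(version):
    """Convert a plain numeric version like '6.1.12' to (6, 1, 12)."""
    return tuple(int(part) for part in version.split("."))


def satisfies_compatible_release(installed, base):
    """Approximate `installed ~= base` for plain numeric versions."""
    inst, floor = to_tuple(installed), to_tuple(base)
    prefix = floor[:-1]  # ~=6.1 keeps the (6,) prefix; ~=6.1.12 keeps (6, 1)
    return inst >= floor and inst[:len(prefix)] == prefix


for candidate in ("6.1.12", "6.2.0", "7.3.4"):
    print(candidate, satisfies_compatible_release(candidate, "6.1"))
```

Under this reading, both 6.1.12 and 6.2.0 satisfy ~=6.1, while 7.3.4 does not. A tighter specifier such as ~=6.1.12 would exclude 6.2.0.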

chiawchen commented 2 years ago

Yeah, a higher version of jupyter_client was installed before EG. I've pinned the version to 6.1.12 for now. Thanks again for the detailed explanation!

kevin-bates commented 2 years ago

Thanks @chiawchen. I'm curious how you went about installing the higher version of jupyter_client. Given that pip install jupyter_client will install the latest release, 7.3.4, which EG cannot support, are you explicitly installing a version of 6 greater than 6.1.12? How did you land on jupyter_client==6.2.0?

chiawchen commented 2 years ago

We have several requirements.txt files containing tons of libraries used by other code, so it's hard to tell where the dependency comes from, and all of those installs happen before I install jupyter-enterprise-gateway.
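
One way to narrow this down might be to ask the environment itself which installed distributions declare a requirement on jupyter_client. The sketch below uses the standard importlib.metadata module (Python 3.8+); the helper names requirement_name and dependents_of are invented for illustration, and it only sees declared requirements, not explicit pins inside a requirements.txt.

```python
import re
from importlib import metadata


def _normalize(name):
    # Treat '-', '_' and '.' as equivalent, as package indexes do.
    return re.sub(r"[-_.]+", "-", name).lower()


def requirement_name(requirement):
    """Extract the project name from a requirement string like 'foo-bar (>=1.0)'."""
    match = re.match(r"\s*([A-Za-z0-9][A-Za-z0-9._-]*)", requirement)
    return _normalize(match.group(1)) if match else ""


def dependents_of(target):
    """List (distribution, requirement) pairs that declare a dependency on target."""
    wanted = _normalize(target)
    found = []
    for dist in metadata.distributions():
        for requirement in dist.requires or []:
            if requirement_name(requirement) == wanted:
                found.append((dist.metadata.get("Name", "") or "", requirement))
    return sorted(found)


if __name__ == "__main__":
    for name, requirement in dependents_of("jupyter_client"):
        print(f"{name} requires {requirement}")
```

Any entry whose requirement allows 6.2.0 is a candidate for having pulled it in before EG was installed.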

kevin-bates commented 2 years ago

Since we've specified our dependency on jupyter_client and given yanked versions are not implicitly installable, I'm not certain there's much we can do here other than ensure folks are using the supported version of jupyter_client - closing.