jupyter-server / enterprise_gateway

A lightweight, multi-tenant, scalable and secure gateway that enables Jupyter Notebooks to share resources across distributed clusters such as Apache Spark, Kubernetes and others.
https://jupyter-enterprise-gateway.readthedocs.io/en/latest/

The /api/kernels call shows the kernel in starting state even when the kernel is busy #1138

Open AmitJuneja25 opened 2 years ago

AmitJuneja25 commented 2 years ago

Description

We are using Enterprise Gateway to create kernels in Kubernetes pods running inside a GKE cluster. The /api/kernels call returns the list of all kernels and their states, as shown below: [image]

The execution_state field here shows the kernel's state as starting even when the kernel is actually busy. As a result, we cannot get the correct kernel state, which matters to us because we rely on that information for some important tasks.

This issue reproduces only rarely, but it causes us a lot of difficulty whenever it happens.

As a solution, we just want the /api/kernels call to return the correct status every time, and not report the execution state as starting when the kernel is actually busy.


kevin-bates commented 2 years ago

Hello @AmitJuneja25.

Kernel state "starting" is the initial state and is set by the KernelManager when it initiates the kernel's startup. All other state values are the result of the messages between the kernel and the EG server (or jupyter server when EG is not configured).

The example messages you display above all show the "idle" state, indicating the kernel has started and is waiting for further requests (and may have already processed prior requests).

> The execution_state field here shows the kernel's state as starting even when the kernel is actually busy. As a result, we cannot get the correct kernel state, which matters to us because we rely on that information for some important tasks.

One of the relatively recent changes made in both Notebook and JupyterServer (on which EG depends) is the kernel "nudging" logic which ensures the kernel startup has completed and the kernel is ready to receive requests prior to exposing the kernel to the end-user. As a result, I would check the version of Notebook you are running (EG 2.5.2 should be based on Notebook rather than JupyterServer) and would recommend updating the Notebook package to the latest version of the 6.x release to see if this helps.

Relying on the status messages can be a little risky, particularly if you're deriving a state machine from it. For example, some kernels are asynchronous and can return an "idle" status prior to the execution results, and some applications may deem the "idle" response as the end of that interaction when, really, the kernel's async behavior happened to issue the responses "out of order". The Apache Toree Scala kernel is one example.
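
As an illustration of the caution above, here is a minimal Python sketch (not taken from EG or any Jupyter library) of one way a client might decide that an execution request has finished. It requires both the `execute_reply` and the `idle` status whose `parent_header.msg_id` matches the request, in either order. The function name `execution_finished` and the assumption that messages arrive as already-decoded dicts are hypothetical. Even this check does not guarantee that every output from an asynchronous kernel has arrived.

```python
from typing import Any, Iterable, Mapping


def execution_finished(messages: Iterable[Mapping[str, Any]], request_id: str) -> bool:
    """Return True once both the execute_reply and the idle status that
    answer `request_id` have been seen, regardless of arrival order."""
    saw_reply = False
    saw_idle = False
    for msg in messages:
        # Ignore traffic that belongs to other requests (or to no request).
        if msg.get("parent_header", {}).get("msg_id") != request_id:
            continue
        # Websocket messages usually carry msg_type at the top level;
        # fall back to the header if it is missing.
        msg_type = msg.get("msg_type") or msg.get("header", {}).get("msg_type")
        if msg_type == "execute_reply":
            saw_reply = True
        elif msg_type == "status" and msg.get("content", {}).get("execution_state") == "idle":
            saw_idle = True
    return saw_reply and saw_idle
```

A client would call this on the messages collected so far after each new message, and keep reading while it returns False.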

kevin-bates commented 1 year ago

@AmitJuneja25 - at this point, there's nothing to look at here. Do you know what version of the underlying notebook package you're using? The output of jupyter --version may prove helpful. Thanks.

AmitJuneja25 commented 1 year ago

Hello @kevin-bates, we are using the following package versions:

jupyter core : 4.7.1
jupyter-notebook : 6.4.11
qtconsole : not installed
ipython : 7.23.1
ipykernel : 5.5.4
jupyter client : 6.1.12
jupyter lab : 3.0.14
nbconvert : 6.5.0
ipywidgets : 7.6.3
nbformat : 5.4.0
traitlets : 5.2.2
notebook : 6.4.11

kevin-bates commented 1 year ago

Thanks for the version info. Again, there isn't much to go on here although it looks like you're using relatively up-to-date versions of things.

Since EG doesn't do anything but report the status, just like notebook and jupyter server, I'm not sure where to even begin looking other than perhaps at the kernel responses themselves.

debashis1982 commented 1 year ago

I don't want to hijack this issue, but I am seeing something similar to the original issue. We are using Kubernetes to create kernels. I created a new kernel with a POST call to /api/kernels with a payload that looks like this:

{
    "kernel": {
        "name": "py_3.7"
    }
}

I get a response back with the new kernel id. But when I call GET /api/kernels/my-kernel-id, the execution_state is stuck at "starting" forever:

{
    "id": "my-kernel-id",
    "name": "py_3.7",
    "last_activity": "2022-10-25T13:35:34.275974Z",
    "execution_state": "starting",
    "connections": 0
}

When I check the logs of the kernel pod, it looks like the pod started fine.

kevin-bates commented 1 year ago

Hi @debashis1982 - no worries regarding your "hijack" comment - it's good to have other data points. I have some questions that I'm hoping you can answer.

  1. When you look at the logs you say it looks like the pod started fine. Are you able to send cells (or source code) to the kernel for execution?
  2. Can you please provide the log output - both prior to and after the POST request?
  3. What are you using to communicate with the kernel? EG provides a GatewayClient class that can be used to send code. It also illustrates how to access the websocket - necessary for communicating with the kernel.
  4. What release are you using? (This will be evident within the first page of logging output from the EG pod.)
  5. I would expect the kernel status to transition to busy and then idle when the nudging logic occurs (depending on what release you're using). If an older release, then I suspect the kernel will remain in the "starting" state until the websocket is created.
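
To make point 5 concrete, the following Python sketch shows what a client would need before it can talk to the kernel over the websocket: the channels URL and a `kernel_info_request` message in the Jupyter messaging format. The helper names, the `5.3` protocol version value, and the empty username are assumptions for illustration. Actually opening the socket requires a websocket client library, which is not shown here.

```python
import uuid
from datetime import datetime, timezone
from urllib.parse import urlsplit, urlunsplit


def channels_url(gateway_url: str, kernel_id: str) -> str:
    """Build the websocket URL for a kernel's channels endpoint."""
    parts = urlsplit(gateway_url)
    scheme = "wss" if parts.scheme == "https" else "ws"
    path = parts.path.rstrip("/") + f"/api/kernels/{kernel_id}/channels"
    return urlunsplit((scheme, parts.netloc, path, "", ""))


def kernel_info_request(session_id: str) -> dict:
    """Build a kernel_info_request for the shell channel.
    The kernel answers it with busy/idle status messages and a
    kernel_info_reply."""
    return {
        "header": {
            "msg_id": uuid.uuid4().hex,
            "username": "",           # assumption: no username needed
            "session": session_id,
            "msg_type": "kernel_info_request",
            "version": "5.3",         # assumption: protocol version
            "date": datetime.now(timezone.utc).isoformat(),
        },
        "parent_header": {},
        "metadata": {},
        "content": {},
        "channel": "shell",
    }
```

The message would be serialized with `json.dumps` and sent over the socket opened at `channels_url(...)`.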

debashis1982 commented 1 year ago

Thanks for your reply, @kevin-bates. For logs, I just looked at the kernel pod or container logs and they seemed fine. The gateway logs for a kernel (id dd93ffd6-832b-40df-856f-c1cd5b127112) look like this:

[D 2022-10-25 17:21:18.564 EnterpriseGatewayApp] 6: Waiting to connect to k8s pod in namespace 'namespace'. Name: 'dd93ffd6-832b-40df-856f-c1cd5b127112', Status: 'Running', Pod IP: '192.168.1.1', KernelID: 'dd93ffd6-832b-40df-856f-c1cd5b127112'
[D 2022-10-25 17:21:18.582 EnterpriseGatewayApp] Waiting for KernelID 'dd93ffd6-832b-40df-856f-c1cd5b127112' to send connection info from host 'dd93ffd6-832b-40df-856f-c1cd5b127112' - retrying..
[D 2022-10-25 17:21:19.108 EnterpriseGatewayApp] 7: Waiting to connect to k8s pod in namespace 'namespace'. Name: 'dd93ffd6-832b-40df-856f-c1cd5b127112', Status: 'Running', Pod IP: '192.168.1.1', KernelID: 'dd93ffd6-832b-40df-856f-c1cd5b127112'
[D 2022-10-25 17:21:19.125 EnterpriseGatewayApp] Waiting for KernelID 'dd93ffd6-832b-40df-856f-c1cd5b127112' to send connection info from host 'dd93ffd6-832b-40df-856f-c1cd5b127112' - retrying..
[D 2022-10-25 17:21:19.651 EnterpriseGatewayApp] 8: Waiting to connect to k8s pod in namespace 'namespace'. Name: 'dd93ffd6-832b-40df-856f-c1cd5b127112', Status: 'Running', Pod IP: '192.168.1.1', KernelID: 'dd93ffd6-832b-40df-856f-c1cd5b127112'
[D 2022-10-25 17:21:19.858 EnterpriseGatewayApp] Received payload 'XXXXXXXXXXXX='
[D 2022-10-25 17:21:19.859 EnterpriseGatewayApp] Version 1 payload received.
[D 2022-10-25 17:21:19.861 EnterpriseGatewayApp] Decrypted payload '{'shell_port': 50625, 'iopub_port': 53591, 'stdin_port': 54635, 'control_port': 54337, 'hb_port': 54335, 'ip': '0.0.0.0', 'key': '8bb0a43b-d041-43a4-b457-6f9d6c9406bd', 'transport': 'tcp', 'signature_scheme': 'hmac-sha256', 'kernel_name': '', 'pid': 15, 'pgid': 12, 'comm_port': 40115, 'kernel_id': 'dd93ffd6-832b-40df-856f-c1cd5b127112'}'
[D 2022-10-25 17:21:19.861 EnterpriseGatewayApp] Connection info received for kernel 'dd93ffd6-832b-40df-856f-c1cd5b127112': {'shell_port': 50625, 'iopub_port': 53591, 'stdin_port': 54635, 'control_port': 54337, 'hb_port': 54335, 'ip': '0.0.0.0', 'key': '8bb0a43b-d041-43a4-b457-6f9d6c9406bd', 'transport': 'tcp', 'signature_scheme': 'hmac-sha256', 'kernel_name': '', 'pid': 15, 'pgid': 12, 'comm_port': 40115, 'kernel_id': 'dd93ffd6-832b-40df-856f-c1cd5b127112'}
[D 2022-10-25 17:21:20.205 EnterpriseGatewayApp] 9: Waiting to connect to k8s pod in namespace 'namespace'. Name: 'dd93ffd6-832b-40df-856f-c1cd5b127112', Status: 'Running', Pod IP: '192.168.1.1', KernelID: 'dd93ffd6-832b-40df-856f-c1cd5b127112'
[D 2022-10-25 17:21:20.216 EnterpriseGatewayApp] Host assigned to the kernel is: 'dd93ffd6-832b-40df-856f-c1cd5b127112' '192.168.1.1'
[D 2022-10-25 17:21:20.217 EnterpriseGatewayApp] Established gateway communication to: 192.168.1.1:40115 for KernelID 'dd93ffd6-832b-40df-856f-c1cd5b127112'
[D 2022-10-25 17:21:20.220 EnterpriseGatewayApp] Received connection info for KernelID 'dd93ffd6-832b-40df-856f-c1cd5b127112' from host 'dd93ffd6-832b-40df-856f-c1cd5b127112': {'shell_port': 50625, 'iopub_port': 53591, 'stdin_port': 54635, 'control_port': 54337, 'hb_port': 54335, 'ip': '192.168.1.1', 'key': '8bb0a43b-d041-43a4-b457-6f9d6c9406bd', 'transport': 'tcp', 'signature_scheme': 'hmac-sha256', 'kernel_name': '', 'comm_port': 40115, 'kernel_id': 'dd93ffd6-832b-40df-856f-c1cd5b127112'}...
[D 2022-10-25 17:21:20.221 EnterpriseGatewayApp] Connecting to: tcp://192.168.1.1:54337
[D 2022-10-25 17:21:20.232 EnterpriseGatewayApp] Connecting to: tcp://192.168.1.1:53591
[I 2022-10-25 17:21:20.239 EnterpriseGatewayApp] Kernel started: dd93ffd6-832b-40df-856f-c1cd5b127112

I am basically trying to write a JavaScript script that:

  1. Creates a kernel
  2. Polls for kernel status to ensure it started
  3. Sends code for execution
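
Step 2 above could be sketched as a polling helper with a timeout, so that a kernel that never leaves `starting` surfaces as an error instead of an endless loop. It is shown in Python for brevity. `wait_for_state` and its parameters are hypothetical names, and `fetch_state` stands in for whatever performs GET /api/kernels/<kernel-id> and returns the `execution_state` field.

```python
import time
from typing import Callable, Iterable


def wait_for_state(
    fetch_state: Callable[[], str],
    wanted: Iterable[str] = ("idle",),
    timeout: float = 60.0,
    interval: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> str:
    """Poll fetch_state() until it returns one of `wanted`.
    Raises TimeoutError if that does not happen within `timeout` seconds."""
    wanted = tuple(wanted)
    deadline = clock() + timeout
    while True:
        state = fetch_state()
        if state in wanted:
            return state
        if clock() >= deadline:
            raise TimeoutError(
                f"kernel still in state {state!r} after {timeout} seconds"
            )
        sleep(interval)
```

The `sleep` and `clock` parameters exist only so the helper can be exercised without real delays.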

My intention is to use that script for load testing. The version of EG that I am using seems to be Jupyter Enterprise Gateway 3.0.0.dev0, as shown in the log:

[I 2022-10-24 17:55:19.109 EnterpriseGatewayApp] Jupyter Enterprise Gateway 3.0.0.dev0 is available at http://0.0.0.0:8888

Now if I call GET /api/kernels/dd93ffd6-832b-40df-856f-c1cd5b127112, the response I get is:

{
    "id": "dd93ffd6-832b-40df-856f-c1cd5b127112",
    "name": "py_3.7",
    "last_activity": "2022-10-25T17:21:21.297700Z",
    "execution_state": "starting",
    "connections": 0
}

"execution_state": "starting" stays that way forever.