jupyter-server / enterprise_gateway

A lightweight, multi-tenant, scalable and secure gateway that enables Jupyter Notebooks to share resources across distributed clusters such as Apache Spark, Kubernetes and others.
https://jupyter-enterprise-gateway.readthedocs.io/en/latest/

Jupyter lab error starting kernel #1351

Closed OrenZ1 closed 6 months ago

OrenZ1 commented 6 months ago

Description

Hello! I am using Jupyter Enterprise Gateway 3.2.2 with JupyterHub 2.1.1 and JupyterLab 3.6.3, and I am running into a strange problem when starting a kernel on a Kubernetes cluster. The images I use for the kernels are heavy, so pulling them onto the kernel pods takes a long time. I have therefore set `--GatewayClient.request_timeout` on the JupyterLab side to 5 minutes, and I have also modified the OpenShift route configuration to use a larger timeout than the default 30 seconds.

When I try to launch a new kernel, after approximately 2 minutes I get the following error in JupyterLab: "Error Starting Kernel. Invalid response: 503 Service Temporarily Unavailable". This automatically changes the selected kernel in the Lab to "no kernel". There are no additional error messages or error/warning logs in JupyterLab, the Enterprise Gateway, or JupyterHub, even though all three are set to the DEBUG log level. After some additional time, once the kernel pod actually starts, I can select it from the kernel tab under "Use Kernel from Other Session".

I am looking for a way to avoid this error message, and I want to understand where it is coming from.

Note: I have also tried configuring other timeout settings on the JupyterLab side, such as `--GatewayClient.connect_timeout`, `--GatewayClient.response_timeout`, and `--GatewayClient.gateway_retry_interval_max`. None of these changed the behavior described above.
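For reference, a minimal sketch of how these gateway client timeouts can be set in a `jupyter_server_config.py`; the service URL below is a hypothetical in-cluster address, not the reporter's actual one:

```python
# jupyter_server_config.py -- a sketch, not a verified config for this deployment.
# GatewayClient is the jupyter_server trait set that JupyterLab uses to reach
# Enterprise Gateway; the URL below is a placeholder in-cluster service address.
c.GatewayClient.url = "http://enterprise-gateway.gateway-ns.svc.cluster.local:8888"
c.GatewayClient.request_timeout = 300.0            # wait up to 5 minutes for kernel startup requests
c.GatewayClient.connect_timeout = 300.0            # wait up to 5 minutes when first connecting
c.GatewayClient.gateway_retry_interval_max = 30.0  # cap the retry back-off interval
```

The same values can also be passed on the command line, e.g. `--GatewayClient.request_timeout=300`.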

Reproduce

  1. Deploy Jupyter Enterprise Gateway and JupyterHub on OpenShift.
  2. Configure JupyterHub to spawn JupyterLab instances in separate pods that connect to the Enterprise Gateway (see the sketch after this list).
  3. Try to launch a remote kernel in another pod whose image takes more than 2-3 minutes to pull and start.
  4. See the error "Error Starting Kernel. Invalid response: 503 Service Temporarily Unavailable".
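A rough sketch of what step 2 can look like, assuming KubeSpawner and the `JUPYTER_GATEWAY_*` environment variables that the gateway client reads at startup; the service name, namespace, and values are placeholders, not the reporter's actual configuration:

```python
# jupyterhub_config.py -- sketch only; adjust names and namespace to your cluster.
c.JupyterHub.spawner_class = "kubespawner.KubeSpawner"

# Each spawned Lab pod picks these up through the GatewayClient environment defaults.
c.KubeSpawner.environment = {
    "JUPYTER_GATEWAY_URL": "http://enterprise-gateway.gateway-ns.svc.cluster.local:8888",
    "JUPYTER_GATEWAY_REQUEST_TIMEOUT": "300",
}
```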

Expected behavior

I would like this message not to appear when a larger timeout is set, and ideally a loading indicator would be shown for kernels that are not yet ready.


welcome[bot] commented 6 months ago

Thank you for opening your first issue in this project! Engagement like this is essential for open source projects! :hugs:
If you haven't done so already, check out Jupyter's Code of Conduct. Also, please try to follow the issue template as it helps other community members to contribute more effectively. You can meet the other Jovyans by joining our Discourse forum. There is also an intro thread there where you can stop by and say Hi! :wave:
Welcome to the Jupyter community! :tada:

kevin-bates commented 6 months ago

Hi @OrenZ1. Is there a reason why you don't deploy the Kernel Image Puller daemonset? This will read the kernel specs and pull images embedded in the specs. Otherwise, you should pre-pull your images manually.

OrenZ1 commented 6 months ago

Hi, thank you for your response! We don't want to pre-pull images because of the extremely large number of images involved (we also allow users to add additional kernel images whenever they wish). We actually don't mind the long time it takes for a kernel to launch; that is only natural given the constraint I previously mentioned and the size of the images, but we would like to avoid showing this error message if possible.

On another note, we suspect the issue might be related to the JupyterHub integration with the Lab and the Enterprise Gateway. When we deploy an independent Lab pod with the same CMD and the same configuration (including for the Gateway), we do not get this error.

lresende commented 6 months ago

Are you using Kubernetes? This might be related to ingress timeouts while waiting for the kernel to get started.

OrenZ1 commented 6 months ago

I am using OpenShift (and Kubernetes), but I am not using ingresses; I only create services for the Enterprise Gateway and the JupyterHub (plus a route for the JupyterHub itself).

lresende commented 6 months ago

What are you using for the gateway URL? The service URL?

OrenZ1 commented 6 months ago

Yes, the service is of type ClusterIP, and I use the URL composed of the service name, the namespace, and svc.cluster.local.
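In other words, something along these lines (a sketch; the service name, namespace, and port are placeholders):

```python
# Kubernetes in-cluster DNS form: <service>.<namespace>.svc.cluster.local
service = "enterprise-gateway"   # placeholder ClusterIP service name
namespace = "gateway-ns"         # placeholder namespace
port = 8888                      # Enterprise Gateway's default port

gateway_url = f"http://{service}.{namespace}.svc.cluster.local:{port}"
# handed to the Lab pods as JUPYTER_GATEWAY_URL (or c.GatewayClient.url)
```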

OrenZ1 commented 6 months ago

Fixed it! The problem was the Node.js version on the JupyterHub: it was Node.js v10, which has a 120-second default timeout, matching the roughly 2 minutes after which the 503 appeared.