jupyter-server / enterprise_gateway

A lightweight, multi-tenant, scalable and secure gateway that enables Jupyter Notebooks to share resources across distributed clusters such as Apache Spark, Kubernetes and others.
https://jupyter-enterprise-gateway.readthedocs.io/en/latest/

Sometimes cannot connect to kernel through websocket in replication availability mode #1373

Open edwardzjl opened 8 months ago

edwardzjl commented 8 months ago

Description

I deployed enterprise gateway on k8s in replication availability mode (3 replicas) with file session persistence, and ensured that sessionAffinity is on for enterprise gateway service.

I have another app that communicates with EG. That app performs CRUD operations on kernels and establishes connections to them through EG.

The issue is as follows. When I send a GET kernel request to EG, EG automatically loads the saved sessions, so the state is up to date whichever pod I am routed to. Here is the related part of EG's log:

[D 2024-03-04 08:46:11.428 EnterpriseGatewayApp] Loading saved session(s) from /data/kernel_sessions/573e4281-0442-421b-8a25-760549854902.json
[D 2024-03-04 08:46:11.481 EnterpriseGatewayApp] Connecting to: tcp://$IP:$PORT
[D 2024-03-04 08:46:11.481 EnterpriseGatewayApp] Connecting to: tcp://$IP:$PORT
[I 240304 08:46:11 web:2271] 200 GET /api/kernels/573e4281-0442-421b-8a25-760549854902 ($IP) 88.69ms

However, if I perform a websocket connect, EG does not load the saved sessions, which results in intermittent websocket 404s (whenever I am not routed to the pod that owns the kernel):

[D 2024-03-04 03:14:42.377 EnterpriseGatewayApp] Initializing websocket connection /api/kernels/6bb19c3b-26f6-4b4d-bfc9-be831a9e648f/channels
[W 2024-03-04 03:14:42.382 EnterpriseGatewayApp] No session ID specified
[W 240304 03:14:42 web:1796] 404 GET /api/kernels/6bb19c3b-26f6-4b4d-bfc9-be831a9e648f/channels ($IP): Kernel does not exist: 6bb19c3b-26f6-4b4d-bfc9-be831a9e648f
[W 240304 03:14:42 web:2271] 404 GET /api/kernels/6bb19c3b-26f6-4b4d-bfc9-be831a9e648f/channels ($IP) 6.49ms

The same happens for DELETE kernel:

[D 2024-03-04 09:13:01.411 EnterpriseGatewayApp] activity on d09839bb-10b6-42ca-bff8-6051d046d709: execute_input
[D 2024-03-04 09:13:02.318 EnterpriseGatewayApp] activity on d09839bb-10b6-42ca-bff8-6051d046d709: stream
...
[W 240304 09:13:02 web:1796] 404 DELETE /api/kernels/62eca2ab-6f9e-4471-94bb-7f51e83ecc60 ($IP): Kernel does not exist: 62eca2ab-6f9e-4471-94bb-7f51e83ecc60
[W 240304 09:13:02 web:2271] 404 DELETE /api/kernels/62eca2ab-6f9e-4471-94bb-7f51e83ecc60 ($IP) 907.86ms

So I think there are 2 problems: neither the websocket connect nor the DELETE kernel request loads the saved sessions.

I also noticed that there's a brief note about 'manual reconnect' in the documentation, but I haven't figured out how to do it in my setup.

I'm new to EG, so I'm not sure whether this is a real bug or by design. But I think that if GET loads saved sessions automatically, maybe the other endpoints should too.
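As a client-side stopgap, one could send a GET for the kernel right before opening the websocket or sending DELETE, so that the pod serving the client loads the persisted session first. The following is only a minimal sketch and not a fix in EG itself. It assumes that sessionAffinity routes the follow-up request to the same pod as the GET, and that GET keeps loading saved sessions as shown in the log above. The helper names (`kernel_url`, `channels_url`, `warm_up`) and the host `eg:8888` are hypothetical.

```python
import urllib.error
import urllib.request


def kernel_url(base_url: str, kernel_id: str) -> str:
    """REST URL for a single kernel: <base>/api/kernels/<id>."""
    return f"{base_url.rstrip('/')}/api/kernels/{kernel_id}"


def channels_url(base_url: str, kernel_id: str) -> str:
    """Websocket URL for the kernel's channels endpoint (http -> ws, https -> wss)."""
    ws_base = base_url.rstrip("/").replace("http", "ws", 1)
    return f"{ws_base}/api/kernels/{kernel_id}/channels"


def warm_up(base_url: str, kernel_id: str, opener=urllib.request.urlopen) -> bool:
    """GET the kernel so the pod that answers can load the persisted session.

    Returns True when the GET succeeds with HTTP 200, False on an HTTP error
    such as 404. `opener` is injectable so the function can be tested offline.
    """
    try:
        with opener(kernel_url(base_url, kernel_id), timeout=10) as resp:
            return resp.status == 200
    except urllib.error.HTTPError:
        return False
```

Usage would be something like calling `warm_up("http://eg:8888", kernel_id)` and, only if it returns `True`, connecting a websocket client to `channels_url("http://eg:8888", kernel_id)` or issuing the DELETE. If the affinity assumption does not hold (for example, when requests come from different client IPs), this workaround will not help.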

Context

Troubleshoot Output

I added 3 envs and 1 volume to deployment:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: enterprise-gateway
  namespace: jupyter
spec:
  replicas: 3
  template:
    spec:
      containers:
        - name: enterprise-gateway
          env:
            ...
            - name: EG_AVAILABILITY_MODE
              value: replication
            # 2 envs related to session persistence
            - name: EG_KERNEL_SESSION_PERSISTENCE
              value: "True"
            - name: EG_PERSISTENCE_ROOT
              value: /data
          volumeMounts:
            - name: persistence-root
              mountPath: /data
              readOnly: false
      volumes:
        - name: persistence-root
          persistentVolumeClaim:
            claimName: persistence-root
```

Created a pvc to store session data:

```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: persistence-root
  namespace: jupyter
spec:
  storageClassName: nfs-client
  accessModes:
    - ReadWriteMany
  resources:
    requests:
      storage: 1Gi
```

And I checked that sessionAffinity is on in the enterprise-gateway service:

```yaml
apiVersion: v1
kind: Service
metadata:
  name: enterprise-gateway
  namespace: jupyter
spec:
  ...
  sessionAffinity: ClientIP
  sessionAffinityConfig:
    clientIP:
      timeoutSeconds: 10800
```