A lightweight, multi-tenant, scalable and secure gateway that enables Jupyter Notebooks to share resources across distributed clusters such as Apache Spark, Kubernetes and others.
I deployed Enterprise Gateway on k8s in replication availability mode (3 replicas) with file-based session persistence, and made sure that sessionAffinity is enabled on the Enterprise Gateway service.
I have another app that communicates with EG. It performs CRUD operations on kernels and connects to them through EG.
The issue I have is that when I perform a GET kernel request against EG, EG automatically loads the saved sessions, so whichever pod I am routed to, the state is always up to date. Here is the relevant part of EG's log:
[D 2024-03-04 08:46:11.428 EnterpriseGatewayApp] Loading saved session(s) from /data/kernel_sessions/573e4281-0442-421b-8a25-760549854902.json
[D 2024-03-04 08:46:11.481 EnterpriseGatewayApp] Connecting to: tcp://$IP:$PORT
[D 2024-03-04 08:46:11.481 EnterpriseGatewayApp] Connecting to: tcp://$IP:$PORT
[I 240304 08:46:11 web:2271] 200 GET /api/kernels/573e4281-0442-421b-8a25-760549854902 ($IP) 88.69ms
However, if I open a websocket connection, EG does not load the saved sessions, which results in intermittent websocket 404s (whenever I am not routed to the correct pod):
[D 2024-03-04 03:14:42.377 EnterpriseGatewayApp] Initializing websocket connection /api/kernels/6bb19c3b-26f6-4b4d-bfc9-be831a9e648f/channels
[W 2024-03-04 03:14:42.382 EnterpriseGatewayApp] No session ID specified
[W 240304 03:14:42 web:1796] 404 GET /api/kernels/6bb19c3b-26f6-4b4d-bfc9-be831a9e648f/channels ($IP): Kernel does not exist: 6bb19c3b-26f6-4b4d-bfc9-be831a9e648f
[W 240304 03:14:42 web:2271] 404 GET /api/kernels/6bb19c3b-26f6-4b4d-bfc9-be831a9e648f/channels ($IP) 6.49ms
The same happens for DELETE kernel:
[D 2024-03-04 09:13:01.411 EnterpriseGatewayApp] activity on d09839bb-10b6-42ca-bff8-6051d046d709: execute_input
[D 2024-03-04 09:13:02.318 EnterpriseGatewayApp] activity on d09839bb-10b6-42ca-bff8-6051d046d709: stream
...
[W 240304 09:13:02 web:1796] 404 DELETE /api/kernels/62eca2ab-6f9e-4471-94bb-7f51e83ecc60 ($IP): Kernel does not exist: 62eca2ab-6f9e-4471-94bb-7f51e83ecc60
[W 240304 09:13:02 web:2271] 404 DELETE /api/kernels/62eca2ab-6f9e-4471-94bb-7f51e83ecc60 ($IP) 907.86ms
So I think there are 2 problems here:
1. Kubernetes' sessionAffinity is somehow not working (I need to dig deeper into this when I have time). However, even if I can make sessionAffinity work in my cluster, there may still be edge cases when the sessionAffinity times out.
2. The behaviour of GET differs from that of the other methods.
I also noticed that there's a brief note about 'manual reconnect' in the documentation, but I haven't figured out how to do it in my setup.
I'm new to EG, so I'm not sure whether this is a real bug or by design. But I think that if GET loads saved sessions automatically, maybe the other endpoints should too.
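Until that is clarified, one possible client-side stopgap is to issue a GET for the kernel right before the request that fails, relying only on the behaviour observed above (GET loads the saved sessions on whichever pod it reaches). The sketch below does this for DELETE. The helper names `kernel_url`, `http_status` and `delete_kernel_with_priming` are hypothetical, the base URL is a placeholder, and no authentication is handled. Since the GET and the DELETE may still land on different pods, it retries a few times; this is not a guaranteed fix.

```python
import urllib.error
import urllib.request


def kernel_url(base_url: str, kernel_id: str) -> str:
    """Build the /api/kernels/<id> URL seen in the EG logs."""
    return f"{base_url.rstrip('/')}/api/kernels/{kernel_id}"


def http_status(method: str, url: str, timeout: float = 10.0) -> int:
    """Send a request and return only the HTTP status code."""
    req = urllib.request.Request(url, method=method)
    try:
        with urllib.request.urlopen(req, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        # 4xx/5xx responses are raised as exceptions by urllib.
        return err.code


def delete_kernel_with_priming(base_url, kernel_id, request=http_status, max_attempts=3):
    """GET the kernel first (assumed to make the pod load saved sessions), then DELETE.

    Retries when DELETE answers 404, because the two requests may have
    reached different replicas. Returns the last DELETE status code.
    """
    url = kernel_url(base_url, kernel_id)
    status = 404
    for _ in range(max_attempts):
        request("GET", url)  # priming request; its result is not checked
        status = request("DELETE", url)
        if status != 404:
            break
    return status
```

The same GET-first step could be placed before opening the websocket on `/api/kernels/<id>/channels`, with the same caveat about routing.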
Context
Jupyter Enterprise Gateway version: 3.2.2
Troubleshoot Output
I added 3 envs and 1 volume to the deployment:
```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: enterprise-gateway
  namespace: jupyter
spec:
  replicas: 3
  template:
    spec:
      containers:
        - name: enterprise-gateway
          env:
            ...
            - name: EG_AVAILABILITY_MODE
              value: replication
            # 2 envs related to session persistence
            - name: EG_KERNEL_SESSION_PERSISTENCE
              value: "True"
            - name: EG_PERSISTENCE_ROOT
              value: /data
          volumeMounts:
            - name: persistence-root
              mountPath: /data
              readOnly: false
      volumes:
        - name: persistence-root
          persistentVolumeClaim:
            claimName: persistence-root
```
I created a PVC to store the session data:
```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: persistence-root
  namespace: jupyter
spec:
  storageClassName: nfs-client
  accessModes:
    - ReadWriteMany
  resources:
    requests:
      storage: 1Gi
```
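To confirm that every replica sees the same persisted sessions on this shared volume, a small check like the following could be run inside each pod and the outputs compared. It is only a sketch: it assumes the layout shown in the log above (one `<kernel_id>.json` file per kernel under `/data/kernel_sessions`), and the function name `persisted_kernel_ids` is hypothetical.

```python
from pathlib import Path


def persisted_kernel_ids(root="/data/kernel_sessions"):
    """Return the sorted kernel IDs that have a persisted session file.

    Assumes one '<kernel_id>.json' file per kernel, as in the EG log line
    'Loading saved session(s) from /data/kernel_sessions/<id>.json'.
    """
    directory = Path(root)
    if not directory.is_dir():
        return []
    return sorted(path.stem for path in directory.glob("*.json"))


if __name__ == "__main__":
    for kernel_id in persisted_kernel_ids():
        print(kernel_id)
```

If the lists differ between pods, the problem would be in the volume setup rather than in how EG loads sessions.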
And I checked that sessionAffinity is on in the enterprise-gateway service:
```yaml
apiVersion: v1
kind: Service
metadata:
  name: enterprise-gateway
  namespace: jupyter
spec:
  ...
  sessionAffinity: ClientIP
  sessionAffinityConfig:
    clientIP:
      timeoutSeconds: 10800
```