jupyter-server / enterprise_gateway

A lightweight, multi-tenant, scalable and secure gateway that enables Jupyter Notebooks to share resources across distributed clusters such as Apache Spark, Kubernetes and others.
https://jupyter-enterprise-gateway.readthedocs.io/en/latest/

EG continually receives empty payloads from launcher after initial connection #1157

Open blair-anson opened 1 year ago

blair-anson commented 1 year ago

This may be my misunderstanding, so let me outline what I am seeing.

I have EG set up on an EKS Kubernetes cluster. Mounting of a user's NFS folder is working, and user impersonation appears to be working as well. However, the default folder and the "home" folder for the kernel are set to /home/jovyan. According to the documentation, if I set EG_MIRROR_WORKING_DIRS and KERNEL_WORKING_DIR, the kernel working directory should change, and I would expect to see /home/user1 set as the kernel home folder, but I don't see that change applied.

I run a local jupyter instance with the environment variables set like so...

EG_IMPERSONATION_ENABLED=true \
EG_MIRROR_WORKING_DIRS=true \
KERNEL_UID=1001 \
KERNEL_GID=1016 \
KERNEL_USERNAME=user1 \
KERNEL_WORKING_DIR=/home/user1 \
KERNEL_VOLUMES="[{name: 'nfs-volume', nfs:{server: 'xxxxxx', path: '/user1'}}]" \
KERNEL_VOLUME_MOUNTS="[{name: 'nfs-volume', mountPath: '/home/user1'}]" \
jupyter lab --gateway-url=https://zzzzzzz:8888 --GatewayClient.http_user=guest --GatewayClient.http_pwd=yyyyyyy

But when I run pwd or ls ~ in the notebook, I can see the home folder is /home/jovyan and not /home/user1.

The jupyter logs look like this...

[I 2022-09-17 06:24:16.487 ServerApp] Default kernel name on Gateway server (u1_python_kubernetes) differs from Notebook server (python3).  Updating to Gateway server's value.
[I 2022-09-17 06:24:18.907 LabApp] Build is up to date
[W 2022-09-17 06:24:22.546 ServerApp] Notebook test.ipynb is not trusted
[I 2022-09-17 06:24:24.165 ServerApp] Request start kernel: kernel_id=None, path=''
[I 2022-09-17 06:24:27.017 ServerApp] GatewayKernelManager started kernel: 6e3c36bb-09cd-4e4f-b067-39f489e83262, args: {'kernel_id': None, 'kernel_name': 'aiq_python_kubernetes', 'cwd': '/home/user1'}
[I 2022-09-17 06:24:27.638 ServerApp] Connecting to wss://zzzzzzz:8888/api/kernels/6e3c36bb-09cd-4e4f-b067-39f489e83262/channels
[I 2022-09-17 06:24:27.945 ServerApp] Connecting to wss://zzzzzzz:8888/api/kernels/6e3c36bb-09cd-4e4f-b067-39f489e83262/channels
[I 2022-09-17 06:24:28.091 ServerApp] Connecting to wss://zzzzzzz:8888/api/kernels/6e3c36bb-09cd-4e4f-b067-39f489e83262/channels
[I 2022-09-17 06:25:31.744 ServerApp] Creating new notebook in 
[I 2022-09-17 06:25:33.270 ServerApp] Request start kernel: kernel_id=None, path=''
[I 2022-09-17 06:25:36.081 ServerApp] GatewayKernelManager started kernel: 3455011b-2493-4c53-9758-9feeba00432a, args: {'kernel_id': None, 'kernel_name': 'u1_python_kubernetes', 'cwd': '/home/user1'}

The enterprise-gateway pod logs have a number of these errors...

[D 2022-09-17 06:49:40.501 EnterpriseGatewayApp] Received payload ''
[D 2022-09-17 06:49:40.501 EnterpriseGatewayApp] decode_payload exception - JSONDecodeError: Expecting value: line 1 column 1 (char 0)
[D 2022-09-17 06:49:40.502 EnterpriseGatewayApp] Received the following exception detecting legacy kernel response - JSONDecodeError: Expecting value: line 1 column 1 (char 0)
[E 2022-09-17 06:49:40.502 EnterpriseGatewayApp] Failure occurred processing connection: Expecting value: line 1 column 1 (char 0)

kevin-bates commented 1 year ago

Hi @blair-anson - thanks for opening this issue.

The EG_-prefixed envs that appear to be set on your client (notebook) server apply to the EG instance itself and need to be set within that process. If using helm, these can be added into the env stanza of the helm chart (flip the boolean of mirrorWorkingDirs in values.yaml). The KERNEL_-prefixed envs will flow to the EG and be available in the kernel process.

The JSON decoding issues may be a versioning thing where the kernel images are relative to a different version than EG. I suspect EG is older than the kernel images in this case.

What version of EG are you running (this should be present in the EG logs, within the first page of output)? What kernel images (with tags) are being used?

blair-anson commented 1 year ago

@kevin-bates thank you for the prompt response, it is greatly appreciated.

The EG-prefixed envs that appear to be set on your client (notebook) server apply to the EG instance itself and need to be set within that process. If using helm, these can be added into the env stanza of the helm chart (flip the boolean of mirrorWorkingDirs in values.yaml). The KERNEL-prefixed envs will flow to the EG and be available in the kernel process.

Ahh I see. I made the change and the working directory is now working as expected. Thanks!

The JSON decoding issues may be a versioning thing where the kernel images are relative to a different version than EG. I suspect EG is older than the kernel images in this case.

What version of EG are you running (this should be present in the EG logs, within the first page of output)? What kernel images (with tags) are being used?

Jupyter Enterprise Gateway 2.6.0 is what the logs show. The kernel image is custom; it closely resembles the kernel-python image but is based on the latest version of jupyter/minimal-notebook and uses the same kernelspec as python_kubernetes, with just the Docker image and the location of the launch script changed. I tried removing the custom kernel and only having the python_kubernetes kernel, but I see the same issue. The logs are below. I also recently set up Karpenter to handle Kubernetes autoscaling; do you think it might have something to do with the JSON issue?

Starting Jupyter Enterprise Gateway...
[D 2022-09-18 05:33:01.972 EnterpriseGatewayApp] Searching ['/usr/local/bin', '/home/jovyan/.jupyter', '/home/jovyan/.local/etc/jupyter', '/opt/conda/etc/jupyter', '/usr/local/etc/jupyter', '/etc/jupyter'] for config files
[D 2022-09-18 05:33:01.973 EnterpriseGatewayApp] Looking for jupyter_config in /etc/jupyter
[D 2022-09-18 05:33:01.973 EnterpriseGatewayApp] Looking for jupyter_config in /usr/local/etc/jupyter
[D 2022-09-18 05:33:01.974 EnterpriseGatewayApp] Looking for jupyter_config in /opt/conda/etc/jupyter
[D 2022-09-18 05:33:01.974 EnterpriseGatewayApp] Looking for jupyter_config in /home/jovyan/.local/etc/jupyter
[D 2022-09-18 05:33:01.974 EnterpriseGatewayApp] Looking for jupyter_config in /home/jovyan/.jupyter
[D 2022-09-18 05:33:01.974 EnterpriseGatewayApp] Looking for jupyter_config in /usr/local/bin
[D 2022-09-18 05:33:01.975 EnterpriseGatewayApp] Looking for jupyter_enterprise_gateway_config in /etc/jupyter
[D 2022-09-18 05:33:01.975 EnterpriseGatewayApp] Looking for jupyter_enterprise_gateway_config in /usr/local/etc/jupyter
[D 2022-09-18 05:33:01.975 EnterpriseGatewayApp] Looking for jupyter_enterprise_gateway_config in /opt/conda/etc/jupyter
[D 2022-09-18 05:33:01.975 EnterpriseGatewayApp] Looking for jupyter_enterprise_gateway_config in /home/jovyan/.local/etc/jupyter
[D 2022-09-18 05:33:01.976 EnterpriseGatewayApp] Looking for jupyter_enterprise_gateway_config in /home/jovyan/.jupyter
[D 2022-09-18 05:33:01.976 EnterpriseGatewayApp] Looking for jupyter_enterprise_gateway_config in /usr/local/bin
[D 220918 05:33:01 selector_events:59] Using selector: EpollSelector
[I 2022-09-18 05:33:01.984 EnterpriseGatewayApp] Jupyter Enterprise Gateway 2.6.0 is available at http://0.0.0.0:8888
[W 220918 05:34:08 web:1787] 404 DELETE /api/kernels/bf7dcc02-ff01-487f-94f2-462a5b7b662f (10.1.1.82): Kernel does not exist: bf7dcc02-ff01-487f-94f2-462a5b7b662f
[W 220918 05:34:08 web:2239] 404 DELETE /api/kernels/bf7dcc02-ff01-487f-94f2-462a5b7b662f (10.1.1.82) 1.37ms
[I 220918 05:34:30 web:2239] 200 GET /api/kernels (10.1.1.82) 0.95ms
[D 2022-09-18 05:34:45.156 EnterpriseGatewayApp] Found kernel python3 in /opt/conda/share/jupyter/kernels
[D 2022-09-18 05:34:45.156 EnterpriseGatewayApp] Found kernel python_kubernetes in /usr/local/share/jupyter/kernels
[I 220918 05:34:45 web:2239] 200 GET /api/kernelspecs (10.1.1.82) 317.87ms
[D 2022-09-18 05:34:56.638 EnterpriseGatewayApp] Searching kernels for user 'user1' 
[D 2022-09-18 05:34:56.639 EnterpriseGatewayApp] Found kernel python3 in /opt/conda/share/jupyter/kernels
[D 2022-09-18 05:34:56.639 EnterpriseGatewayApp] Found kernel python_kubernetes in /usr/local/share/jupyter/kernels
[I 220918 05:34:56 web:2239] 200 GET /api/kernelspecs?user=user1 (10.1.0.169) 3.74ms
[I 220918 05:34:56 web:2239] 200 GET /api/kernels (10.1.2.91) 1.00ms
[D 2022-09-18 05:34:57.447 EnterpriseGatewayApp] Searching kernels for user 'user1' 
[D 2022-09-18 05:34:57.447 EnterpriseGatewayApp] Found kernel python3 in /opt/conda/share/jupyter/kernels
[D 2022-09-18 05:34:57.447 EnterpriseGatewayApp] Found kernel python_kubernetes in /usr/local/share/jupyter/kernels
[I 220918 05:34:57 web:2239] 200 GET /api/kernelspecs?user=user1 (10.1.1.82) 1.38ms
[D 2022-09-18 05:34:58.686 EnterpriseGatewayApp] RemoteMappingKernelManager.start_kernel: python_kubernetes, kernel_username: user1
[D 2022-09-18 05:34:58.696 EnterpriseGatewayApp] Instantiating kernel 'Python on Kubernetes' with process proxy: enterprise_gateway.services.processproxies.k8s.KubernetesProcessProxy
[I 2022-09-18 05:34:59.200 EnterpriseGatewayApp] Enterprise Gateway is bound to port 8877 for remote kernel connection information.
[D 2022-09-18 05:34:59.201 EnterpriseGatewayApp] Starting kernel (async): ['/opt/conda/bin/python3.9', '/usr/local/share/jupyter/kernels/python_kubernetes/scripts/launch_kubernetes.py', '--RemoteProcessProxy.kernel-id', '90e018c1-1469-4b49-bb05-b0fe039511d5', '--RemoteProcessProxy.port-range', '0..0', '--RemoteProcessProxy.response-address', '10.1.2.77:8877', '--RemoteProcessProxy.public-key', 'MIGfMA0GCSqGSIb3DQEBAQUAA4GNADCBiQKBgQDDckhBDJAPcEexUye0Dkhs3v+r1MP2c7cXZ0Sz2rgTwBIfCURFxJYnVhWJCZgSFBs0UnwF21Q2uHLdJkez8ki9xlyRvqX/dNmk+oK485ZaXHpCXVoKPJGtHsEu16hTAeLo01VVw5xZjF18Ti69L4XcVhlmp+oaOoYox/IXFxu1jQIDAQAB']
[D 2022-09-18 05:34:59.201 EnterpriseGatewayApp] Launching kernel: 'Python on Kubernetes' with command: ['/opt/conda/bin/python3.9', '/usr/local/share/jupyter/kernels/python_kubernetes/scripts/launch_kubernetes.py', '--RemoteProcessProxy.kernel-id', '90e018c1-1469-4b49-bb05-b0fe039511d5', '--RemoteProcessProxy.port-range', '0..0', '--RemoteProcessProxy.response-address', '10.1.2.77:8877', '--RemoteProcessProxy.public-key', 'MIGfMA0GCSqGSIb3DQEBAQUAA4GNADCBiQKBgQDDckhBDJAPcEexUye0Dkhs3v+r1MP2c7cXZ0Sz2rgTwBIfCURFxJYnVhWJCZgSFBs0UnwF21Q2uHLdJkez8ki9xlyRvqX/dNmk+oK485ZaXHpCXVoKPJGtHsEu16hTAeLo01VVw5xZjF18Ti69L4XcVhlmp+oaOoYox/IXFxu1jQIDAQAB']
[I 2022-09-18 05:34:59.234 EnterpriseGatewayApp] Created kernel namespace: user1-90e018c1-1469-4b49-bb05-b0fe039511d5
[I 2022-09-18 05:34:59.271 EnterpriseGatewayApp] Created kernel role-binding 'kernel-controller' in namespace: user1-90e018c1-1469-4b49-bb05-b0fe039511d5 for service account: default
[D 2022-09-18 05:34:59.271 EnterpriseGatewayApp] BaseProcessProxy.launch_process() env: {'SHELL': '/bin/bash', 'KUBERNETES_SERVICE_PORT_HTTPS': '443', 'EG_MIRROR_WORKING_DIRS': 'True', 'KUBERNETES_SERVICE_PORT': '443', 'ENTERPRISE_GATEWAY_PORT_8877_TCP': 'tcp://172.20.140.105:8877', 'EG_NAMESPACE': 'enterprise-gateway', 'ENTERPRISE_GATEWAY_SERVICE_PORT_HTTP': '8888', 'HOSTNAME': 'enterprise-gateway-7446484b77-pcqq2', 'LANGUAGE': 'en_US.UTF-8', 'EG_SHARED_NAMESPACE': 'False', 'EG_PORT': '8888', 'EG_LOG_LEVEL': 'DEBUG', 'JAVA_HOME': '/usr/lib/jvm/java-8-openjdk-amd64', 'ENTERPRISE_GATEWAY_SERVICE_PORT_RESPONSE': '8877', 'EG_KERNEL_WHITELIST': '"python_kubernetes"', 'NB_UID': '1000', 'ENTERPRISE_GATEWAY_SERVICE_HOST': '172.20.140.105', 'AWS_DEFAULT_REGION': 'us-west-2', 'AWS_REGION': 'us-west-2', 'PWD': '/usr/local/bin', 'ENTERPRISE_GATEWAY_PORT_8877_TCP_PROTO': 'tcp', 'EG_CULL_IDLE_TIMEOUT': '3600', 'AWS_ROLE_ARN': 'arn:aws:iam::382095222186:role/enterprise-gateway-sa_dev-gw4.somewhere.com', 'EG_DEFAULT_KERNEL_NAME': 'python_kubernetes', 'ENTERPRISE_GATEWAY_PORT_8888_TCP_PORT': '8888', 'EG_ENABLE_TUNNELING': 'False', 'ENTERPRISE_GATEWAY_PORT_8888_TCP_ADDR': '172.20.140.105', 'EG_KERNEL_LAUNCH_TIMEOUT': '60', 'HOME': '/home/jovyan', 'LANG': 'en_US.UTF-8', 'KUBERNETES_PORT_443_TCP': 'tcp://172.20.0.1:443', 'ENTERPRISE_GATEWAY_PORT_8877_TCP_PORT': '8877', 'EG_LIST_KERNELS': 'True', 'EG_SSH_PORT': '2122', 'NB_GID': '100', 'EG_RESPONSE_PORT': '8877', 'ENTERPRISE_GATEWAY_PORT_8888_TCP': 'tcp://172.20.140.105:8888', 'KG_PORT': '8888', 'EG_CULL_CONNECTED': 'False', 'EG_PORT_RETRIES': '0', 'KG_IP': '0.0.0.0', 'ENTERPRISE_GATEWAY_PORT_8877_TCP_ADDR': '172.20.140.105', 'EG_CULL_INTERVAL': '60', 'EG_IP': '0.0.0.0', 'SHLVL': '1', 'AWS_WEB_IDENTITY_TOKEN_FILE': '/var/run/secrets/eks.amazonaws.com/serviceaccount/token', 'CONDA_DIR': '/opt/conda', 'ENTERPRISE_GATEWAY_SERVICE_PORT': '8888', 'SPARK_HOME': '/opt/spark', 'KUBERNETES_PORT_443_TCP_PROTO': 'tcp', 'KG_PORT_RETRIES': '0', 'KUBERNETES_PORT_443_TCP_ADDR': '172.20.0.1', 'SPARK_VER': '2.4.6', 'ENTERPRISE_GATEWAY_PORT': 'tcp://172.20.140.105:8888', 'AWS_STS_REGIONAL_ENDPOINTS': 'regional', 'NB_USER': 'jovyan', 'KUBERNETES_SERVICE_HOST': '172.20.0.1', 'ENTERPRISE_GATEWAY_PORT_8888_TCP_PROTO': 'tcp', 'LC_ALL': 'en_US.UTF-8', 'KUBERNETES_PORT': 'tcp://172.20.0.1:443', 'KUBERNETES_PORT_443_TCP_PORT': '443', 'PATH': '/opt/conda/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/opt/conda/bin', 'EG_KERNEL_CLUSTER_ROLE': 'kernel-controller', 'DEBIAN_FRONTEND': 'noninteractive', 'KERNEL_VOLUME_MOUNTS': "[{name: 'nfs-volume', mountPath: '/home/user1'}]", 'KERNEL_USERNAME': 'user1', 'KERNEL_UID': '1001', 'KERNEL_VOLUMES': "[{name: 'nfs-volume', nfs:{server: 'fs-xxxxxx.efs.us-west-2.amazonaws.com', path: '/user1'}}]", 'KERNEL_WORKING_DIR': '/home/user1', 'KERNEL_GID': '1016', 'KERNEL_LAUNCH_TIMEOUT': '40', 'KERNEL_GATEWAY': '1', 'KERNEL_POD_NAME': 'user1-90e018c1-1469-4b49-bb05-b0fe039511d5', 'KERNEL_SERVICE_ACCOUNT_NAME': 'default', 'KERNEL_NAMESPACE': 'user1-90e018c1-1469-4b49-bb05-b0fe039511d5', 'KERNEL_IMAGE': 'elyra/kernel-py:2.6.0', 'KERNEL_EXECUTOR_IMAGE': 'elyra/kernel-py:2.6.0', 'EG_MIN_PORT_RANGE_SIZE': '1000', 'EG_MAX_PORT_RANGE_RETRIES': '5', 'KERNEL_ID': '90e018c1-1469-4b49-bb05-b0fe039511d5', 'KERNEL_LANGUAGE': 'python', 'EG_IMPERSONATION_ENABLED': 'False'}
[I 2022-09-18 05:34:59.280 EnterpriseGatewayApp] KubernetesProcessProxy: kernel launched. Kernel image: elyra/kernel-py:2.6.0, KernelID: 90e018c1-1469-4b49-bb05-b0fe039511d5, cmd: '['/opt/conda/bin/python3.9', '/usr/local/share/jupyter/kernels/python_kubernetes/scripts/launch_kubernetes.py', '--RemoteProcessProxy.kernel-id', '90e018c1-1469-4b49-bb05-b0fe039511d5', '--RemoteProcessProxy.port-range', '0..0', '--RemoteProcessProxy.response-address', '10.1.2.77:8877', '--RemoteProcessProxy.public-key', 'MIGfMA0GCSqGSIb3DQEBAQUAA4GNADCBiQKBgQDDckhBDJAPcEexUye0Dkhs3v+r1MP2c7cXZ0Sz2rgTwBIfCURFxJYnVhWJCZgSFBs0UnwF21Q2uHLdJkez8ki9xlyRvqX/dNmk+oK485ZaXHpCXVoKPJGtHsEu16hTAeLo01VVw5xZjF18Ti69L4XcVhlmp+oaOoYox/IXFxu1jQIDAQAB']'
[D 2022-09-18 05:34:59.284 EnterpriseGatewayApp] Received payload ''
[D 2022-09-18 05:34:59.284 EnterpriseGatewayApp] decode_payload exception - JSONDecodeError: Expecting value: line 1 column 1 (char 0)
[D 2022-09-18 05:34:59.290 EnterpriseGatewayApp] Received the following exception detecting legacy kernel response - JSONDecodeError: Expecting value: line 1 column 1 (char 0)
[E 2022-09-18 05:34:59.291 EnterpriseGatewayApp] Failure occurred processing connection: Expecting value: line 1 column 1 (char 0)
blair-anson commented 1 year ago

Further to this, I am using JupyterLab 3.0.16. It looks like JupyterLab is continuously requesting kernels, sessions, and terminals from the Gateway image.

kevin-bates commented 1 year ago

Only the active kernels request should be getting proxied to the Gateway. Terminal and session requests will be handled by the server. You should also see periodic requests (about every 10 seconds) for kernel specs, which will also be proxied to the Gateway.

Jupyter Lab (IMHO) does way too much polling and this is why I added kernel-spec caching to EG. I'm hoping Lab will be able to leverage the event framework that is currently being added to Jupyter Server to alleviate some of this.

kevin-bates commented 1 year ago

Back to your issue. Have you modified elyra/kernel-py:2.6.0 in any way? What is contained in the kernel pod's logs? Note that this can be tricky to capture, especially if you're using the default namespace. It may be easier to create a fixed namespace, then add the KERNEL_NAMESPACE env where the server is running (and restart the server). This env will then flow to EG, and it will launch the kernel pod in that namespace.

blair-anson commented 1 year ago

No, I have not modified elyra/kernel-py:2.6.0. Logs for that kernel are below.

I do build EG from source following the make process and package it up in my own Docker image. So I tried using the image elyra/enterprise-gateway:2.6.0 instead, but I had the same issue.

I thought it might be the AWS load balancer, so I removed the target groups, but the errors still continuously appear in the EG logs. The only thing left is the Kubernetes configuration. I did have a lot of trouble trying to get the service exposed via an ingress to an AWS load balancer, so that is the next likely suspect.

/usr/local/bin/bootstrap-kernel.sh env: SHELL=/bin/bash KUBERNETES_SERVICE_PORT_HTTPS=443 KUBERNETES_SERVICE_PORT=443 KERNEL_NAME=python_kubernetes HOSTNAME=user1-e482933b-8449-453e-a172-318fc822dfc2 LANGUAGE=en_US.UTF-8 KERNEL_SPARK_CONTEXT_INIT_MODE=none KERNEL_ID=e482933b-8449-453e-a172-318fc822dfc2 NB_UID=1000 PWD=/home/user1 RESPONSE_ADDRESS=10.1.1.189:8877 MINICONDA_MD5=87e77f097f6ebb5127c77662dfc3165e HOME=/home/jovyan LANG=en_US.UTF-8 KUBERNETES_PORT_443_TCP=tcp://172.20.0.1:443 NB_GID=100 XDG_CACHE_HOME=/home/jovyan/.cache/ SHLVL=0 CONDA_DIR=/opt/conda MINICONDA_VERSION=4.8.2 KUBERNETES_PORT_443_TCP_PROTO=tcp KUBERNETES_PORT_443_TCP_ADDR=172.20.0.1 PORT_RANGE=0..0 KERNEL_USERNAME=user1 KERNEL_LANGUAGE=python CONDA_VERSION=4.8.2 NB_USER=jovyan KUBERNETES_SERVICE_HOST=172.20.0.1 LC_ALL=en_US.UTF-8 KUBERNETES_PORT=tcp://172.20.0.1:443 KUBERNETES_PORT_443_TCP_PORT=443 PATH=/opt/conda/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/opt/conda/bin PUBLIC_KEY=MIGfMA0GCSqGSIb3DQEBAQUAA4GNADCBiQKBgQDhkGFtwxV20wi8jupGgbQSpnQwmzCI8WllZGGYcZbPEgJihakxMnj+3rW0TEtVsB2PXJGt81Y7+GEFn4B30wb9tO2Qn0P6e+5SwdBn1cXJw/tsP5xcm7NKdISACcHLzrNnq8bnA5OanDtyj87soFPsHckTj4heeFHAO2ZlIeSxcwIDAQAB DEBIAN_FRONTEND=noninteractive KERNEL_NAMESPACE=user1-e482933b-8449-453e-a172-318fc822dfc2 _=/usr/bin/env
+ python /usr/local/bin/kernel-launchers/python/scripts/launch_ipykernel.py --kernel-id e482933b-8449-453e-a172-318fc822dfc2 --port-range 0..0 --response-address 10.1.1.189:8877 --public-key MIGfMA0GCSqGSIb3DQEBAQUAA4GNADCBiQKBgQDhkGFtwxV20wi8jupGgbQSpnQwmzCI8WllZGGYcZbPEgJihakxMnj+3rW0TEtVsB2PXJGt81Y7+GEFn4B30wb9tO2Qn0P6e+5SwdBn1cXJw/tsP5xcm7NKdISACcHLzrNnq8bnA5OanDtyj87soFPsHckTj4heeFHAO2ZlIeSxcwIDAQAB --spark-context-initialization-mode none
[D 2022-09-23 11:44:46,751.751 launch_ipykernel] Using connection file '/tmp/kernel-e482933b-8449-453e-a172-318fc822dfc2_n50bnjpk.json'.
[I 2022-09-23 11:44:46,752.752 launch_ipykernel] Signal socket bound to host: 0.0.0.0, port: 48273
[D 2022-09-23 11:44:46,752.752 launch_ipykernel] JSON Payload 'b'{"shell_port": 42181, "iopub_port": 44443, "stdin_port": 49431, "control_port": 49091, "hb_port": 37043, "ip": "0.0.0.0", "key": "3acd22ab-c0c0-407d-96e3-1e794757935e", "transport": "tcp", "signature_scheme": "hmac-sha256", "kernel_name": "", "pid": 9, "pgid": 7, "comm_port": 48273, "kernel_id": "e482933b-8449-453e-a172-318fc822dfc2"}'
[D 2022-09-23 11:44:46,766.766 launch_ipykernel] Encrypted Payload 'b'eyJ2ZXJzaW9uIjogMSwgImtleSI6ICJoeGk5cGlCUFNySUxFaG4zd3dyQW1uYVFwZTlrQys0QzRaelNhTVZhN1VMZytIVUJKNnJpTDlSRVNVcmQ3Wk5aMHJCSTZKWU1MUVZwbDlVV1hjci96VWVzcTR2cmhGS1BRTzJ4aS8rT3g1aVRGcGNidkNJb1BvNTk0V0s0VGtNdkw3bHJwVys3aTJRNkJoSUM2RE9ieWhFZTAxOEdqQkZ3ZGdweWYrRTVpTGc9IiwgImNvbm5faW5mbyI6ICJ6aWk0clFvbDk0Y0swREl0WFYvY3RmOE4vTWUxSU1RRW9JNlpQN0MvbjlXWXloSDAxUFBMS2hPSDUxOGIzMmdMaWZ2MERBT01nOHUyamk4bkpMR3JKVEFQSFU1ZXZzU0tLbC9QMVNPalJwRkt4M01DOWhSQ1QrUUNkcDdRc1ZyUnVCeUxVTDRmNUpNeGVhWUdTRW0wais1Y2dIUm5sazFLQUUzY2tOQnJ5VDRmU2pkR1kzTUx2V21ReG9QYld2V0g0cDYvRm5kMmU4amltTUcxdHlKMmFURVFvWEZxQzdhNVFoaE9IRFpEVERVek5oc1RnMG12eGlORmdXdVpOZnpYTVpqa3RjMzhHZ2ZnM05Vd082RElISWpRdklQdjNmL0ROcEFjcFovUGt2UittNHFBL29QNTFOWHB1a2d2cSthU0VFZXNTaWRaeWJLTU4zSVoySk9kNWNRRVVmd0I3SWluL0hyb2UwbFY0VmZqQk1SU3VrNTVqMU9FN1JNZ1k3NG1JSEFjYUwyRmo2RHpsWThybkFCNVcrTzloRFI2SVkxNXl1T2Y3aUZMMTBSZVBGUno5T2hFbnRpNTBnWWRYNFBsL2lrakFtSGJlV05iWkRENGlGQUhOQT09In0='
rahul26goyal commented 1 year ago

Hi @blair-anson, I went over the issue and here is what I have understood. You were facing 2 issues with your setup:

Issue-1: using EG_MIRROR_WORKING_DIRS and KERNEL_WORKING_DIR.

On this, you have confirmed that it is resolved and working as expected after following @kevin-bates's suggestion.

Issue-2: Error in the JEG server log while starting a kernel.

On this, I would like to understand if you were ever able to launch a kernel successfully using JEG and execute any notebook cell?

blair-anson commented 1 year ago

On this, I would like to understand if you were ever able to launch a kernel successfully using JEG and execute any notebook cell?

Yes, the kernel does launch and run, and I can execute cells in the notebook. It is just that after executing a cell, the JEG log gets spammed with these error messages. But even with those error messages filling up the log, executing additional cells in the notebook still works.

[D 2022-09-18 05:34:59.284 EnterpriseGatewayApp] Received payload ''
[D 2022-09-18 05:34:59.284 EnterpriseGatewayApp] decode_payload exception - JSONDecodeError: Expecting value: line 1 column 1 (char 0)
[D 2022-09-18 05:34:59.290 EnterpriseGatewayApp] Received the following exception detecting legacy kernel response - JSONDecodeError: Expecting value: line 1 column 1 (char 0)
[E 2022-09-18 05:34:59.291 EnterpriseGatewayApp] Failure occurred processing connection: Expecting value: line 1 column 1 (char 0)
blair-anson commented 1 year ago

I looked into the JEG codebase and traced the exception to this function

enterprise_gateway/services/processproxies/processproxy.py

Specifically, this line that decodes the data received from asyncio:

https://github.com/jupyter-server/enterprise_gateway/blob/main/enterprise_gateway/services/processproxies/processproxy.py#L277

The data received is empty, and hence the JSON read fails. I added a simple data length check. This of course is just a workaround; it doesn't explain why the connection receives empty data.

                if not buffer:  # send is complete, process payload
                    if len(data) > 0:  ## <<----- check for empty data
                        self.log.debug("Received payload '{}'".format(data))
                        payload = self._decode_payload(data)
                        self.log.debug("Decrypted payload '{}'".format(payload))
                        self._post_connection(payload)
                    break
                data = data + buffer.decode(encoding='utf-8')  # append what we received until we get no more...
kevin-bates commented 1 year ago

Hi @blair-anson - what is perplexing to both @rahul26goyal and me is that that code is responsible for receiving the kernel connection information from the remote kernel (pod in this case). So if that isn't working, the kernel should not be working because the server has no way to communicate with it.

The other odd thing is the "spam" that occurs. It sounds like there's a correlation between the empty payload messages and cell execution, yet the remote kernel sends the connection payload exactly once. Might you have other kernel pods running from a previously failed attempt? Can you check all namespaces (since you have kernels running in their own namespace - e.g., user1-90e018c1-1469-4b49-bb05-b0fe039511d5 from above) and see if there are others that you don't expect?

In the opening description, you show the log output of the launch and an empty payload, so there's no way a kernel with ID 90e018c1-1469-4b49-bb05-b0fe039511d5 can be communicated with.
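
For context, the handshake being described works like this: the launcher in the kernel pod opens a single connection to EG's response address and sends its connection info exactly once, after which EG decodes it and communicates with the kernel directly. Below is a minimal, illustrative sketch of that launcher-side send (not the actual launch_ipykernel.py code, which additionally encrypts the payload with the supplied --public-key; the address and values are placeholders taken from the logs above).

import json
import socket

def send_connection_info(response_address: str, connection_info: dict) -> None:
    """Send the kernel's connection info to EG's response listener exactly once."""
    host, port = response_address.split(":")
    payload = json.dumps(connection_info).encode("utf-8")
    # EG reads this socket until EOF, so an empty read means something connected
    # to the response port without sending any data.
    with socket.create_connection((host, int(port)), timeout=10) as sock:
        sock.sendall(payload)

# Shape of the payload shown in the kernel pod log above (placeholder values):
# send_connection_info("10.1.1.189:8877", {"shell_port": 42181, "iopub_port": 44443,
#     "ip": "0.0.0.0", "transport": "tcp", "comm_port": 48273,
#     "kernel_id": "e482933b-8449-453e-a172-318fc822dfc2"})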

Could you please do the following...

  1. Stop EG (terminate the pod) - so that a fresh log is produced on restart.
  2. Ensure all kernel namespaces are deleted outside of the EG namespace.
  3. Restart EG.
  4. Launch ONE kernel. Execute some cells and perform an interrupt operation within the notebook (little square icon). You don't necessarily need to be running any code, I just want to see the interaction in the logs.
  5. Capture the kernel pod log.
  6. Shutdown that one kernel
  7. Capture the EG log.
  8. If you still see the kernel pod running after shutdown (the namespace should not exist at this point), capture the kernel log again.
  9. Provide us with the complete captured logs.

Thank you.

It's good that you're able to use the kernel, we just need to figure out why. :smile:

kevin-bates commented 1 year ago

Also, if you're in a position to try EG 3.0.0rc0, that would be great! That would serve as another datapoint and bring you up to date on the main branch.

kevin-bates commented 1 year ago

FWIW - I just deployed EG 2.6.0 in my kubernetes cluster and get the following log entries...

The expected startup banner...

[I 2022-09-24 15:16:23.120 EnterpriseGatewayApp] Jupyter Enterprise Gateway 2.6.0 is available at http://0.0.0.0:8888

The expected payload response after 8 poll attempts...

[D 2022-09-24 15:18:50.140 EnterpriseGatewayApp] 7: Waiting to connect to k8s pod in namespace 'jovyan-d38e6e8e-dd4a-4d1c-a75e-b6bd0a1452e4'. Name: 'jovyan-d38e6e8e-dd4a-4d1c-a75e-b6bd0a1452e4', Status: 'Pending', Pod IP: 'None', KernelID: 'd38e6e8e-dd4a-4d1c-a75e-b6bd0a1452e4'
[D 2022-09-24 15:18:50.661 EnterpriseGatewayApp] 8: Waiting to connect to k8s pod in namespace 'jovyan-d38e6e8e-dd4a-4d1c-a75e-b6bd0a1452e4'. Name: 'jovyan-d38e6e8e-dd4a-4d1c-a75e-b6bd0a1452e4', Status: 'Pending', Pod IP: 'None', KernelID: 'd38e6e8e-dd4a-4d1c-a75e-b6bd0a1452e4'
[D 2022-09-24 15:18:50.969 EnterpriseGatewayApp] Received payload 'eyJ2ZXJzaW9uIjogMSwgImtleSI6ICJFd1pHc3FJMkRzU3pYUFN2QTdlbmU2b1h1ZzdoRi83Q2tiWXRtbjJybU9iK2hEa3NIMG1mWUhYaFh6eUQ2U0RLVjJVNitNbGxmVDNNWjBWUFJVcy9uQk5UTWxTVy9ZRnlkdlpPWlBKNDEvSExIalFpNEtIUWk2VGdxU3RIVnRzRm5jRWwrSnlhdnBITlB1cHpGcWpmblU1ZUVVTi9MWUJHa0NmRHJDWGtnY1k9IiwgImNvbm5faW5mbyI6ICJnWDhVQW0yUnlhUGZseDlRVVliS3BtUXhnVXUxdE01L0poa0RUSFprNlhnVVZ5ZG5VaUdYaklHYWdPcDhLNC9QUnJ1RmQxbkM0TGwvTG5tdk44aEY0NUNvVHd2V3NOSTVHY25ScHBBNmVSNEJURzZiOHVEZnZyaUVCYWdyclFuY0JvYmJ3bkJydjNoQnN3UzdQU2s0YUJzTEdncThtVTBpYm94NUxmdmpPVk12UzVHNSs0NkZ1OUpaejVoTUhNMDVheTM5eGFzUTV4Ujd1Vi9ZT2U3Y1dVazJMWnB5emJyL1hhblBEdGtmU0huQVNDVzhqaXJaZVF4Qm5oMkNnSzZOMXF0a1U0QXA0T3JERHFPNENYYmNxM2NMMlNUQVRJTGZCRGJJTkd2YTBSNVAwSjk2Rm91K1phVHJaOW9Nb1o1ME1Kems2TXN1Y2txdXBod3JsQTV5SWhYWUNwNklNRUlpT0lYd2VIam82WmphUUFjWWJsUWRVZXFSdWQvZ1NXWi9vZWE2YlJNbUJHZURmc0xzaklBUXpoeEVxRnk4MGVmYXExenlEWnJTWkJzVnFCcW9pMGJlaGxUNG9SYSs4VC9jUHNkTXZJZTdiRVhaM1owWENVcDM2QT09In0='
[D 2022-09-24 15:18:50.969 EnterpriseGatewayApp] Version 1 payload received.
[D 2022-09-24 15:18:50.974 EnterpriseGatewayApp] Decrypted payload '{'shell_port': 36201, 'iopub_port': 55251, 'stdin_port': 48427, 'control_port': 51329, 'hb_port': 43663, 'ip': '0.0.0.0', 'key': '62a9ea6c-f758-4aba-8e99-3200e35dec55', 'transport': 'tcp', 'signature_scheme': 'hmac-sha256', 'kernel_name': '', 'pid': 9, 'pgid': 7, 'comm_port': 47093, 'kernel_id': 'd38e6e8e-dd4a-4d1c-a75e-b6bd0a1452e4'}'
[D 2022-09-24 15:18:50.974 EnterpriseGatewayApp] Connection info received for kernel 'd38e6e8e-dd4a-4d1c-a75e-b6bd0a1452e4': {'shell_port': 36201, 'iopub_port': 55251, 'stdin_port': 48427, 'control_port': 51329, 'hb_port': 43663, 'ip': '0.0.0.0', 'key': '62a9ea6c-f758-4aba-8e99-3200e35dec55', 'transport': 'tcp', 'signature_scheme': 'hmac-sha256', 'kernel_name': '', 'pid': 9, 'pgid': 7, 'comm_port': 47093, 'kernel_id': 'd38e6e8e-dd4a-4d1c-a75e-b6bd0a1452e4'}
[D 2022-09-24 15:18:51.184 EnterpriseGatewayApp] 9: Waiting to connect to k8s pod in namespace 'jovyan-d38e6e8e-dd4a-4d1c-a75e-b6bd0a1452e4'. Name: 'jovyan-d38e6e8e-dd4a-4d1c-a75e-b6bd0a1452e4', Status: 'Running', Pod IP: '10.42.0.19', KernelID: 'd38e6e8e-dd4a-4d1c-a75e-b6bd0a1452e4'
[D 2022-09-24 15:18:51.196 EnterpriseGatewayApp] Host assigned to the kernel is: 'jovyan-d38e6e8e-dd4a-4d1c-a75e-b6bd0a1452e4' '10.42.0.19'
[D 2022-09-24 15:18:51.197 EnterpriseGatewayApp] Established gateway communication to: 10.42.0.19:47093 for KernelID 'd38e6e8e-dd4a-4d1c-a75e-b6bd0a1452e4'

The expected shutdown response...

[I 2022-09-24 15:19:45.278 EnterpriseGatewayApp] Kernel shutdown: d38e6e8e-dd4a-4d1c-a75e-b6bd0a1452e4
[D 2022-09-24 15:19:45.278 EnterpriseGatewayApp] Signal (2) sent via gateway communication port.
[D 2022-09-24 15:19:45.279 EnterpriseGatewayApp] Shutdown request sent to listener via gateway communication port.
[I 220924 15:19:46 web:2239] 200 GET /api/kernels (127.0.0.1) 6.14ms
[D 2022-09-24 15:19:47.807 EnterpriseGatewayApp] Kernel is taking too long to finish, terminating
[D 2022-09-24 15:19:49.290 EnterpriseGatewayApp] KubernetesProcessProxy.terminate_container_resources, pod: jovyan-d38e6e8e-dd4a-4d1c-a75e-b6bd0a1452e4.jovyan-d38e6e8e-dd4a-4d1c-a75e-b6bd0a1452e4, kernel ID: d38e6e8e-dd4a-4d1c-a75e-b6bd0a1452e4 has been terminated.
[I 220924 15:19:49 web:2239] 204 DELETE /api/kernels/d38e6e8e-dd4a-4d1c-a75e-b6bd0a1452e4 (127.0.0.1) 4019.42ms

More questions...

  1. Might you have multiple EG instances deployed or have enabled replicas?
  2. Besides the couple of environment variables you've set to enable the working directory/impersonation stuff, have you changed/modified anything else? You mentioned that you build your EG - are you building your instances from the v2.6.0 tag or main? I'm guessing the former based on your logged banner statement.
blair-anson commented 1 year ago
  1. Might you have multiple EG instances deployed or have enabled replicas?

No, I only see one EG namespace with one EG pod in it

  1. Besides the couple of environment variables you've set to enable the working directory/impersonation stuff, have you changed/modified anything else? You mentioned that you build your EG - are you building your instances from the v2.6.0 tag or main? I'm guessing the former based on your logged banner statement.

I am using tag v2.6.0. Modifications:

Here are the logs as requested and the steps I followed...

  1. I modified the EG code & deployed it so the JSON decode issue will just appear as a single-line "." message in the logs; this helps keep the logs readable
  2. turned off jupyterhub
  3. deleted the existing enterprise-gateway pod
  4. deleted existing kernel namespaces (there were about 6 of these)
  5. started log capture for enterprise-gateway
  6. ran jupyterlab with the command below
KERNEL_UID=1001 KERNEL_GID=1016 KERNEL_USERNAME=blair KERNEL_WORKING_DIR=/home/blair KERNEL_VOLUMES="[{name: 'nfs-volume', nfs: {server: 'fs-xxxx.efs.us-west-2.amazonaws.com', path: '/blair'}}]" KERNEL_VOLUME_MOUNTS="[{name: 'nfs-volume', mountPath: '/home/blair'}]" jupyter lab --gateway-url=https://xxxxx:8888 --GatewayClient.http_user=guest --GatewayClient.http_pwd=guest-password
  7. JupyterLab didn't connect to the gateway, so I had to refresh the browser (this normally happens after deleting the EG pod)
  8. saw one kernel created
  9. started log capture of the kernel
  10. ran some cells, then pressed the interrupt button
  11. the kernel was still running, so I selected terminate
  12. I could see the kernel stop and the namespace disappear in Kubernetes; no other kernel was running
  13. stopped capturing logs

As you can see, the issue occurred in the EG log.

I also wanted to mention that when I do the same process using JupyterHub instead of a local JupyterLab, I sometimes see a kernel being created when JH connects to EG, even if no notebook is open by default. Then when I open a notebook and start a kernel, a second kernel is created in EG. I suspect this may also be related, and is the likely cause of the 6 kernels from step 3). I will capture the logs when using JupyterHub and post them in a following comment.

Attached logs: jupyterlab.log, enterprise-gateway.log, blair-e25518e8-6c47-4436-9887-960fd7e6eea4.log

blair-anson commented 1 year ago

Logs from the JupyterHub test: enterprise-gateway.log, blair-bd858062-01df-494a-960b-7ded4511021c.log

kevin-bates commented 1 year ago

Thank you for the detailed response! This is extremely odd - I definitely see what you mean by "spamming the log"! We've never seen this before and can't reproduce this.

Are you able to try running a deployment with zero changes - or the minimum necessary to operate within AWS?

Might the kernel pod's restart policy (should be never) have been accidentally modified or side-effected by other changes? (I doubt this is it since we see the spamming occur during the kernel's interaction.)

The kernel is not shutting down gracefully either, so it seems something else is in-play here.

If a Lab session is ended with an open notebook, that notebook will be opened and its kernel started on next invocation. EG will not start kernels upon its startup - only on request via its REST API.

I'm at a loss as to what might be going on here. It's like something outside of the software is hitting the response address ('--RemoteProcessProxy.response-address', '10.1.0.101:8877'), and that only starts happening once a kernel has been started.

I'm also not seeing a discernible pattern in the spam log timestamps. There may be a backoff pattern happening but that's tough to determine based on general log latency.
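
One way to check for a backoff pattern is to pull the timestamps of the "Received payload" entries out of the EG log and look at the gaps between them. A small illustrative script, assuming the log line format shown earlier in this thread:

import re
from datetime import datetime

# Matches lines like:
# [D 2022-09-18 05:34:59.284 EnterpriseGatewayApp] Received payload ''
STAMP = re.compile(r"\[. (\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d{3}).*Received payload ''")

def payload_gaps(log_path: str) -> list[float]:
    """Return the seconds elapsed between consecutive empty-payload log entries."""
    times = []
    with open(log_path) as f:
        for line in f:
            m = STAMP.search(line)
            if m:
                times.append(datetime.strptime(m.group(1), "%Y-%m-%d %H:%M:%S.%f"))
    return [(b - a).total_seconds() for a, b in zip(times, times[1:])]

# e.g. print(payload_gaps("enterprise-gateway.log"))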

I'm hoping you can try a vanilla deployment of some form so we can get a better comparison.

The kernel pod should be logging whenever it goes to send the connection information (which your other log shows), but we don't see the spam calls. Also, the fact that they continue to occur after the kernel pod and its namespace are terminated is, well, alarming.

blair-anson commented 1 year ago

Yes, later this week I plan to start with the basic deployment and test each modification. I will let you know how it goes.

kevin-bates commented 1 year ago

Hi @blair-anson - Could you please send a copy of the launch_ipykernel.py file located in the /usr/local/bin/kernel-launchers/python/scripts directory of your kernel pod's image, and a copy of both files in the custom_python_kubernetes/scripts directory of your EG pod's image? Thank you.

kevin-bates commented 1 year ago

I've gone ahead and updated the title to reflect the actual issue.

kevin-bates commented 1 year ago

@blair-anson - I suspect this issue might be related to the periodic callback that calls _process_connections() firing prior to the previous callback's completion. Given we've never seen this, I can't tell for sure, but I believe your protective measure to ensure the data buffer has content is a good approach since that will break the while loop.

When you added this protection, did things behave correctly?

Hmm. One difference between using the "protective measure" and not is that, in the case where the empty payload is processed, an exception (multiple, actually) is raised and caught, but the connection is never closed! With the protection in place, the connection is closed.

Therefore, I recommend we make two changes...

  1. Ensure the data buffer has length > 0 (i.e., the "protective measure"). It may also be good to DEBUG log cases in which an empty payload was encountered.
  2. Move the conn.close() call into a finally block that conditionally closes the connection if conn is not None.

I'm going to experiment with timeout values to see if I can reproduce this in some manner, but I think we should make those two changes in our 3.0 release.
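
Put together, a simplified sketch of how the connection-processing loop could look with both changes applied (names such as conn, data, _decode_payload, and _post_connection follow the snippet quoted earlier; this is illustrative, not the actual patch):

def process_connection(self, conn):
    """Read one kernel's connection payload, tolerating empty sends and always closing."""
    data = ""
    try:
        while True:
            buffer = conn.recv(1024)
            if not buffer:  # send is complete, process payload
                if len(data) > 0:  # change 1: only decode non-empty payloads
                    payload = self._decode_payload(data)
                    self._post_connection(payload)
                else:
                    # change 1 (continued): record empty payloads at DEBUG level
                    self.log.debug("Received empty payload on response socket - ignoring")
                break
            data = data + buffer.decode(encoding="utf-8")
    finally:
        # change 2: close the connection even if decoding raises
        if conn is not None:
            conn.close()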

blair-anson commented 1 year ago

@kevin-bates apologies for the delay.

Here are the files requested. You will notice in launch_kubernetes.py & kernel-pod.yaml.j2 some customisation for passing environment variables.

When you added this protection, did things behave correctly?

Kind of... I found that starting a kernel often timed out the first time. Then reloading the Jupyter page sometimes connected to the new kernel, and other times I had to start a new kernel... then after a while I ended up with 2 running kernels for 1 notebook. Once the first kernel was running, things seemed to work as expected, and adding additional kernels was no problem. I did increase the timeout, but setting up the connection to the first kernel still seems flaky. As I am running JupyterHub remotely from EG, I am not sure if it is a network problem, a Kubernetes problem, or an EG configuration problem. I have not yet had time to have a proper look at what is going on. Alternatively, I may just move to hosting JupyterHub on the same Kubernetes cluster as EG to see if that resolves the problem.

kernel-pod.yaml.j2.txt launch_kubernetes.py.txt

launch_ipykernel.py.txt

kevin-bates commented 1 year ago

Thanks - no worries about the delay.

I was thinking, do you deploy the kernel-image-puller and, if not, are your nodes "pre-configured" to include the necessary kernel images? Often, if the image is not present, the initial request will time out within the time it takes to pull the image, so I'm wondering if that's related to the "first time" issues you're seeing. This scenario fits the notion where you'd wind up with two kernels running since the first does eventually start up (once the image is available).

I'll try to check the files you provided early next week - thanks for their inclusion.