tafaust opened this issue 2 years ago
Thank you for opening your first issue in this project! Engagement like this is essential for open source projects! :hugs:
If you haven't done so already, check out Jupyter's Code of Conduct. Also, please try to follow the issue template as it helps other community members to contribute more effectively.
You can meet the other Jovyans by joining our Discourse forum. There is also an intro thread there where you can stop by and say Hi! :wave:
Welcome to the Jupyter community! :tada:
I should also point out that I tracked the error down to this line in the jupyter-server project: https://github.com/jupyter-server/jupyter_server/blob/7d2154a1e243f80ed5fc4c067fd022e32f3fc8f0/jupyter_server/gateway/managers.py#L70
Hi @tahesse - thanks for opening this issue and the great details! I will try to take a look into this next week but if anyone else wants to look into it, that would be great!
The kernel logs look okay, and given that EG never appears to receive the kernel connection information, there's likely something amiss between the kernel pod and the EG pod. I'm assuming they're running within the same network - correct?
10.100.44.42 (used by the kernel pod) looks like it might be a public IP, while this information implies an internal IP of 172.20.43.27:
[D 2022-10-07 15:09:11.198 EnterpriseGatewayApp] BaseProcessProxy.launch_process() env: {'SHELL': '/bin/bash', 'KUBERNETES_SERVICE_PORT_HTTPS': '443', 'EG_MIRROR_WORKING_DIRS': 'False', 'KUBERNETES_SERVICE_PORT': '443', 'ENTERPRISE_GATEWAY_PORT_8877_TCP': 'tcp://172.20.43.27:8877', 'EG_NAMESPACE': 'ns-jupyter', 'ENTERPRISE_GATEWAY_SERVICE_PORT_HTTP': '8888', 'HOSTNAME': 'enterprise-gateway-8559c987dd-9lv2p', 'LANGUAGE': 'en_US.UTF-8', 'EG_SHARED_NAMESPACE': 'False', 'EG_PORT': '8888', 'EG_LOG_LEVEL': 'DEBUG', 'JAVA_HOME': '/usr/lib/jvm/java-8-openjdk-amd64', 'NB_UID': '1000', 'ENTERPRISE_GATEWAY_SERVICE_HOST': '172.20.43.27', 'EG_ALLOWED_KERNELS': '"python_kubernetes","spark_python_operator"', 'PWD': '/usr/local/bin', 'ENTERPRISE_GATEWAY_PORT_8877_TCP_PROTO': 'tcp', 'EG_CULL_IDLE_TIMEOUT': '3600', 'EG_DEFAULT_KERNEL_NAME': 'spark_python_operator', 'ENTERPRISE_GATEWAY_SERVICE_PORT_HTTP_RESPONSE': '8877', 'ENTERPRISE_GATEWAY_PORT_8888_TCP_PORT': '8888', 'EG_ENABLE_TUNNELING': 'False', 'ENTERPRISE_GATEWAY_PORT_8888_TCP_ADDR': '172.20.43.27', 'EG_KERNEL_LAUNCH_TIMEOUT': '120', 'HOME': '/home/jovyan', 'LANG': 'en_US.UTF-8', 'KUBERNETES_PORT_443_TCP': 'tcp://172.20.77.1:443', 'ENTERPRISE_GATEWAY_PORT_8877_TCP_PORT': '8877', 'EG_LIST_KERNELS': 'True', 'EG_SSH_PORT': '2122', 'NB_GID': '100', 'EG_RESPONSE_PORT': '8877', 'ENTERPRISE_GATEWAY_PORT_8888_TCP': 'tcp://172.20.43.27:8888', 'KG_PORT': '8888', 'EG_CULL_CONNECTED': 'False', 'EG_PORT_RETRIES': '0', 'KG_IP': '0.0.0.0', 'ENTERPRISE_GATEWAY_PORT_8877_TCP_ADDR': '172.20.43.27', 'EG_CULL_INTERVAL': '60', 'EG_IP': '0.0.0.0', 'SHLVL': '0', 'CONDA_DIR': '/opt/conda', 'ENTERPRISE_GATEWAY_SERVICE_PORT': '8888', 'SPARK_HOME': '/opt/spark', 'KUBERNETES_PORT_443_TCP_PROTO': 'tcp', 'KG_PORT_RETRIES': '0', 'KUBERNETES_PORT_443_TCP_ADDR': '172.20.77.1', 'SPARK_VER': '3.2.1', 'ENTERPRISE_GATEWAY_PORT': 'tcp://172.20.43.27:8888', 'NB_USER': 'jovyan', 'KUBERNETES_SERVICE_HOST': '172.20.77.1', 'ENTERPRISE_GATEWAY_PORT_8888_TCP_PROTO': 'tcp', 'LC_ALL': 'en_US.UTF-8', 'KUBERNETES_PORT': 'tcp://172.20.77.1:443', 'KUBERNETES_PORT_443_TCP_PORT': '443', 'PATH': '/opt/conda/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin', 'EG_KERNEL_CLUSTER_ROLE': 'kernel-controller', 'DEBIAN_FRONTEND': 'noninteractive', 'KERNEL_USERNAME': 'jovyan', 'KERNEL_GATEWAY': '1', 'KERNEL_POD_NAME': 'jovyan-8fb19249-6000-4847-9f2e-cdd3e70d7eb8', 'KERNEL_SERVICE_ACCOUNT_NAME': 'default', 'KERNEL_NAMESPACE': 'jovyan-8fb19249-6000-4847-9f2e-cdd3e70d7eb8', 'KERNEL_IMAGE': 'elyra/kernel-py:dev', 'KERNEL_EXECUTOR_IMAGE': 'elyra/kernel-py:dev', 'KERNEL_UID': '1000', 'KERNEL_GID': '100', 'EG_MIN_PORT_RANGE_SIZE': '1000', 'EG_MAX_PORT_RANGE_RETRIES': '5', 'KERNEL_ID': '8fb19249-6000-4847-9f2e-cdd3e70d7eb8', 'KERNEL_LANGUAGE': 'python', 'EG_IMPERSONATION_ENABLED': 'False'}
Since it appears it's finding the public local IP, you might try setting the env EG_PROHIBITED_LOCAL_IPS to 10.100.44.42, and I suspect it will then likely find 172.20.43.27. That's what that env is intended for - to help mitigate ambiguities.
I will try to deploy your helm chart in my environment upon my return next week, although I suspect this issue stems from the cluster's configuration more than from EG, so I probably won't be able to reproduce the issue (but we'll see).
@kevin-bates thank you for the reply!
I'm assuming they're running within the same network - correct?
I'm 99% sure because they run in the same Kubernetes cluster in different namespaces (it does work for other services though). I also did connect via ssh to the remote kernel pod and was able to communicate with the REST API.
10.100.44.42 is indeed an IP address in our subnet (it should be the node IP). 172.20.43.27 does look like a local (container/pod) IP; is that the intended behavior?
I will give EG_PROHIBITED_LOCAL_IPS: '10.100.44.42' a try. TYSM for the reply @kevin-bates!
I will let you know about the outcome.
10.100.44.42 is indeed an IP address in our subnet (it should be the node IP). 172.20.43.27 does look like a local (container/pod) IP, is that the intended behavior?
Yes, internal is preferred unless the kernel is running in an external network.
@kevin-bates I started my local jupyterlab with:
EG_PROHIBITED_LOCAL_IPS='10.100.*.*' python3 -m jupyterlab --debug \
--gateway-url=http://enterprise-gateway.ns-jupyter:8888 \
--GatewayClient.http_user=guest \
--GatewayClient.http_pwd=guest-password \
--GatewayClient.request_timeout=240.0 \
--GatewayClient.connect_timeout=240.0
My remote kernel logs:
/usr/local/bin/bootstrap-kernel.sh env: SHELL=/bin/bash KUBERNETES_SERVICE_PORT_HTTPS=443 KUBERNETES_SERVICE_PORT=443 KERNEL_NAME=python_kubernetes HOSTNAME=guest-51985e3e-f3e8-4a34-a4e2-69d44c201ce3 LANGUAGE=en_US.UTF-8 KERNEL_SPARK_CONTEXT_INIT_MODE=none KERNEL_ID=51985e3e-f3e8-4a34-a4e2-69d44c201ce3 NB_UID=1000 PWD=/home/jovyan RESPONSE_ADDRESS=10.100.44.42:8877 MINICONDA_MD5=87e77f097f6ebb5127c77662dfc3165e HOME=/home/jovyan LANG=en_US.UTF-8 KUBERNETES_PORT_443_TCP=tcp://172.20.77.1:443 NB_GID=100 XDG_CACHE_HOME=/home/jovyan/.cache/ SHLVL=0 CONDA_DIR=/opt/conda MINICONDA_VERSION=4.8.2 KUBERNETES_PORT_443_TCP_PROTO=tcp KUBERNETES_PORT_443_TCP_ADDR=172.20.77.1 PORT_RANGE=0..0 KERNEL_USERNAME=guest KERNEL_LANGUAGE=python CONDA_VERSION=4.8.2 NB_USER=jovyan KUBERNETES_SERVICE_HOST=172.20.77.1 LC_ALL=en_US.UTF-8 KUBERNETES_PORT=tcp://172.20.77.1:443 KUBERNETES_PORT_443_TCP_PORT=443 PATH=/opt/conda/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/opt/conda/bin PUBLIC_KEY=MIGfMA0GCSqGSIb3DQEBAQUAA4GNADCBiQKBgQDEfiWkzCCMl/VFI8J2042RvWh13bSihVo+xp6HQnnQ8YWO5MsyW/nelzcMa2eBJWB+Yg/IQ/0q6BRog7oqDpUNbUxwGSzU3TyBYeRQCtXynR/EjFNyswE6gQrg15GbFxwmz4nfMkKXtlpItLrslcUqVY+wlUd+sdbJe9YMLp3REwIDAQAB DEBIAN_FRONTEND=noninteractive KERNEL_NAMESPACE=guest-51985e3e-f3e8-4a34-a4e2-69d44c201ce3 _=/usr/bin/env
+ python /usr/local/bin/kernel-launchers/python/scripts/launch_ipykernel.py --kernel-id 51985e3e-f3e8-4a34-a4e2-69d44c201ce3 --port-range 0..0 --response-address 10.100.44.42:8877 --public-key MIGfMA0GCSqGSIb3DQEBAQUAA4GNADCBiQKBgQDEfiWkzCCMl/VFI8J2042RvWh13bSihVo+xp6HQnnQ8YWO5MsyW/nelzcMa2eBJWB+Yg/IQ/0q6BRog7oqDpUNbUxwGSzU3TyBYeRQCtXynR/EjFNyswE6gQrg15GbFxwmz4nfMkKXtlpItLrslcUqVY+wlUd+sdbJe9YMLp3REwIDAQAB --spark-context-initialization-mode none
[D 2022-10-10 06:45:21,298.298 launch_ipykernel] Using connection file '/tmp/kernel-51985e3e-f3e8-4a34-a4e2-69d44c201ce3_kpviwi08.json'.
[I 2022-10-10 06:45:21,300.300 launch_ipykernel] Signal socket bound to host: 0.0.0.0, port: 59223
[D 2022-10-10 06:45:21,301.301 launch_ipykernel] JSON Payload 'b'{"shell_port": 44921, "iopub_port": 46351, "stdin_port": 38143, "control_port": 32833, "hb_port": 43811, "ip": "0.0.0.0", "key": "b6f18ebe-f585-4a45-9897-6f347c3f6ae3", "transport": "tcp", "signature_scheme": "hmac-sha256", "kernel_name": "", "pid": 9, "pgid": 7, "comm_port": 59223, "kernel_id": "51985e3e-f3e8-4a34-a4e2-69d44c201ce3"}'
[D 2022-10-10 06:45:21,348.348 launch_ipykernel] Encrypted Payload 'b'eyJ2ZXJzaW9uIjogMSwgImtleSI6ICJYd1ZyU2MwUWFoSlIwZHJ3YWNmaThaYTYzTWtQM3ZkTGtEMnl2b0NJc0I5SUMyOTlSU3A4c2w2N1d3VGxXSzBtME1ERXpjU3VVdzJVMjltL0R3aWhTNVVpaDFmZk9JaU5RRGxwcDhkKzdDSHM2c3ZXZnE0S29TOWYrMjlxYjl0WDlGdmVXRXNXbXlCc1hWeTZVTDZRZG90QUJXc29SUGE2YzI4UVc2SGlGUXM9IiwgImNvbm5faW5mbyI6ICJaaXFmeFl3UUxobW5HdUxGY0N4S285SFZVSEcrcFYzS3RWaTg5UDdkQnF0bi9EeWMvclo2eUVLaHhkSWpRSXR1Sm9URDZuTzFEN3FDN2pCVFhWTmZ4akRGNjlCYXBnUWVQVzFrOXN5dnRWK0lBTDM1MnpzWFhKeWgxZFE4ZUFyM2F1Mm1tWUFRMVExRzZvbG5kSTlBS2hrSk5KRWo4SC9QVE5zWU9lMFpPZUtpMlF1YTk4QmNRZ3dSaGgzSGpXTE92ZmJBejdBelYvREdzN0hZYjVZSERDUmVuNk1iaElBV21Za1ZQV21mMjB1VlorK2kwSVg5eUFBVDZ2YUR6UWkvcnNEWVNHQ2dUOVhDQ0o5Uk9venBDTXp4NkJuNXJ3Ly9qWGgzNGZqTjRkSGdOa0RMWEtWMWx1QUpDaTJDZ2pUWUxJL2loTkVWeTFwVWl5cnlneUJadG0vdys0eHlJd0F3ay9nK0ZVTjlQaC9sUG92MDNoVnJZTUdOa1JrdjJoVm9vNTA2dVFMbE1kR04zY3dIT294TFdYTk5qZWxUVXBsUkNsUGV4UHFTRS9XMEZ0U0dyYWRNTDVMSExzUEI3TmFmNi9uSjloYmNKN2hvUEtNRWZUSHJ3QT09In0='
and enterprise-gateway logs:
[D 2022-10-10 06:49:08.493 EnterpriseGatewayApp] Waiting for KernelID '51985e3e-f3e8-4a34-a4e2-69d44c201ce3' to send connection info from host 'guest-51985e3e-f3e8-4a34-a4e2-69d44c201ce3' - retrying...
[D 2022-10-10 06:49:09.022 EnterpriseGatewayApp] 417: Waiting to connect to k8s pod in namespace 'guest-51985e3e-f3e8-4a34-a4e2-69d44c201ce3'. Name: 'guest-51985e3e-f3e8-4a34-a4e2-69d44c201ce3', Status: 'Running', Pod IP: '10.100.18.194', KernelID: '51985e3e-f3e8-4a34-a4e2-69d44c201ce3'
[D 2022-10-10 06:49:09.048 EnterpriseGatewayApp] Waiting for KernelID '51985e3e-f3e8-4a34-a4e2-69d44c201ce3' to send connection info from host 'guest-51985e3e-f3e8-4a34-a4e2-69d44c201ce3' - retrying...
[D 2022-10-10 06:49:09.578 EnterpriseGatewayApp] 418: Waiting to connect to k8s pod in namespace 'guest-51985e3e-f3e8-4a34-a4e2-69d44c201ce3'. Name: 'guest-51985e3e-f3e8-4a34-a4e2-69d44c201ce3', Status: 'Running', Pod IP: '10.100.18.194', KernelID: '51985e3e-f3e8-4a34-a4e2-69d44c201ce3'
[D 2022-10-10 06:49:09.606 EnterpriseGatewayApp] Waiting for KernelID '51985e3e-f3e8-4a34-a4e2-69d44c201ce3' to send connection info from host 'guest-51985e3e-f3e8-4a34-a4e2-69d44c201ce3' - retrying...
[D 2022-10-10 06:49:10.131 EnterpriseGatewayApp] 419: Waiting to connect to k8s pod in namespace 'guest-51985e3e-f3e8-4a34-a4e2-69d44c201ce3'. Name: 'guest-51985e3e-f3e8-4a34-a4e2-69d44c201ce3', Status: 'Running', Pod IP: '10.100.18.194', KernelID: '51985e3e-f3e8-4a34-a4e2-69d44c201ce3'
[D 2022-10-10 06:49:10.158 EnterpriseGatewayApp] Waiting for KernelID '51985e3e-f3e8-4a34-a4e2-69d44c201ce3' to send connection info from host 'guest-51985e3e-f3e8-4a34-a4e2-69d44c201ce3' - retrying...
[D 2022-10-10 06:49:10.689 EnterpriseGatewayApp] 420: Waiting to connect to k8s pod in namespace 'guest-51985e3e-f3e8-4a34-a4e2-69d44c201ce3'. Name: 'guest-51985e3e-f3e8-4a34-a4e2-69d44c201ce3', Status: 'Running', Pod IP: '10.100.18.194', KernelID: '51985e3e-f3e8-4a34-a4e2-69d44c201ce3'
[D 2022-10-10 06:49:10.712 EnterpriseGatewayApp] Waiting for KernelID '51985e3e-f3e8-4a34-a4e2-69d44c201ce3' to send connection info from host 'guest-51985e3e-f3e8-4a34-a4e2-69d44c201ce3' - retrying...
[D 2022-10-10 06:49:11.234 EnterpriseGatewayApp] 421: Waiting to connect to k8s pod in namespace 'guest-51985e3e-f3e8-4a34-a4e2-69d44c201ce3'. Name: 'guest-51985e3e-f3e8-4a34-a4e2-69d44c201ce3', Status: 'Running', Pod IP: '10.100.18.194', KernelID: '51985e3e-f3e8-4a34-a4e2-69d44c201ce3'
[D 2022-10-10 06:49:11.255 EnterpriseGatewayApp] Waiting for KernelID '51985e3e-f3e8-4a34-a4e2-69d44c201ce3' to send connection info from host 'guest-51985e3e-f3e8-4a34-a4e2-69d44c201ce3' - retrying...
[D 2022-10-10 06:49:11.779 EnterpriseGatewayApp] 422: Waiting to connect to k8s pod in namespace 'guest-51985e3e-f3e8-4a34-a4e2-69d44c201ce3'. Name: 'guest-51985e3e-f3e8-4a34-a4e2-69d44c201ce3', Status: 'Running', Pod IP: '10.100.18.194', KernelID: '51985e3e-f3e8-4a34-a4e2-69d44c201ce3'
[D 2022-10-10 06:49:11.807 EnterpriseGatewayApp] Waiting for KernelID '51985e3e-f3e8-4a34-a4e2-69d44c201ce3' to send connection info from host 'guest-51985e3e-f3e8-4a34-a4e2-69d44c201ce3' - retrying...
[D 2022-10-10 06:49:12.336 EnterpriseGatewayApp] 423: Waiting to connect to k8s pod in namespace 'guest-51985e3e-f3e8-4a34-a4e2-69d44c201ce3'. Name: 'guest-51985e3e-f3e8-4a34-a4e2-69d44c201ce3', Status: 'Running', Pod IP: '10.100.18.194', KernelID: '51985e3e-f3e8-4a34-a4e2-69d44c201ce3'
[D 2022-10-10 06:49:12.363 EnterpriseGatewayApp] Waiting for KernelID '51985e3e-f3e8-4a34-a4e2-69d44c201ce3' to send connection info from host 'guest-51985e3e-f3e8-4a34-a4e2-69d44c201ce3' - retrying...
[D 2022-10-10 06:49:12.896 EnterpriseGatewayApp] 424: Waiting to connect to k8s pod in namespace 'guest-51985e3e-f3e8-4a34-a4e2-69d44c201ce3'. Name: 'guest-51985e3e-f3e8-4a34-a4e2-69d44c201ce3', Status: 'Running', Pod IP: '10.100.18.194', KernelID: '51985e3e-f3e8-4a34-a4e2-69d44c201ce3'
[D 2022-10-10 06:49:12.925 EnterpriseGatewayApp] Waiting for KernelID '51985e3e-f3e8-4a34-a4e2-69d44c201ce3' to send connection info from host 'guest-51985e3e-f3e8-4a34-a4e2-69d44c201ce3' - retrying...
[D 2022-10-10 06:49:13.448 EnterpriseGatewayApp] 425: Waiting to connect to k8s pod in namespace 'guest-51985e3e-f3e8-4a34-a4e2-69d44c201ce3'. Name: 'guest-51985e3e-f3e8-4a34-a4e2-69d44c201ce3', Status: 'Running', Pod IP: '10.100.18.194', KernelID: '51985e3e-f3e8-4a34-a4e2-69d44c201ce3'
[D 2022-10-10 06:49:13.475 EnterpriseGatewayApp] Waiting for KernelID '51985e3e-f3e8-4a34-a4e2-69d44c201ce3' to send connection info from host 'guest-51985e3e-f3e8-4a34-a4e2-69d44c201ce3' - retrying...
[D 2022-10-10 06:49:13.999 EnterpriseGatewayApp] 426: Waiting to connect to k8s pod in namespace 'guest-51985e3e-f3e8-4a34-a4e2-69d44c201ce3'. Name: 'guest-51985e3e-f3e8-4a34-a4e2-69d44c201ce3', Status: 'Running', Pod IP: '10.100.18.194', KernelID: '51985e3e-f3e8-4a34-a4e2-69d44c201ce3'
[D 2022-10-10 06:49:14.027 EnterpriseGatewayApp] Waiting for KernelID '51985e3e-f3e8-4a34-a4e2-69d44c201ce3' to send connection info from host 'guest-51985e3e-f3e8-4a34-a4e2-69d44c201ce3' - retrying...
[D 2022-10-10 06:49:14.558 EnterpriseGatewayApp] 427: Waiting to connect to k8s pod in namespace 'guest-51985e3e-f3e8-4a34-a4e2-69d44c201ce3'. Name: 'guest-51985e3e-f3e8-4a34-a4e2-69d44c201ce3', Status: 'Running', Pod IP: '10.100.18.194', KernelID: '51985e3e-f3e8-4a34-a4e2-69d44c201ce3'
[D 2022-10-10 06:49:14.586 EnterpriseGatewayApp] Waiting for KernelID '51985e3e-f3e8-4a34-a4e2-69d44c201ce3' to send connection info from host 'guest-51985e3e-f3e8-4a34-a4e2-69d44c201ce3' - retrying...
[D 2022-10-10 06:49:15.121 EnterpriseGatewayApp] 428: Waiting to connect to k8s pod in namespace 'guest-51985e3e-f3e8-4a34-a4e2-69d44c201ce3'. Name: 'guest-51985e3e-f3e8-4a34-a4e2-69d44c201ce3', Status: 'Running', Pod IP: '10.100.18.194', KernelID: '51985e3e-f3e8-4a34-a4e2-69d44c201ce3'
[D 2022-10-10 06:49:15.151 EnterpriseGatewayApp] Waiting for KernelID '51985e3e-f3e8-4a34-a4e2-69d44c201ce3' to send connection info from host 'guest-51985e3e-f3e8-4a34-a4e2-69d44c201ce3' - retrying...
[D 2022-10-10 06:49:15.687 EnterpriseGatewayApp] 429: Waiting to connect to k8s pod in namespace 'guest-51985e3e-f3e8-4a34-a4e2-69d44c201ce3'. Name: 'guest-51985e3e-f3e8-4a34-a4e2-69d44c201ce3', Status: 'Running', Pod IP: '10.100.18.194', KernelID: '51985e3e-f3e8-4a34-a4e2-69d44c201ce3'
[D 2022-10-10 06:49:15.715 EnterpriseGatewayApp] Waiting for KernelID '51985e3e-f3e8-4a34-a4e2-69d44c201ce3' to send connection info from host 'guest-51985e3e-f3e8-4a34-a4e2-69d44c201ce3' - retrying...
[D 2022-10-10 06:49:16.268 EnterpriseGatewayApp] KubernetesProcessProxy.terminate_container_resources, pod: guest-51985e3e-f3e8-4a34-a4e2-69d44c201ce3.guest-51985e3e-f3e8-4a34-a4e2-69d44c201ce3, kernel ID: 51985e3e-f3e8-4a34-a4e2-69d44c201ce3 has been terminated.
[E 2022-10-10 06:49:16.275 EnterpriseGatewayApp] KernelID: '51985e3e-f3e8-4a34-a4e2-69d44c201ce3' launch timeout due to: Waited too long (238.0s) to get connection file
[E 221010 06:49:16 web:2239] 500 POST /api/kernels (127.0.0.1) 238663.10ms
I don't quite understand why enterprise-gateway still tries to work with the remote IP address in 10.100.0.0/16.
Thank you for the information and help so far!
Just to follow up, I am currently stuck at this point.
Is there any way that I can potentially start a kernel locally to debug it? I am currently investigating how this could be further debugged.
EG_PROHIBITED_LOCAL_IPS='10.100.*.*' python3 -m jupyterlab --debug \ --gateway-url=http://enterprise-gateway.ns-jupyter:8888 \ --GatewayClient.http_user=guest \ --GatewayClient.http_pwd=guest-password \ --GatewayClient.request_timeout=240.0 \ --GatewayClient.connect_timeout=240.0
This is setting the env EG_PROHIBITED_LOCAL_IPS only in the jupyter lab process - where it does not apply. You need to restart your EG process (Kubernetes pod) with this env set. If you deploy using the helm charts, then add the following entry to the env stanza in the deployment.yaml file:
- name: EG_PROHIBITED_LOCAL_IPS
value: "10.100.*.*"
(I'm not certain whether the quotes are necessary or not.)
You can docker exec into the EG pod and confirm its env prior to launching a kernel.
Is there any way that I can potentially start a kernel locally to debug it?
You could try using python_distributed and set EG_REMOTE_HOSTS to "localhost". This will use the DistributedProcessProxy (not the KubernetesProcessProxy), but I believe the EG_PROHIBITED_LOCAL_IPS portion of things is still similar - although the K8s env introduces other networks (like the internal docker networks) that DistributedProcessProxy does not.
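For reference, a minimal sketch of what that experiment would mean for the EG pod's env stanza (sketch only; note that your deployment currently restricts EG_ALLOWED_KERNELS to python_kubernetes and spark_python_operator, so python_distributed would presumably need to be added there as well):
- name: EG_REMOTE_HOSTS
  value: "localhost"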
Your primary issue wrt this last exercise is that you're not setting the env into the appropriate process. I recommend sticking with the K8s env for a bit longer since the symptoms are somewhat specific to that env.
Thanks @kevin-bates. Setting the EG_PROHIBITED_LOCAL_IPS on the deployment.yaml did work but prevents all connections to/from enterprise-gateway.
I have enterprise-gateway deployed with istio; I will remove it and retry. If that doesn't help, I will set EG_REMOTE_HOSTS to localhost.
I will get back with the results tomorrow.
Meanwhile, would an extraEnv configuration option (as it is done in the jupyter helm file) for enterprise-gateway be a thing? I can quickly craft a PR on that.
Setting the EG_PROHIBITED_LOCAL_IPS on the deployment.yaml did work but prevents all connections to/from enterprise-gateway.
Hmm - this should not have any bearing on the accessibility of EG from applications. Could you clarify what you mean by prevents all connections to/from enterprise-gateway?
I have enterprise-gateway with istio deployed
Hmm, might istio be preventing the response to port 8877 in the first place? That port number is configurable and perhaps something that you need to configure in your ingress.
I will set EG_REMOTE_HOSTS to localhost.
I'm not sure how useful this experiment will be; it may not be worth the effort.
Meanwhile, would an extraEnv configuration option (as it is done in the jupyter helm file) for enterprise-gateway be a thing? I can quickly craft a PR on that.
Could you please clarify what you mean by this as well? Typically k8s deployments are performed via helm or some other form of yaml and you're free to add whatever you want - so some details would be helpful.
Hmm - this should not have any bearing on the accessibility of EG from applications. Could you clarify what you mean by prevents all connections to/from enterprise-gateway?
I cannot communicate with the REST API from within the cluster:
curl -vvv http://enterprise-gateway.ns-jupyter:8888/api/kernelspecs
* Trying 127.1.41.1:8888...
* connect to 127.1.41.1 port 8888 failed: Connection refused
* Failed to connect to enterprise-gateway.ns-jupyter port 8888 after 10 ms: Connection refused
* Closing connection 0
curl: (7) Failed to connect to enterprise-gateway.ns-jupyter port 8888 after 10 ms: Connection refused
Hmm, might istio be preventing the response to port 8877 in the first place? That port number is configurable and perhaps something that you need to configure in your ingress.
I have no ingress running for enterprise-gateway; is an ingress mandatory?
I am testing with kubefwd and depend on cluster-internal service resolution via service-name.namespace.svc.cluster.local as described in https://kubernetes.io/docs/concepts/services-networking/dns-pod-service/.
Could you please clarify what you mean by this as well? Typically k8s deployments are performed via helm or some other form of yaml and you're free to add whatever you want - so some details would be helpful.
Sure! For the Jupyterhub k8s deployment (via helm), there is this option in the values.yaml: https://github.com/jupyterhub/zero-to-jupyterhub-k8s/blob/main/jupyterhub/values.yaml#L76, which is then consumed in the hub deployment template.
The advantage is that an operator can directly add the env vars in the values-override.yaml instead of modifying the deployment.yaml (especially nice when extending the helm chart).
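For illustration, the kind of option I have in mind might look roughly like this in a values-override.yaml (a sketch modeled on the z2jh option; extraEnv is not an existing value in the EG chart yet, which is the point of the PR):
extraEnv:
  - name: EG_PROHIBITED_LOCAL_IPS
    value: "10.100.*.*"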
Hmm - this should not have any bearing on the accessibility of EG from applications. Could you clarify what you mean by prevents all connections to/from enterprise-gateway?
I cannot communicate with the REST API from within the cluster:
EG_PROHIBITED_LOCAL_IPS is only used within the process proxies and should not affect access to EG itself. I'm curious how you determined this worked if you can't trigger the creation of a kernel?
Could you please clarify what you mean by this as well? Typically k8s deployments are performed via helm or some other form of yaml and you're free to add whatever you want - so some details would be helpful.
The advantage is that an operator can directly add the env vars in the values-override.yaml instead of modifying the deployment.yaml (especially nice, when extending the helm chart).
I see, yes, that is helpful. A PR would be great!
I have no ingress running for enterprise-gateway; is an ingress mandatory? I am testing with kubefwd and depend on cluster internal service resolution via service-name.namespace.svc.cluster.local as described in https://kubernetes.io/docs/concepts/services-networking/dns-pod-service/.
No, an ingress is not mandatory; just some form of reverse proxy is recommended, and it looks like you're using Hub. I'm not familiar with kubefwd, but I wonder whether what you're doing is interfering with the kernel pods' ability to communicate their connection information back to EG for whatever reason.
EG_PROHIBITED_LOCAL_IPS is only used within the process proxies and should not affect access to EG itself. I'm curious how you determined this worked if you can't trigger the creation of a kernel?
Sorry, I hope that I can clear up the confusion. kubefwd does forward all service connections to my host so that I can communicate as if my host were part of the cluster network.
I have to restart kubefwd after a redeployment of enterprise-gateway because it does not refresh connections (it tries to but fails). So this was really an issue on my side because I forgot to restart kubefwd. Sorry about that!
Just some form of a reverse proxy is recommended and it looks like you're using Hub.
That is my ultimate goal, but I first want to make it work with jupyterlab because the development/fix cycle is faster and there are fewer moving parts.
I'm not familiar with kubefwd but wondering if what you're doing is exacerbating this ability for the kernel pods to communicate their connection information back to EG for whatever reason.
We use kubefwd consistently during development when developing against cluster infrastructure: https://github.com/txn2/kubefwd
I see, yes, that is helpful. A PR would be great!
I will craft one after I get it running. :)
Update on my side: I have now added
- name: EG_PROHIBITED_LOCAL_IPS
value: '10.100.*.*'
- name: EG_RESPONSE_ADDRESS
value: '172.20.71.6:8877'
to the EG deployment.yaml, and it seems that the response_address is not propagated to the kernel as I'd expect from looking at the code, whereas the environment variables do exist in the enterprise-gateway pod:
jovyan@enterprise-gateway-6bc565d956-gb4s2:/usr/local/bin$ printenv | grep EG_RESPONSE
EG_RESPONSE_ADDRESS=172.20.71.6:8877
EG_RESPONSE_PORT=8877
+ python /usr/local/bin/kernel-launchers/python/scripts/launch_ipykernel.py --kernel-id 4b6d95f4-241d-4644-b622-f6ff4b54814a --port-range 0..0 --response-address 10.100.35.193:8877 --public-key MIGfMA0GCSqGSIb3DQEBAQUAA4GNADCBiQKBgQDSQP9YFtzoY1v+VwYXd09x/fNEDSFIASwjoAoNA5jiOAKQujgw/xxBge1SnovvlGDjOFkkuK1bfRvECYnHafM98hRGlRVGXzbbw5d6hDHUXQMdXgh1JQJFAV8vMI6o3Sqm3ZJRodYuUDvPbbJRNhSbQEEVuzZN5R5p382gxUUFTQIDAQAB --spark-context-initialization-mode none
/usr/local/bin/bootstrap-kernel.sh env: SHELL=/bin/bash KUBERNETES_SERVICE_PORT_HTTPS=443 KUBERNETES_SERVICE_PORT=443 KERNEL_NAME=python_kubernetes HOSTNAME=guest-4b6d95f4-241d-4644-b622-f6ff4b54814a LANGUAGE=en_US.UTF-8 KERNEL_SPARK_CONTEXT_INIT_MODE=none GUEST_1F146A87_8304_41D7_9193_8584C02CF412_UI_SVC_SERVICE_PORT_SPARK_DRIVER_UI_PORT=4040 KERNEL_ID=4b6d95f4-241d-4644-b622-f6ff4b54814a NB_UID=1000 GUEST_1F146A87_8304_41D7_9193_8584C02CF412_UI_SVC_SERVICE_PORT=4040 GUEST_1F146A87_8304_41D7_9193_8584C02CF412_UI_SVC_PORT_4040_TCP_PROTO=tcp PWD=/home/jovyan RESPONSE_ADDRESS=10.100.35.193:8877 GUEST_1F146A87_8304_41D7_9193_8584C02CF412_UI_SVC_PORT_4040_TCP=tcp://172.20.18.60:4040 MINICONDA_MD5=87e77f097f6ebb5127c77662dfc3165e HOME=/home/jovyan LANG=en_US.UTF-8 KUBERNETES_PORT_443_TCP=tcp://172.20.77.1:443 NB_GID=100 GUEST_1F146A87_8304_41D7_9193_8584C02CF412_UI_SVC_PORT_4040_TCP_PORT=4040 XDG_CACHE_HOME=/home/jovyan/.cache/ SHLVL=0 CONDA_DIR=/opt/conda MINICONDA_VERSION=4.8.2 KUBERNETES_PORT_443_TCP_PROTO=tcp KUBERNETES_PORT_443_TCP_ADDR=172.20.77.1 PORT_RANGE=0..0 KERNEL_USERNAME=guest KERNEL_LANGUAGE=python CONDA_VERSION=4.8.2 NB_USER=jovyan KUBERNETES_SERVICE_HOST=172.20.77.1 LC_ALL=en_US.UTF-8 KUBERNETES_PORT=tcp://172.20.77.1:443 KUBERNETES_PORT_443_TCP_PORT=443 PATH=/opt/conda/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/opt/conda/bin GUEST_1F146A87_8304_41D7_9193_8584C02CF412_UI_SVC_PORT=tcp://172.20.18.60:4040 GUEST_1F146A87_8304_41D7_9193_8584C02CF412_UI_SVC_PORT_4040_TCP_ADDR=172.20.18.60 PUBLIC_KEY=MIGfMA0GCSqGSIb3DQEBAQUAA4GNADCBiQKBgQDSQP9YFtzoY1v+VwYXd09x/fNEDSFIASwjoAoNA5jiOAKQujgw/xxBge1SnovvlGDjOFkkuK1bfRvECYnHafM98hRGlRVGXzbbw5d6hDHUXQMdXgh1JQJFAV8vMI6o3Sqm3ZJRodYuUDvPbbJRNhSbQEEVuzZN5R5p382gxUUFTQIDAQAB GUEST_1F146A87_8304_41D7_9193_8584C02CF412_UI_SVC_SERVICE_HOST=172.20.18.60 DEBIAN_FRONTEND=noninteractive KERNEL_NAMESPACE=ns-spark-apps _=/usr/bin/env
[D 2022-10-12 08:39:46,466.466 launch_ipykernel] Using connection file '/tmp/kernel-4b6d95f4-241d-4644-b622-f6ff4b54814a_v5j3dkkk.json'.
[I 2022-10-12 08:39:46,470.470 launch_ipykernel] Signal socket bound to host: 0.0.0.0, port: 39427
Traceback (most recent call last):
  File "/usr/local/bin/kernel-launchers/python/scripts/launch_ipykernel.py", line 616, in <module>
    connection_file, response_addr, lower_port, upper_port, kernel_id, public_key
  File "/usr/local/bin/kernel-launchers/python/scripts/launch_ipykernel.py", line 269, in return_connection_info
    s.connect((response_ip, response_port))
ConnectionRefusedError: [Errno 111] Connection refused
I will test if dropping envoy proxies does help.
I can confirm that the issue is due to istio, i.e. the envoy proxy sidecars. I guess that the hotfix for now is to not deploy it with istio.
@kevin-bates Is there interest from the enterprise-gateway maintainer side to support istio?
I can confirm that the issue is due to istio
Great news.
Is there interest from the enterprise-gateway maintainer side to support istio?
We are always interested in supporting configurations our users need. That said, I don't think any of the current maintainers have the bandwidth and/or resources to take this on - so istio's support would need to come in the form of a contribution.
Regarding your previous troubleshooting, setting the env EG_RESPONSE_ADDRESS isn't going to do anything but set that env value in the EG pod. Nothing uses that env. Instead, you should set EG_RESPONSE_IP and EG_RESPONSE_PORT (although the latter defaults to 8877, so its initialization is not necessary). However, the fact that using EG_PROHIBITED_LOCAL_IPS to mask out 10.100.*.* still produces a response IP of 10.100.n.n indicates that EG cannot find any other local IPs and needs to fall back to the one it finds.
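Concretely, the entries from your last attempt would become something like the following (sketch only; the IP is taken from your example and should be whatever address EG should advertise to kernels):
- name: EG_PROHIBITED_LOCAL_IPS
  value: "10.100.*.*"
- name: EG_RESPONSE_IP
  value: "172.20.71.6"
# EG_RESPONSE_PORT defaults to 8877, so it does not need to be set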
When you got this working by removing Istio from the equation, what kind of response address was computed?
We are always interested in supporting configurations our users need. That said, I don't think any of the current maintainers have the bandwidth and/or resources to take this on - so istio's support would need to come in the form of a contribution.
Maybe https://github.com/splunk/jupyterhub-istio-proxy can serve as a blueprint for implementation. I have a tight schedule, but maybe I can pour some time into it. I will have a look at https://jupyter-enterprise-gateway.readthedocs.io/en/latest/contributors/devinstall.html and try to come up with a PR that extends the proxies' ability to support istio's envoy sidecar reverse proxies.
Do you want to keep this issue open or start a separate issue for the istio service mesh extension?
I also noticed that if enterprise-gateway runs outside of the istio service mesh, it won't even start python based spark-operator kernels (when spark-operator is running in the service mesh). It won't start a spark driver.
However, the fact that using EG_PROHIBITED_LOCAL_IPS to mask out 10.100.*.* still produces a response IP of 10.100.n.n indicates that EG cannot find any other local IPs and needs to fall back to the one it finds.
That fallback is a really nice to know! Thank you! :)
When you got this working by removing Istio from the equation, what kind of response address was computed?
It is an IP within the same /16 subnet, despite EG_PROHIBITED_LOCAL_IPS being set to 10.100.*.*:
+ python /usr/local/bin/kernel-launchers/python/scripts/launch_ipykernel.py --kernel-id 443cbb3d-f70f-44a5-8670-3ca57938ccd6 --port-range 0..0 --response-address 10.100.21.198:8877 --public-key MIGfMA0GCSqGSIb3DQEBAQUAA4GNADCBiQKBgQCrzyqh/7jryyFFvLJ20XDI1rsGdatlROT7in70oJCfR2F6FEhwdexv1cVleM6OTTN8NLbvZnUPk+lOKuYxrfNqjJO9wqEd27hM/MtbYPvL5e5v92LH5xiaagWdI7KQfWQfH1t3vnZ4PtoJsxb45ZQvIiDg0vSMjw8NxWhDZpeOxwIDAQAB --spark-context-initialization-mode none
try to come up with a PR that extends the proxies ability to support istio's envoy sidecar reverse proxies.
I don't know anything about Istio, but this implies it gets involved in intra-cluster communications (between pods). Is that correct?
I was hoping this could be something that is configured (either via helm) or within the kernel-pod.yaml used to launch the kernel pods and not require "source code" changes.
I don't know anything about Istio, but this implies it gets involved in intra-cluster communications (between pods). Is that correct?
Yes, istio basically spawns a sidecar when instructed and handles communication between pods through envoy proxies (thus allowing for secure and traceable transmission).
I was hoping this could be something that is configured (either via helm) or within the kernel-pod.yaml used to launch the kernel pods and not require "source code" changes.
I can promise that I'll look for the least invasive solution that makes enterprise-gateway with istio work. I'm also puzzled why it doesn't work in the first place because the communication between pods is merely tunneled through the proxies AFAIU.
I'm currently looking into solutions that target the helm charts/declarations (e.g. a VirtualService for the enterprise-gateway).
Let's keep this issue open. I've gone ahead and amended the title to include the istio context. Thank you for your help.
@kevin-bates I tried to set
annotations:
proxy.istio.io/config: '{ "holdApplicationUntilProxyStarts": true }'
traffic.sidecar.istio.io/excludeOutboundPorts: "8877"
on the kernel pod but without success. I will try to debug the network connection from enterprise-gateway through envoy proxies to the kernel pod and vice versa and get back with the results.
@kevin-bates I did some tests on a clean cluster with EG-unrelated pods/services and monitored the istio traffic. AFAIU, istio does pod-to-pod communication via pod -> svc -> pod (apparently only via Endpoints, according to https://discuss.istio.io/t/503-between-pod-to-pod-communication-1-5-1/6121/15), because istio does not seem to have the information to route between the pods directly (contrary to kubedns), for whatever reason.
It makes sense given that the launch_kubernetes.py script is able to communicate with the kubernetes api (probably via the ClusterIP) when it creates the kernel pod. However, it means that EG_PROHIBITED_LOCAL_IPS set to 10.100.*.* had no effect (in the istio scenario).
Hence, my proposal for now is to add another k8s resource (a Service) to the kubernetes deployments and make it configurable, i.e. only deploy the Service when istio is configured (or we could additionally attempt to auto-detect it by looking at the enterprise-gateway namespace annotations).
What do you think? I'll test it meanwhile in my cluster and post some code to my ideas asap.
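Roughly, I imagine a chart toggle along these lines, which the new kernel Service template would be gated on (purely a sketch; the flag name is not final):
istio:
  enabled: true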
EDIT: https://istio.io/latest/docs/ops/deployment/requirements/#pod-requirements specifically states:
To be part of a mesh, Kubernetes pods must satisfy the following requirements:
- Service association: A pod must belong to at least one Kubernetes service even if the pod does NOT expose any port. If a pod belongs to multiple Kubernetes services, the services cannot use the same port number for different protocols, for instance HTTP and TCP.
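To make the "service association" concrete, a minimal per-kernel headless Service along the lines I'm proposing might look like the following; the selector label and the port are assumptions based on the logs above, not something the current charts create:
apiVersion: v1
kind: Service
metadata:
  name: guest-51985e3e-f3e8-4a34-a4e2-69d44c201ce3
  namespace: guest-51985e3e-f3e8-4a34-a4e2-69d44c201ce3
spec:
  clusterIP: None                 # headless; DNS resolves directly to the pod IP
  selector:
    kernel_id: 51985e3e-f3e8-4a34-a4e2-69d44c201ce3   # assumed kernel pod label
  ports:
    - name: tcp-comm              # istio expects declared, named ports
      port: 59223                 # the comm/signal port from the kernel log above
      protocol: TCP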
Is there any way to nicely debug enterprise-gateway during development? I've worked myself through https://jupyter-enterprise-gateway.readthedocs.io/en/latest/contributors/system-architecture.html and https://jupyter-enterprise-gateway.readthedocs.io/en/latest/contributors/devinstall.html but the development process is still kind of slow. make run-dev throws the error ERROR: jupyter_enterprise_gateway-*.whl is not a valid wheel filename, even though I haven't done any build beforehand, and I was not able to find anything relevant in the Makefile.
Hi @tahesse.
What do you think? I'll test it meanwhile in my cluster and post some code to my ideas asap.
I think a configurable approach is ideal, one that we can easily document and enable via helm deployments - thank you.
Service association: A pod must belong to at least one Kubernetes service even if the pod does NOT expose any port.
What does it mean to "belong to a Kubernetes service"? I'm assuming this implies the Service and the (kernel) pod must reside in the same namespace. Since kernel pods are primarily run in namespaces outside of EG's, does this imply that each launch of the kernel will result in the creation of its own service? And, if folks are specifying their own kernel namespace (via KERNEL_NAMESPACE), then we would only create a service if one doesn't already exist?
Is there any way to nicely debug enterprise-gateway during development?
Sorry for the hassles here. I use a Mac and run Rancher Desktop for my k8s development. My typical iteration is:
1. make clean dist enterprise-gateway to build the elyra/enterprise-gateway:dev image (clean probably isn't necessary, but I'm paranoid. :smile:)
2. make kernel-py to build elyra/kernel-py:dev
3. helm delete enterprise-gateway -n enterprise-gateway
4. helm upgrade --install enterprise-gateway etc/kubernetes/helm/enterprise-gateway -n enterprise-gateway
I create aliases for the helm deployments...
alias eg_deploy='helm upgrade --install enterprise-gateway etc/kubernetes/helm/enterprise-gateway -n enterprise-gateway'
alias eg_remove='helm delete enterprise-gateway -n enterprise-gateway'
and another to tail the EG logs...
alias eg_logs='kubectl logs -f deployment.apps/enterprise-gateway -n enterprise-gateway'
If others have an easier workflow, I'd love to hear from you.
What does it mean to "belong to a Kubernetes service"?
I think they refer to association through label and selector between pod/deployment and service.
Thanks for posting your workflow, that helps a lot for sure! I'm running my stuff on an Apple silicon (M1) Mac, which is quite the hassle with the arm vs. amd64 architectures. My Kubernetes (EKS) was partitioned for multi-tenant use using Loft. Roughly the same as Rancher I think, that is k8s in docker (kind).
TYSM!
So far, I had to modify a few python files to get an "istio_enabled" pivot which can be leveraged to spawn an additional headless k8s service. I'll do some testing tomorrow with the workflow steps you posted. Thanks again!
What do you think about having a kubernetes Service for the kernel pods at all times? And, in the case of istio, i.e. service discovery outside of kube-proxy, make the Service headless?
Read the kubernetes best practices for more information: https://kubernetes.io/docs/concepts/configuration/overview/#services
So far, my changes do not seem to be working with the headless service. As far as I understand, the process hangs at
I don't quite understand whether the socket connection might be the issue; moreover, it looks like there are socket connections on both sides, which is quite confusing.
I am not sure whether that is due to the missing ports (https://istio.io/latest/docs/ops/configuration/traffic-management/traffic-routing/#headless-services), which are generated within launch_ipykernel.py (there is no chance I can know the ports before they're dynamically generated), or due to the underlying communication (route issues like missing routes for the envoy proxies).
Needs more investigation...
What do you think about having a kubernetes Service for the kernel pods at all times?
You might be interested in #1181. This enables the ability to add your Service details into the kernel pod template.
Regarding the inability of the server to receive the connection information from the kernel pod, that's odd. However, one of the intentions of the single response address PR was to expose the response port outside of the EG service, but, assuming your kernel pods are within the same cluster, that shouldn't be necessary.
I'm sorry I'm not familiar with Istio.
Re: #1181, the ownerReferences is really nice! While testing, I was deleting the services manually (postponing the cleanup until I get something working). I adapted the service part with the ownerReference. Thanks!
Regarding the inability of the server to receive the connection information from the kernel pod, that's odd. However, one of the intentions of the single response address PR was to expose the response port outside of the EG service, but, assuming your kernel pods are within the same cluster, that shouldn't be necessary.
Fair enough, the introduced change regarding single response address sounds good and I see the problem with istio here.
I will now try to replace all IP-based communication with service-based communication (i.e. going through DNS rather than IPs). Somehow enterprise-gateway is picking up on the envoy proxies, but the envoy reverse proxies seem to block the communication. Normally, istio should allow that communication to happen if it bypasses the envoy reverse proxies.
Hi, Any update here how to get enterprise gateway work with istio? If any one has solved can you please provide details?
Hi, Any update here how to get enterprise gateway work with istio? If any one has solved can you please provide details?
Hey, maybe https://github.com/kubeflow/spark-operator/issues/1652 provides some help to you? I left my old job and am not working on this issue anymore.
Hello enterprise-gateway team!
I checked all other issues in the jupyter-server organization and googled a lot without luck yet.
Description
I am trying to run enterprise-gateway in my kubernetes cluster to be able to run remote kernels. I installed enterprise-gateway via helm with the chart from this repository. I extended the helm chart with my own helm chart which installs a namespace with istio labels before installing enterprise-gateway. My values-override.yaml looks as follows:
For testing purposes, I ran kubefwd for all namespaces to be able to communicate with the enterprise-gateway service. I can successfully call the enterprise-gateway REST endpoints from the CLI, e.g. curl http://enterprise-gateway.ns-jupyter:8888/api/kernelspecs | jq yields this JSON response.
However, when I try to spawn a remote kernel via CLI or via jupyter lab, it spawns the remote kernel (see the remote kernel logs), but enterprise-gateway is unable to connect to the remote kernel, as can be seen from the enterprise-gateway pod logs.
Reproduce
1. Install enterprise-gateway with the values-override.yaml from the previous section.
2. Start a 'python_kubernetes' kernel.
Expected behavior
I would expect that enterprise-gateway receives the connection information from the remote kernel so that enterprise-gateway does not timeout.
Context
Command Line Output
If I can provide any more information, please let me know!
Thank you for your work on enterprise-gateway, and thanks for any help / pointers in the right direction! :)