jupyter-server / enterprise_gateway

A lightweight, multi-tenant, scalable and secure gateway that enables Jupyter Notebooks to share resources across distributed clusters such as Apache Spark, Kubernetes and others.
https://jupyter-enterprise-gateway.readthedocs.io/en/latest/

Kernel pod into AWS Fargate #792

Closed lucabem closed 4 years ago

lucabem commented 4 years ago

Hi @kevin-bates. I am trying to spawn kernel pods into AWS Fargate. AWS Fargate works as a normal Kubernetes worker node, so I think it should be possible.

The problem is that each time we want to spawn a notebook, we create a node. Because of this, I get an error starting the kernel, since the elyra/kernel-py image has to be pulled first. I have tried modifying the variables EG_LAUNCH_TIMEOUT and KERNEL_LAUNCH_TIMEOUT, but I always get a timeout at 40 seconds.

Are there any other variables that tell it to keep waiting for the image pull, for example for 10 minutes, before reporting the kernel error?

[E 2020-03-19 09:28:30.158 EnterpriseGatewayApp] KernelID: 'c33d05ea-0da9-40c4-86ae-fc068559ae95' launch timeout due to: Waited too long (40.0s) to get connection file

Enterprise-gateway.yaml

 - name: EG_KERNEL_LAUNCH_TIMEOUT
   value: "600"

Kernel.json

"env": {
    "KERNEL_NAMESPACE": "kernel-ns",
    "KERNEL_SERVICE_ACCOUNT_NAME": "kernel-sa",
    "KERNEL_LAUNCH_TIMEOUT": "600"
  },

kernel-pod.yaml-j2

    - name: KERNEL_LAUNCH_TIMEOUT
      value: "{{ kernel_launch_timeout }}"

I have modified this method to print self.kernel_launch_timeout, and I get this:

[E 2020-03-19 11:31:02.430 EnterpriseGatewayApp] Timeout of 40.0 and KernelID: 'ff1c2065-d5af-427f-85ae-b8148bd5e3ff' launch timeout due to: Waited too long (40.0s) to get connection file

It looks like the pod ignores the env variables KERNEL_LAUNCH_TIMEOUT and EG_KERNEL_LAUNCH_TIMEOUT.

Version: 2.1.0

lucabem commented 4 years ago

I have edited the Jupyter Notebook variable as you discussed here, and I am now able to launch the pod.

But while the image is being pulled on the pod, Jupyter Notebook shows me kernel-error. It looks like JEG is not waiting for the pod to be set up.

After the image is pulled, my pod is ready, but I still have the kernel-error message and the kernel-dead icon.

kevin-bates commented 4 years ago

Hi @lucabem - I'm sorry for the frustration about kernel-launch-timeout. I'd be inclined to make it an EG-only value (rather than from the client), but the problem is that it needs to include the request timeout (e.g., KERNEL_LAUNCH_TIMEOUT <= EG_REQUEST_TIMEOUT), that MUST be set in the client to keep the connection open to EG long enough.

Your observation is correct, EG doesn't have any knowledge that it's dealing with k8s, docker, YARN. It just polls for the kernel's readiness. In addition, auto-scaling can be problematic (as discussed in the link you provided) and that's something the KernelImagePuller doesn't really address very well.
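A minimal sketch of that kind of polling, assuming a hypothetical `is_ready` callable in place of whatever readiness check the gateway performs. This is not EG's actual code; it only illustrates why a slow node creation or image pull surfaces as a launch timeout rather than as a platform-specific error:

```python
import time


def wait_until_ready(is_ready, timeout, interval=0.5):
    """Poll `is_ready()` until it returns True or `timeout` seconds pass.

    `is_ready` is a hypothetical stand-in for the gateway's readiness check.
    Returns the number of polls performed; raises TimeoutError on expiry.
    """
    start = time.monotonic()
    polls = 0
    while True:
        polls += 1
        if is_ready():
            return polls
        if time.monotonic() - start >= timeout:
            raise TimeoutError(
                f"Waited too long ({float(timeout)}s) to get connection file"
            )
        time.sleep(interval)


# A kernel that becomes ready on the third poll, well inside the timeout.
answers = iter([False, False, True])
polls_needed = wait_until_ready(lambda: next(answers), timeout=5, interval=0.01)
```

The loop has no idea what is slow on the other side, so the only lever it offers is the timeout value itself.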

The problem is that each time we want to spawn a notebook, we create a node.

Are you creating a new node per notebook server instance, or per notebook instance (i.e., per kernel)? In either case, may I ask why?

If the new node is hosting a notebook server, and then pods are launched from that, I would recommend you use JupyterHub to launch the notebook server since they have richer support for node management - although I'm not sure they address this particular use-case.

I think your only recourse is to make KERNEL_LAUNCH_TIMEOUT sufficiently long to include node creation, image pull, and kernel launch. One bright note is that it looks like we might have asynchronous kernel startups in the Jupyter stack soon (next few weeks), so these increased start times will not impact the next kernel start request.
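As a rough illustration of sizing that value, the sketch below sums the three phases named above and adds a safety margin. The function name, the margin, and all durations are hypothetical; real numbers would have to come from measuring the cluster:

```python
def launch_timeout_budget(node_creation_s, image_pull_s, kernel_start_s, margin_s=30):
    """Sum the startup phases (in seconds) and add a safety margin."""
    return node_creation_s + image_pull_s + kernel_start_s + margin_s


# Hypothetical phase durations, for illustration only.
budget = launch_timeout_budget(node_creation_s=120, image_pull_s=300, kernel_start_s=20)
print(f"KERNEL_LAUNCH_TIMEOUT={budget}")
```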

lucabem commented 4 years ago

Are you creating a new node per notebook server instance, or per notebook instance (i.e., per kernel)? In either case, may I ask why?

I'm spawning kernels into a Kubernetes cluster. With AWS Fargate, we create one node per kernel pod. Meanwhile, the Jupyter Notebook server is running outside the cluster.

The main reason for using AWS Fargate is cost. You pay only for the CPU you use, whereas with typical nodes you pay even when the CPU is idle.

It's quite strange, because it gives me kernel-error but my pod works correctly. I guess it's the fault of Jupyter Notebook and not of Enterprise Gateway.

kevin-bates commented 4 years ago

No - it's not notebook. I'm sure the kernel startup is timing out because the time to create the node, the pod AND download the image, is exceeding the timeout. At least that's what I surmise without having seen the error. Could you please provide the log output that includes the complete startup sequence (from the initial message to the error)?

Again, why are you creating an entire node that hosts a single kernel-pod? That seems extremely wasteful. Is that just how Fargate works??

Are there ways to pre-create nodes - and provide a list of images that exist when the pod creation request occurs?

lucabem commented 4 years ago

Is that just how Fargate works??

Yes, this is how it works. The benefit is that you just pay for resources you use

kevin-bates commented 4 years ago

So are all pods running in their own individual node?

lucabem commented 4 years ago

Yes

kevin-bates commented 4 years ago

Bummer. Can you please provide the complete set of log statements that correspond to the start-kernel request?

lucabem commented 4 years ago

The problem is that I don't have any errors in the enterprise-gateway logs. I have modified JupyterHub to allow KERNEL_LAUNCH_TIMEOUT = 600 so the pod can pull the image successfully.

While enterprise-gateway is waiting to connect to the pod (the pod is pulling the image, so it is in pending status), I get this error in the JupyterHub log:

[W 2020-03-19 13:34:17.467 SingleUserNotebookApp web:1786] 599 POST /user/admin/api/sessions (83.32.156.78): Error attempting to connect to Gateway server url 'http://internal-a563abfec69e211eaba0e06f3e7b0f60-1789890916.eu-west-1.elb.amazonaws.com:8888'.  Ensure gateway url is valid and the Gateway instance is running.
[W 2020-03-19 13:34:17.467 SingleUserNotebookApp handlers:608] Error attempting to connect to Gateway server url 'http://internal-a563abfec69e211eaba0e06f3e7b0f60-1789890916.eu-west-1.elb.amazonaws.com:8888'.  Ensure gateway url is valid and the Gateway instance is running.
[E 2020-03-19 13:34:17.468 SingleUserNotebookApp log:166] {
"X-Forwarded-Host": "ec2-18-202-30-78.eu-west-1.compute.amazonaws.com:8000",
"X-Forwarded-Proto": "http",
"X-Forwarded-Port": "8000",
"X-Forwarded-For": "83.32.156.78",
 "Cookie": "jupyterhub-user-admin=[secret]; _xsrf=[secret]; jupyterhub-session-id=[secret]",
 "Accept-Language": "es-ES,es;q=0.9,en;q=0.8",
 "Accept-Encoding": "gzip, deflate",
 "Referer": "http://ec2-18-202-30-78.eu-west-1.compute.amazonaws.com:8000/user/admin/notebooks/Untitled1.ipynb?kernel_name=prueba_luis",
 "Origin": "http://ec2-18-202-30-78.eu-west-1.compute.amazonaws.com:8000",
 "Content-Type": "application/json",
 "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.149 Safari/537.36",
 "X-Xsrftoken": "2|f446d7e4|32aec25ec8691989b66eea68773bd898|1583749916",
 "X-Requested-With": "XMLHttpRequest",
 "Accept": "application/json, text/javascript, */*; q=0.01",
 "Content-Length": "96",
"Connection": "close",
 "Host": "ec2-18-202-30-78.eu-west-1.compute.amazonaws.com:8000"
}

The beginning of the start request:

[D 2020-03-19 19:17:18.580 EnterpriseGatewayApp] RemoteMappingKernelManager.start_kernel: prueba_luis, kernel_username: jovyan
[D 2020-03-19 19:17:18.587 EnterpriseGatewayApp] Instantiating kernel 'Python desde el NFS' with process proxy: enterprise_gateway.services.processproxies.k8s.KubernetesProcessProxy
[D 2020-03-19 19:17:18.588 EnterpriseGatewayApp] Response socket launched on '10.98.3.182:42479' using 5.0s timeout
[D 2020-03-19 19:17:18.588 EnterpriseGatewayApp] Starting kernel: ['/opt/conda/bin/python', '/usr/local/share/jupyter/kernels/prueba_luis/scripts/launch_kubernetes.py', '--RemoteProcessProxy.kernel-id', '3a06c8c5-ada1-4ee2-8dc2-6ef7688ddb73', '--RemoteProcessProxy.response-address', '10.98.3.182:42479']
[D 2020-03-19 19:17:18.588 EnterpriseGatewayApp] Launching kernel: Python desde el NFS with command: ['/opt/conda/bin/python', '/usr/local/share/jupyter/kernels/prueba_luis/scripts/launch_kubernetes.py', '--RemoteProcessProxy.kernel-id', '3a06c8c5-ada1-4ee2-8dc2-6ef7688ddb73', '--RemoteProcessProxy.response-address', '10.98.3.182:42479']
[I 2020-03-19 19:17:18.588 EnterpriseGatewayApp] KERNEL_NAMESPACE provided by client: kernel-ns
[D 2020-03-19 19:17:18.588 EnterpriseGatewayApp] BaseProcessProxy.launch_process() env: {'LC_ALL': 'en_US.UTF-8', 'LANG': 'en_US.UTF-8', 'EG_SHARED_NAMESPACE': 'False', 'HOSTNAME': 'enterprise-gateway-b5f4f7bbc-cn827', 'EG_ENABLE_TUNNELING': 'False', 'KG_PORT_RETRIES': '0', 'NB_UID': '1000', 'EG_LOG_LEVEL': 'DEBUG', 'JAVA_HOME': '/usr/lib/jvm/java-8-openjdk-amd64', 'CONDA_DIR': '/opt/conda', 'ENTERPRISE_GATEWAY_PORT_8888_TCP_PORT': '8888', 'CONDA_VERSION': '4.7.12', 'SPARK_VER': '2.4.1', 'KUBERNETES_PORT_443_TCP_PROTO': 'tcp', 'KUBERNETES_PORT_443_TCP_ADDR': '172.20.0.1', 'EG_CULL_IDLE_TIMEOUT': '3600', 'ENTERPRISE_GATEWAY_SERVICE_HOST': '172.20.18.181', 'KUBERNETES_PORT': 'tcp://172.20.0.1:443', 'PWD': '/usr/local/bin', 'HOME': '/home/jovyan', 'MINICONDA_MD5': '81c773ff87af5cfac79ab862942ab6b3', 'ENTERPRISE_GATEWAY_PORT_8888_TCP_PROTO': 'tcp', 'EG_MIRROR_WORKING_DIRS': 'False', 'KUBERNETES_SERVICE_PORT_HTTPS': '443', 'DEBIAN_FRONTEND': 'noninteractive', 'KUBERNETES_PORT_443_TCP_PORT': '443', 'EG_KERNEL_LAUNCH_TIMEOUT': '600', 'EG_SSH_PORT': '2122', 'ENTERPRISE_GATEWAY_SERVICE_PORT_HTTP': '8888', 'EG_CULL_INTERVAL': '60', 'SPARK_HOME': '/opt/spark', 'NB_USER': 'jovyan', 'EG_KERNEL_WHITELIST': "['r_kubernetes','python_kubernetes','prueba_luis']", 'KUBERNETES_PORT_443_TCP': 'tcp://172.20.0.1:443', 'EG_CULL_CONNECTED': 'False', 'EG_PORT_RETRIES': '0', 'KG_PORT': '8888', 'ENTERPRISE_GATEWAY_SERVICE_PORT': '8888', 'SHELL': '/bin/bash', 'ENTERPRISE_GATEWAY_PORT': 'tcp://172.20.18.181:8888', 'ENTERPRISE_GATEWAY_PORT_8888_TCP': 'tcp://172.20.18.181:8888', 'EG_PORT': '8888', 'ENTERPRISE_GATEWAY_PORT_8888_TCP_ADDR': '172.20.18.181', 'SHLVL': '0', 'LANGUAGE': 'en_US.UTF-8', 'EG_KERNEL_CLUSTER_ROLE': 'kernel-controller', 'KUBERNETES_SERVICE_PORT': '443', 'EG_NAMESPACE': 'enterprise-gateway', 'NB_GID': '100', 'KG_IP': '0.0.0.0', 'PATH': '/opt/conda/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin', 'EG_IP': '0.0.0.0', 'KUBERNETES_SERVICE_HOST': '172.20.0.1', 
'MINICONDA_VERSION': '4.7.12.1', 'KERNEL_LAUNCH_TIMEOUT': '900', 'KERNEL_USERNAME': 'jovyan', 'KERNEL_NAMESPACE': 'kernel-ns', 'KERNEL_SERVICE_ACCOUNT_NAME': 'kernel-sa', 'KERNEL_GATEWAY': '1', 'KERNEL_POD_NAME': 'jovyan-3a06c8c5-ada1-4ee2-8dc2-6ef7688ddb73', 'KERNEL_IMAGE': 'elyra/kernel-py:2.1.0', 'KERNEL_EXECUTOR_IMAGE': 'elyra/kernel-py:2.1.0', 'KERNEL_UID': '1000', 'KERNEL_GID': '100', 'EG_MIN_PORT_RANGE_SIZE': '1000', 'EG_MAX_PORT_RANGE_RETRIES': '5', 'KERNEL_ID': '3a06c8c5-ada1-4ee2-8dc2-6ef7688ddb73', 'KERNEL_LANGUAGE': 'python', 'EG_IMPERSONATION_ENABLED': 'False'}
[I 2020-03-19 19:17:18.593 EnterpriseGatewayApp] KubernetesProcessProxy: kernel launched. Kernel image: elyra/kernel-py:2.1.0, KernelID: 3a06c8c5-ada1-4ee2-8dc2-6ef7688ddb73, cmd: '['/opt/conda/bin/python', '/usr/local/share/jupyter/kernels/prueba_luis/scripts/launch_kubernetes.py', '--RemoteProcessProxy.kernel-id', '3a06c8c5-ada1-4ee2-8dc2-6ef7688ddb73', '--RemoteProcessProxy.response-address', '10.98.3.182:42479']'
[D 2020-03-19 19:17:19.110 EnterpriseGatewayApp] 1: Waiting to connect to k8s pod in namespace 'kernel-ns'. Name: '', Status: 'None', Pod IP: 'None', KernelID: '3a06c8c5-ada1-4ee2-8dc2-6ef7688ddb73'
{'apiVersion': 'v1', 'kind': 'Pod', 'metadata': {'name': 'jovyan-3a06c8c5-ada1-4ee2-8dc2-6ef7688ddb73', 'namespace': 'kernel-ns', 'labels': {'kernel_id': '3a06c8c5-ada1-4ee2-8dc2-6ef7688ddb73', 'app': 'enterprise-gateway', 'component': 'kernel'}}, 'spec': {'restartPolicy': 'Never', 'serviceAccountName': 'kernel-sa', 'securityContext': {'runAsUser': 1000, 'runAsGroup': 100, 'fsGroup': 100}, 'containers': [{'env': [{'name': 'EG_RESPONSE_ADDRESS', 'value': '10.98.3.182:42479'}, {'name': 'KERNEL_LANGUAGE', 'value': 'python'}, {'name': 'KERNEL_SPARK_CONTEXT_INIT_MODE', 'value': 'none'}, {'name': 'KERNEL_NAME', 'value': 'prueba_luis'}, {'name': 'KERNEL_USERNAME', 'value': 'jovyan'}, {'name': 'KERNEL_ID', 'value': '3a06c8c5-ada1-4ee2-8dc2-6ef7688ddb73'}, {'name': 'KERNEL_NAMESPACE', 'value': 'kernel-ns'}, {'name': 'KERNEL_LAUNCH_TIMEOUT', 'value': '900'}], 'image': 'elyra/kernel-py:2.1.0', 'name': 'jovyan-3a06c8c5-ada1-4ee2-8dc2-6ef7688ddb73', 'volumeMounts': None}], 'volumes': None}}

That's the JEG log:

[D 2020-03-19 19:24:59.361 EnterpriseGatewayApp] Decrypted Payload '{"shell_port": 45899, "iopub_port": 46317, "stdin_port": 45911, "control_port": 58763, "hb_port": 46447, "ip": "0.0.0.0", "key": "4d419653-ae71-44c7-85ca-1605ab188297", "transport": "tcp", "signature_scheme": "hmac-sha256", "kernel_name": "", "pid": "9", "pgid": "6", "comm_port": 44735}'
[D 2020-03-19 19:24:59.361 EnterpriseGatewayApp] Connect Info received from the launcher is as follows '{'shell_port': 45899, 'iopub_port': 46317, 'stdin_port': 45911, 'control_port': 58763, 'hb_port': 46447, 'ip': '0.0.0.0', 'key': '4d419653-ae71-44c7-85ca-1605ab188297', 'transport': 'tcp', 'signature_scheme': 'hmac-sha256', 'kernel_name': '', 'pid': '9', 'pgid': '6', 'comm_port': 44735}'
[D 2020-03-19 19:24:59.361 EnterpriseGatewayApp] Host assigned to the Kernel is: 'jovyan-3a06c8c5-ada1-4ee2-8dc2-6ef7688ddb73' '10.98.3.99'
[D 2020-03-19 19:24:59.361 EnterpriseGatewayApp] Established gateway communication to: 10.98.3.99:44735 for KernelID '3a06c8c5-ada1-4ee2-8dc2-6ef7688ddb73'
[D 2020-03-19 19:24:59.361 EnterpriseGatewayApp] Updated pid to: 9
[D 2020-03-19 19:24:59.361 EnterpriseGatewayApp] Updated pgid to: 6
[D 2020-03-19 19:24:59.363 EnterpriseGatewayApp] Received connection info for KernelID '3a06c8c5-ada1-4ee2-8dc2-6ef7688ddb73' from host 'jovyan-3a06c8c5-ada1-4ee2-8dc2-6ef7688ddb73': {'shell_port': 45899, 'iopub_port': 46317, 'stdin_port': 45911, 'control_port': 58763, 'hb_port': 46447, 'ip': '10.98.3.99', 'key': '4d419653-ae71-44c7-85ca-1605ab188297', 'transport': 'tcp', 'signature_scheme': 'hmac-sha256', 'kernel_name': '', 'comm_port': 44735}...
[D 2020-03-19 19:24:59.364 EnterpriseGatewayApp] Connecting to: tcp://10.98.3.99:58763
[D 2020-03-19 19:24:59.366 EnterpriseGatewayApp] Connecting to: tcp://10.98.3.99:46317
[I 2020-03-19 19:24:59.367 EnterpriseGatewayApp] Kernel started: 3a06c8c5-ada1-4ee2-8dc2-6ef7688ddb73
[D 2020-03-19 19:24:59.368 EnterpriseGatewayApp] Kernel args: {'env': {'PATH': '/opt/conda/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin', 'KERNEL_LAUNCH_TIMEOUT': '900', 'KERNEL_WORKING_DIR': '/home/nuevo', 'KERNEL_USERNAME': 'jovyan'}, 'kernel_name': 'prueba_luis'}

kevin-bates commented 4 years ago

Thanks for the additional information along with the EG output - that's helpful. So you wind up with a running pod (node) for that kernel, and EG knows about it, but the notebook doesn't because it never completed the request. Is that correct?

A few more questions...

What version of notebook are you using? Are you using --gateway-url or are you going through the NB2KG server extension to hit the EG server?

lucabem commented 4 years ago

but the notebook doesn't because it never completed the request. Is that correct?

Yes, it looks like that. It seems that the notebook doesn't wait for the kernel to be set up.

Are you using --gateway-url or are you going through the NB2KG server extension to hit the EG server?

I'm using --gateway-url.

What version of notebook are you using?

6.0.0

kevin-bates commented 4 years ago

ok - thanks. So the request timeout should be getting set to 600+2, but the connect_timeout value is still probably getting defaulted to 60. I think it might be worth setting env JUPYTER_GATEWAY_CONNECT_TIMEOUT to 602 as well and see if the connection between notebook and EG stays up.

Can you try to determine - perhaps by enabling --debug on your notebook instances - about how long the connection stays up now? If so, I'd be curious if it's about 60 seconds.

This option can also be set via --GatewayClient.connect_timeout=602 if that's easier.
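For anyone scripting the notebook launch, a small sketch of building that environment is shown here. The helper name is hypothetical, and the 2-second pad simply mirrors the "600+2" mentioned above rather than a documented constant; the variable names are the ones used in this thread:

```python
import os


def gateway_client_env(kernel_launch_timeout, pad=2, base_env=None):
    """Return a copy of the environment with the client-side timeouts set.

    Both HTTP timeouts are set to the kernel launch timeout plus `pad`
    (an assumption taken from the "600+2" figure in this thread).
    """
    env = dict(os.environ if base_env is None else base_env)
    client_timeout = str(kernel_launch_timeout + pad)
    env["KERNEL_LAUNCH_TIMEOUT"] = str(kernel_launch_timeout)
    env["JUPYTER_GATEWAY_REQUEST_TIMEOUT"] = client_timeout
    env["JUPYTER_GATEWAY_CONNECT_TIMEOUT"] = client_timeout
    return env


env = gateway_client_env(600, base_env={})
```

The resulting dictionary can be passed as the `env` argument of `subprocess.Popen` when starting the notebook server.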

lucabem commented 4 years ago

Yes, it was around 60 seconds. In addition, I'm working with JupyterHub.

kevin-bates commented 4 years ago

Cool - there's hope then. I don't think it matters that you're using Hub other than a possible inconvenience to set options (which is why I added the comment about using the command line option). However, I should have stated it this way: c.GatewayClient.connect_timeout=602 since you're likely using a file-based approach with Hub.

lucabem commented 4 years ago

Yep, I have tested that variable, but it doesn't seem to work. I don't know why Jupyter Notebook doesn't override the default value.

kevin-bates commented 4 years ago

Setting these can be tricky. If possible, it might be worth trying to either temporarily modify the default value (to see if we're barking up the right tree with this), or add a log statement just prior to the fetch call that dumps kwargs (to better determine what might be going on).
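One way such a log statement could look is sketched below. The logger name, the helper names, and the set of keys treated as sensitive are all assumptions; the point is only to dump the keyword arguments just before the fetch call without leaking credentials into the log:

```python
import logging

logger = logging.getLogger("gateway_debug")  # hypothetical logger name

# Keys whose values should not end up in a log file (an assumed list).
SENSITIVE_KEYS = {"headers", "auth_username", "auth_password"}


def describe_fetch_kwargs(kwargs):
    """Return a loggable copy of the fetch kwargs with sensitive values masked."""
    return {k: ("<redacted>" if k in SENSITIVE_KEYS else v) for k, v in kwargs.items()}


def log_fetch_kwargs(kwargs):
    """Log the kwargs as they stand immediately before the fetch call."""
    safe = describe_fetch_kwargs(kwargs)
    logger.warning("gateway fetch kwargs: %s", safe)
    return safe


example = log_fetch_kwargs(
    {
        "method": "POST",
        "connect_timeout": 540.0,
        "request_timeout": 540.0,
        "headers": {"Authorization": "token "},
    }
)
```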

Sometimes, I'll try setting these to stupid short values, like 1 or 2, expecting the request to fail in that amount of time. If the request continues for the default time, then I know my setting didn't work.
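That trick can be sketched as a small helper that compares how long a request took to fail with the value that was configured and with the default. The function names, the tolerance, and the sample observation are hypothetical:

```python
import time


def classify_elapsed(elapsed_s, configured_s, default_s=60.0, tolerance_s=1.0):
    """Decide which timeout most plausibly fired, given the observed failure time."""
    if abs(elapsed_s - configured_s) <= tolerance_s:
        return "configured value took effect"
    if abs(elapsed_s - default_s) <= tolerance_s:
        return "default still in force"
    return "inconclusive"


def time_failure(request):
    """Run `request()` (expected to raise) and return seconds until it failed."""
    start = time.monotonic()
    try:
        request()
    except Exception:
        pass
    return time.monotonic() - start


# Hypothetical observation: timeout set to 2 s, but the request died after ~60 s.
verdict = classify_elapsed(elapsed_s=59.7, configured_s=2.0)
```

A result of "default still in force" means the setting never reached the client, which is a different problem from the value being too small.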

lucabem commented 4 years ago

Thanks for the advice! Tomorrow I will try to modify the notebook package for debugging. In addition, I will study the notebook's documentation. The solution should be there, I guess.

add a log statement just prior to the fetch call that dumps kwargs

Do you mean to print (**kwargs)?

kevin-bates commented 4 years ago

Yes, print or log the kwargs. I'm not sure how much the notebooks docs will help, but they do list all the config options. Please note that in their docs, the data type of the config option is merged (appended) to the name of the option. You can get clearer text by running jupyter notebook --help-all. Here's the section for GatewayClient:

GatewayClient options
---------------------
--GatewayClient.auth_token=<Unicode>
    Default: None
    The authorization token used in the HTTP headers.
    (JUPYTER_GATEWAY_AUTH_TOKEN env var)
--GatewayClient.ca_certs=<Unicode>
    Default: None
    The filename of CA certificates or None to use defaults.
    (JUPYTER_GATEWAY_CA_CERTS env var)
--GatewayClient.client_cert=<Unicode>
    Default: None
    The filename for client SSL certificate, if any.
    (JUPYTER_GATEWAY_CLIENT_CERT env var)
--GatewayClient.client_key=<Unicode>
    Default: None
    The filename for client SSL key, if any.  (JUPYTER_GATEWAY_CLIENT_KEY env
    var)
--GatewayClient.connect_timeout=<Float>
    Default: 60.0
    The time allowed for HTTP connection establishment with the Gateway server.
    (JUPYTER_GATEWAY_CONNECT_TIMEOUT env var)
--GatewayClient.env_whitelist=<Unicode>
    Default: ''
    A comma-separated list of environment variable names that will be included,
    along with their values, in the kernel startup request.  The corresponding
    `env_whitelist` configuration value must also be set on the Gateway server -
    since that configuration value indicates which environmental values to make
    available to the kernel. (JUPYTER_GATEWAY_ENV_WHITELIST env var)
--GatewayClient.headers=<Unicode>
    Default: '{}'
    Additional HTTP headers to pass on the request.  This value will be
    converted to a dict. (JUPYTER_GATEWAY_HEADERS env var)
--GatewayClient.http_pwd=<Unicode>
    Default: None
    The password for HTTP authentication.  (JUPYTER_GATEWAY_HTTP_PWD env var)
--GatewayClient.http_user=<Unicode>
    Default: None
    The username for HTTP authentication. (JUPYTER_GATEWAY_HTTP_USER env var)
--GatewayClient.kernels_endpoint=<Unicode>
    Default: '/api/kernels'
    The gateway API endpoint for accessing kernel resources
    (JUPYTER_GATEWAY_KERNELS_ENDPOINT env var)
--GatewayClient.kernelspecs_endpoint=<Unicode>
    Default: '/api/kernelspecs'
    The gateway API endpoint for accessing kernelspecs
    (JUPYTER_GATEWAY_KERNELSPECS_ENDPOINT env var)
--GatewayClient.kernelspecs_resource_endpoint=<Unicode>
    Default: '/kernelspecs'
    The gateway endpoint for accessing kernelspecs resources
    (JUPYTER_GATEWAY_KERNELSPECS_RESOURCE_ENDPOINT env var)
--GatewayClient.request_timeout=<Float>
    Default: 60.0
    The time allowed for HTTP request completion.
    (JUPYTER_GATEWAY_REQUEST_TIMEOUT env var)
--GatewayClient.url=<Unicode>
    Default: None
    The url of the Kernel or Enterprise Gateway server where kernel
    specifications are defined and kernel management takes place. If defined,
    this Notebook server acts as a proxy for all kernel management and kernel
    specification retrieval.  (JUPYTER_GATEWAY_URL env var)
--GatewayClient.validate_cert=<Bool>
    Default: True
    For HTTPS requests, determines if server's certificate should be validated
    or not. (JUPYTER_GATEWAY_VALIDATE_CERT env var)
--GatewayClient.ws_url=<Unicode>
    Default: None
    The websocket url of the Kernel or Enterprise Gateway server.  If not
    provided, this value will correspond to the value of the Gateway url with
    'ws' in place of 'http'.  (JUPYTER_GATEWAY_WS_URL env var)

Good luck! 🤞

lucabem commented 4 years ago

I set the KERNEL_LAUNCH_TIMEOUT env value to 5 seconds and it crashes after those 5 seconds, but when I give it more than 60 seconds, it crashes at second 60.

For example, using this command:

export KERNEL_LAUNCH_TIMEOUT=70

jupyter notebook --gateway-url='http://<jeg-url>:8888' --debug --no-browser --GatewayClient.connect_timeout=540

I get this error. There seems to be a maximum of 60 seconds: if I use 70 seconds, for example, it crashes at 60 seconds. Yet in the HTTP error, the Connection header has the value keep-alive.

[D 14:05:28.526 NotebookApp] Request new kernel at: http://internal-a3488b5c66aaf11ea97be02a9c06a8b2-909892592.eu-west-1.elb.amazonaws.com:8888/api/kernels
WARNING:root:----------------------------------------
WARNING:root:----------------------------------------
WARNING:root:{'method': 'POST', 'body': '{"name": "prueba_luis", "env": {"KERNEL_LAUNCH_TIMEOUT": "70", "KERNEL_WORKING_DIR": "/home/centos"}}', 'headers': {'Authorization': 'token '}, 'connect_timeout': 540.0, 'request_timeout': 540.0, 'validate_cert': True}
WARNING:root:----------------------------------------
WARNING:root:----------------------------------------
[D 14:05:28.534 NotebookApp] 200 GET /api/contents/Untitled13.ipynb/checkpoints?_=1584713135670 (::1) 1.09ms
[D 14:05:28.558 NotebookApp] Path components/MathJax/extensions/Safe.js served from /usr/local/lib/python3.6/site-packages/notebook/static/components/MathJax/extensions/Safe.js
[D 14:05:28.558 NotebookApp] 304 GET /static/components/MathJax/extensions/Safe.js?V=2.7.7 (::1) 1.19ms
[W 14:06:28.207 NotebookApp] 599 POST /api/sessions (::1): Error attempting to connect to Gateway server url 'http://internal-a3488b5c66aaf11ea97be02a9c06a8b2-909892592.eu-west-1.elb.amazonaws.com:8888'.  Ensure gateway url is valid and the Gateway instance is running.
[W 14:06:28.207 NotebookApp] Error attempting to connect to Gateway server url 'http://internal-a3488b5c66aaf11ea97be02a9c06a8b2-909892592.eu-west-1.elb.amazonaws.com:8888'.  Ensure gateway url is valid and the Gateway instance is running.
[E 14:06:28.208 NotebookApp] {
      "Host": "localhost:8888",
      "Connection": "keep-alive",
      "Content-Length": "97",
      "Accept": "application/json, text/javascript, */*; q=0.01",
      "Sec-Fetch-Dest": "empty",
      "X-Requested-With": "XMLHttpRequest",
      "X-Xsrftoken": "2|c2d1d8ec|c9c9c61be65826eaadf57dc4ab043eff|1584707962",
      "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.149 Safari/537.36",
      "Content-Type": "application/json",
      "Origin": "http://localhost:8888",
      "Sec-Fetch-Site": "same-origin",
      "Sec-Fetch-Mode": "cors",
      "Referer": "http://localhost:8888/notebooks/Untitled13.ipynb?kernel_name=prueba_luis",
      "Accept-Encoding": "gzip, deflate, br",
      "Accept-Language": "es-ES,es;q=0.9,en;q=0.8",
      "Cookie": "username-localhost-8888=\"2|1:0|10:1584706769|23:username-localhost-8888|44:ZjQ2NzczZDZhYzE2NDNlN2E1OTNiMmJhMTdmMzhmNTk=|493f27e4da9b19f6ca7a4927811fb684fb0d636f762e303530bb828ef2a627ad\"; _xsrf=2|c2d1d8ec|c9c9c61be65826eaadf57dc4ab043eff|1584707962"
    }

kevin-bates commented 4 years ago

Yes, we know KERNEL_LAUNCH_TIMEOUT is being set correctly - that's not the issue. The issue is with the connect timeout option. It defaults to 60 seconds. That's the one that should be set very low (while the others remain high) to see if your new value is taking effect.

KERNEL_LAUNCH_TIMEOUT should be viewed as an EG-side value. We know that it is working because you didn't get a timeout in the 7-8 MINUTES it took to start the kernel from EG - despite the front-end timing out. The request and connect timeout values are Client-side values, in that they affect the actual connection between the client (notebook) and EG.
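The 7-8 minute figure can be checked against the EG log posted earlier in this thread by subtracting the timestamp of the `RemoteMappingKernelManager.start_kernel` line from that of the `Kernel started` line. A short sketch (the helper name is hypothetical; the two timestamps are copied from that log):

```python
from datetime import datetime

LOG_TIME_FORMAT = "%Y-%m-%d %H:%M:%S.%f"


def elapsed_seconds(start, end):
    """Seconds between two EG log timestamps."""
    delta = datetime.strptime(end, LOG_TIME_FORMAT) - datetime.strptime(start, LOG_TIME_FORMAT)
    return delta.total_seconds()


# First and last timestamps of the start request in the EG log.
startup = elapsed_seconds("2020-03-19 19:17:18.580", "2020-03-19 19:24:59.367")
print(f"{startup:.1f} s, about {startup / 60:.1f} min")
```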

However, even if we find a way to extend the connection timeout correctly, I don't think AWS Fargate is a viable environment for kernels on k8s simply due to the startup times. You cannot ask anyone to wait 8 minutes when launching a notebook (starting a kernel). That just isn't acceptable. I will try to read up on Fargate, but if they are truly creating blank nodes per pod, I would imagine all applications have this issue. Only those applications that can "preload" pods would have any chance of working.

If we can't figure a solution to the 8-minute startup issue (even after solving the connection issue) and if you must use Fargate, I would recommend you use JupyterHub and use kernels local to the Notebook pod instance. Then your 8-minute startup issue moves to start the Notebook server but not each kernel - which is more tolerable (although also obnoxious).

lucabem commented 4 years ago

Yes, I understand your concern.

My idea is to create a hybrid platform, where kernel pods normally run on EC2 instances (worker nodes) and Fargate is used for exceptional cases that require more power.

With this solution we would reduce costs, since we would not need to keep a resource-heavy instance (worker node) running at all times (each instance is billed per hour, while on Fargate you pay for the CPU you use).

kevin-bates commented 4 years ago

ok - interesting - thanks for the explanation. So for those exceptional cases, an 8 minute start time is part of the "cost" of using expensive resources?

Let's continue to figure out where the timeout is getting imposed and working around that.

lucabem commented 4 years ago

Hi @kevin-bates!

Could it be that JEG is blocked during pod creation, and that's why we are getting a connection timeout from Notebook?

I mean, while the pod is pulling the image, JEG cannot respond to notebook requests.

kevin-bates commented 4 years ago

This kind of delay is no different than the delay imposed waiting for the kernel's startup in YARN or regular k8s. It's just that it takes a couple of orders of magnitude longer when image pulling is in play.

Once the async changes are released down the stack (which should be soon), JEG will be able to respond to other start requests, but I don't think the current behavior is causing issues within the single start request (other than it taking too long). At least we must assume that's the case until we determine where the connect timeout is coming from. Have you had a chance to set JUPYTER_GATEWAY_CONNECT_TIMEOUT=600 (or its equivalent config option)?

lucabem commented 4 years ago

This combination of values works:

We just need to define the environment variable KERNEL_LAUNCH_TIMEOUT, because there is no way to set it via jupyter notebook --options.

I have tested it on my local cluster after removing the kernel-py image.

kevin-bates commented 4 years ago

That's great news.

Sorry about the need to set KERNEL_LAUNCH_TIMEOUT. We ensure that the request timeout is at least the value of KERNEL_LAUNCH_TIMEOUT, but I think it would be good to add some logic that when request_timeout is greater than KLT, we set KERNEL_LAUNCH_TIMEOUT to the request_timeout less the pad. That way, you could set all three via options and, in this case, KLT would become 598 (since the pad is 2 seconds).
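A sketch of that reconciliation logic, under the assumption that it would be a single function taking both values; the function name is hypothetical and this is not the code in notebook or EG:

```python
def reconcile_timeouts(request_timeout, kernel_launch_timeout, pad=2):
    """Return (request_timeout, kernel_launch_timeout) after reconciliation.

    - If the request timeout is too short, raise it to KLT + pad
      (the behavior described above as already in place).
    - Otherwise raise KLT to the request timeout less the pad
      (the addition being proposed).
    """
    if request_timeout < kernel_launch_timeout + pad:
        request_timeout = kernel_launch_timeout + pad
    else:
        kernel_launch_timeout = request_timeout - pad
    return request_timeout, kernel_launch_timeout


# Request timeout of 600 set via options, KLT left at a 40-second default.
print(reconcile_timeouts(600, 40))
# KLT of 600 set via the environment, request timeout left at 60.
print(reconcile_timeouts(60, 600))
```

With this in place, setting only the request timeout would be enough, since KLT would follow it.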

It's still unfortunate that startup times on Fargate will always incur the cost of pulling the kernel image (and starting a new node)! As a result, I don't think we can claim support for Fargate.

lucabem commented 4 years ago

I will close this issue. Regarding Fargate, the image pull times are not as high as we saw in the logs (9 minutes). After increasing the pod's resources, they dropped to 2 minutes or less.

Also, for anyone reading this issue: if you are using the LoadBalancer type in AWS, you have to change the load balancer's connection timeout (the default is 60 seconds).
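The reason the failures kept landing at 60 seconds can be sketched as follows: a request passes through several layers, each with its own timeout, and the shortest one wins. The layer labels are informal; the values are the ones from the local test run earlier in this thread plus the load balancer default:

```python
def first_timeout_to_fire(layers):
    """Return (name, seconds) of the shortest timeout in `layers`."""
    name = min(layers, key=layers.get)
    return name, layers[name]


layers = {
    "notebook connect_timeout": 540.0,
    "notebook request_timeout": 540.0,
    "EG KERNEL_LAUNCH_TIMEOUT": 70.0,
    "load balancer connection timeout": 60.0,
}
culprit = first_timeout_to_fire(layers)
print(culprit)
```

Raising the client and EG values has no visible effect until the load balancer's value is raised as well.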