jupyter-server / enterprise_gateway

A lightweight, multi-tenant, scalable and secure gateway that enables Jupyter Notebooks to share resources across distributed clusters such as Apache Spark, Kubernetes and others.
https://jupyter-enterprise-gateway.readthedocs.io/en/latest/
Other
621 stars 221 forks source link

504 gateway timeout when using JupyterHub #1038

Closed david1155 closed 2 years ago

david1155 commented 2 years ago

Description

Hey, We use Jupyter enterprise-gateway in docker. It takes about 3 minutes to provision kernel via enterprise-gateway in YARN cluster mode when I start notebook manually by docker run ... jupyter lab.

But when notebook docker "jupyter labhub" is spawned by JupyterHub and then YARN kernel is executed, it timeouts exactly in 60 seconds. I have tried to set the following config in JupyterHub but still experience 60 seconds timeouts with error "504 gateway timeout":

c.JupyterHub.service_check_interval = 180

c.Spawner.args = ['--GatewayClient.connect_timeout=900','--GatewayClient.request_timeout=900','--AsyncMappingKernelManager.kernel_info_timeout=900',]

c.JupyterHub.tornado_settings = {
'slow_spawn_timeout': 0,
'connect_timeout': 0,
'request_timeout': 0,
'raise_error': False,
}

c.Spawner.start_timeout = 900
c.JupyterHub.service_check_interval = 900

c.Spawner.environment = {'KERNEL_LAUNCH_TIMEOUT': 900,'EG_KERNEL_LAUNCH_TIMEOUT': 900, 'KG_REQUEST_TIMEOUT': 900, 'EG_CULL_IDLE_TIMEOUT': 900, 'EG_CULL_INTERVAL': 900, 'EG_CULL_CONNECTED': False, 'KG_CONNECT_TIMEOUT': 900, 'KG_REQUEST_TIMEOUT': 900}

It seems that tornado's AsyncHTTPClient is causing timeout. That's why I set _c.JupyterHub.tornadosettings.

Screenshots / Logs

Logs of the EG:

2022-02-07 04:59:19 INFO  ResourceUtils:419 - Unable to find 'resource-types.xml'.
2022-02-07 04:59:19 INFO  Client:57 - Verifying our application has not requested more than the maximum memory capability of the cluster (65536 MB per container)
2022-02-07 04:59:19 INFO  Client:57 - Will allocate AM container, with 1408 MB memory including 384 MB overhead
2022-02-07 04:59:19 INFO  Client:57 - Setting up container launch context for our AM
2022-02-07 04:59:19 INFO  Client:57 - Setting up the launch environment for our AM container
[D 2022-02-07 04:59:19.741 EnterpriseGatewayApp] ApplicationID not yet assigned for KernelID: '6f66fa78-ef9c-4907-a80e-7255d0a82d18' - retrying...
2022-02-07 04:59:19 INFO  Client:57 - Preparing resources for our AM container
2022-02-07 04:59:19 INFO  Client:57 - Source and destination file systems are the same. Not copying hdfs://10.1.1.1:8020/user/jovyan/spark-libs.jar
2022-02-07 04:59:19 INFO  Client:57 - Source and destination file systems are the same. Not copying hdfs://10.1.1.1:8020/user/jovyan/venv.tar.gz#venv
2022-02-07 04:59:19 INFO  Client:57 - Source and destination file systems are the same. Not copying hdfs://10.1.1.11:8020/user/jovyan/pyspark.zip
2022-02-07 04:59:19 INFO  Client:57 - Source and destination file systems are the same. Not copying hdfs://10.1.1.1:8020/user/jovyan/py4j-0.10.7-src.zip
2022-02-07 04:59:19 INFO  Client:57 - Uploading resource file:/usr/local/share/jupyter/kernels/spark_python_yarn_cluster/scripts/launch_ipykernel.py -> hdfs://10.1.1.1:8020/user/jovyan/.sparkStaging/application_1634600395396_88515/launch_ipykernel.py
[D 2022-02-07 04:59:20.266 EnterpriseGatewayApp] ApplicationID not yet assigned for KernelID: '6f66fa78-ef9c-4907-a80e-7255d0a82d18' - retrying...
2022-02-07 04:59:20 WARN  Client:69 - Same name resource file:/usr/local/spark2.4.0-cdh6.3.2/python/lib/pyspark.zip added multiple times to distributed cache
2022-02-07 04:59:20 WARN  Client:69 - Same name resource file:/usr/local/spark2.4.0-cdh6.3.2/python/lib/py4j-0.10.7-src.zip added multiple times to distributed cache
2022-02-07 04:59:20 INFO  Client:57 - Uploading resource file:/tmp/spark-4313e8ca-cce4-4393-b303-fb93b3b6734b/__spark_conf__4530428153818442561.zip -> hdfs://10.1.1.1:8020/user/jovyan/.sparkStaging/application_1634600395396_88515/__spark_conf__.zip
2022-02-07 04:59:20 INFO  SecurityManager:57 - Changing view acls to: jovyan
2022-02-07 04:59:20 INFO  SecurityManager:57 - Changing modify acls to: jovyan
2022-02-07 04:59:20 INFO  SecurityManager:57 - Changing view acls groups to: 
2022-02-07 04:59:20 INFO  SecurityManager:57 - Changing modify acls groups to: 
2022-02-07 04:59:20 INFO  SecurityManager:57 - SecurityManager: authentication disabled; ui acls disabled; users  with view permissions: Set(jovyan); groups with view permissions: Set(); users  with modify permissions: Set(jovyan); groups with modify permissions: Set()
2022-02-07 04:59:20 INFO  HiveConf:188 - Found configuration file null
[D 2022-02-07 04:59:20.810 EnterpriseGatewayApp] ApplicationID not yet assigned for KernelID: '6f66fa78-ef9c-4907-a80e-7255d0a82d18' - retrying...
2022-02-07 04:59:20 INFO  YARNHadoopDelegationTokenManager:57 - Attempting to load user's ticket cache.
2022-02-07 04:59:20 INFO  Client:57 - Submitting application application_1634600395396_88515 to ResourceManager
2022-02-07 04:59:21 INFO  YarnClientImpl:310 - Submitted application application_1634600395396_88515
2022-02-07 04:59:21 INFO  Client:57 - Application report for application_1634600395396_88515 (state: ACCEPTED)
2022-02-07 04:59:21 INFO  Client:57 - 
     client token: N/A
     diagnostics: N/A
     ApplicationMaster host: N/A
     ApplicationMaster RPC port: -1
     queue: --redacted--
     start time: 1644209960951
     final status: UNDEFINED
     tracking URL: http://--redacted--:8088/proxy/application_1634600395396_88515/
     user: jovyan
2022-02-07 04:59:21 INFO  ShutdownHookManager:57 - Shutdown hook called
2022-02-07 04:59:21 INFO  ShutdownHookManager:57 - Deleting directory /tmp/spark-f16a364c-47c1-44c8-abbc-59227f24db34
2022-02-07 04:59:21 INFO  ShutdownHookManager:57 - Deleting directory /tmp/spark-4313e8ca-cce4-4393-b303-fb93b3b6734b
[I 2022-02-07 04:59:21.336 EnterpriseGatewayApp] ApplicationID: 'application_1634600395396_88515' assigned for KernelID: '6f66fa78-ef9c-4907-a80e-7255d0a82d18', state: ACCEPTED, 5.0 seconds after starting.
[D 2022-02-07 04:59:21.341 EnterpriseGatewayApp] 9: State: 'ACCEPTED', Host: '', KernelID: '6f66fa78-ef9c-4907-a80e-7255d0a82d18', ApplicationID: 'application_1634600395396_88515'
......
[D 2022-02-07 05:03:58.573 EnterpriseGatewayApp] Waiting for KernelID '6f66fa78-ef9c-4907-a80e-7255d0a82d18' to send connection info from host '--redacted--' - retrying...
[D 2022-02-07 05:03:59.006 EnterpriseGatewayApp] Received payload '7H87lIeQXuMFj9HLCHA7j5glAdqBWQj4uCEfndsn5pK344AEcfSm5f+uJmAkhLk+t2VTKy+ZK4h2agGl3cF3ftJSCMth1iOwXWI9p0ikprlum+WtsealF6gLk8tFWhIlWUDH522nD6ihmeSqqtEwKhlo6D3+/SzT3d9+uKtaRb5cOu9uduPzMF9YWtlqAdRL84cxDkU8UJuk5dSYoxHhlnvbQOOT76uuWhpL8CtJu1ULQ4Lvg3onNYnktOk4FJMLL56eCNaq8d0MDXiWqOTLypZiBZzvmll6HJnKjTBZXDtinkybiQuTS6n1zUXv3kcqhyYXG/NOoBLT9DeiMaUqmBvOCEVD/oEPcgRyKe92UszfLBu0VSn/Il2RPyc5u/0ceb8u+1StBfE6LHVn/c0+jA=='
[D 2022-02-07 05:03:59.007 EnterpriseGatewayApp] decode_payload exception - UnicodeDecodeError: 'utf-8' codec can't decode byte 0xec in position 0: invalid continuation byte
[W 2022-02-07 05:03:59.015 EnterpriseGatewayApp] WARNING!!!! Legacy kernel response received for kernel_id '6f66fa78-ef9c-4907-a80e-7255d0a82d18'! Update kernel launchers to current version!
[D 2022-02-07 05:03:59.015 EnterpriseGatewayApp] Decrypted payload '{'shell_port': 58817, 'iopub_port': 42710, 'stdin_port': 46917, 'control_port': 35796, 'hb_port': 37879, 'ip': '0.0.0.0', 'key': 'fc9a5d8e-f941-46eb-bd23-e59e88de2005', 'transport': 'tcp', 'signature_scheme': 'hmac-sha256', 'kernel_name': '', 'pid': '42210', 'pgid': '41914', 'comm_port': 34291, 'kernel_id': '6f66fa78-ef9c-4907-a80e-7255d0a82d18'}'
[D 2022-02-07 05:03:59.015 EnterpriseGatewayApp] Connection info received for kernel '6f66fa78-ef9c-4907-a80e-7255d0a82d18': {'shell_port': 58817, 'iopub_port': 42710, 'stdin_port': 46917, 'control_port': 35796, 'hb_port': 37879, 'ip': '0.0.0.0', 'key': 'fc9a5d8e-f941-46eb-bd23-e59e88de2005', 'transport': 'tcp', 'signature_scheme': 'hmac-sha256', 'kernel_name': '', 'pid': '42210', 'pgid': '41914', 'comm_port': 34291, 'kernel_id': '6f66fa78-ef9c-4907-a80e-7255d0a82d18'}
[D 2022-02-07 05:03:59.090 EnterpriseGatewayApp] 526: State: 'ACCEPTED', Host: '--redacted--', KernelID: '6f66fa78-ef9c-4907-a80e-7255d0a82d18', ApplicationID: 'application_1634600395396_88515'
[D 2022-02-07 05:03:59.100 EnterpriseGatewayApp] Host assigned to the kernel is: '--redacted--' '--redacted--'
[D 2022-02-07 05:03:59.100 EnterpriseGatewayApp] Established gateway communication to: --redated--:34291 for KernelID '6f66fa78-ef9c-4907-a80e-7255d0a82d18'
[D 2022-02-07 05:03:59.104 EnterpriseGatewayApp] Received connection info for KernelID '6f66fa78-ef9c-4907-a80e-7255d0a82d18' from host '--redacted--': {'shell_port': 58817, 'iopub_port': 42710, 'stdin_port': 46917, 'control_port': 35796, 'hb_port': 37879, 'ip': '--redacted--', 'key': 'fc9a5d8e-f941-46eb-bd23-e59e88de2005', 'transport': 'tcp', 'signature_scheme': 'hmac-sha256', 'kernel_name': '', 'comm_port': 34291, 'kernel_id': '6f66fa78-ef9c-4907-a80e-7255d0a82d18'}...
[D 2022-02-07 05:03:59.106 EnterpriseGatewayApp] Connecting to: tcp://--redacted--:35796
[D 2022-02-07 05:03:59.113 EnterpriseGatewayApp] Connecting to: tcp://--redacted--:42710
[I 2022-02-07 05:03:59.121 EnterpriseGatewayApp] Kernel started: 6f66fa78-ef9c-4907-a80e-7255d0a82d18
[D 2022-02-07 05:03:59.121 EnterpriseGatewayApp] Kernel args: {'env': {'PATH': '/opt/conda/bin:/usr/local/nvidia/bin:/usr/local/cuda/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/local/spark/bin', 'KERNEL_LAUNCH_TIMEOUT': '900', 'KERNEL_WORKING_DIR': '/home/s.ruzavin', 'KERNEL_USERNAME': 'jovyan'}, 'kernel_headers': {}, 'kernel_name': 'spark_python_yarn_cluster'}

Error in jupyterlab spawned by jupyterhub: Screenshot 2022-02-07 at 09 42 20

Environment

notebook 6.2.0 jupyterhub 1.3.0 jupyterlab 3.0.12

Environment variables are set in docker image:

KERNEL_LAUNCH_TIMEOUT=900
EG_KERNEL_LAUNCH_TIMEOUT=900
KG_REQUEST_TIMEOUT=900
EG_CULL_IDLE_TIMEOUT=43200
EG_CULL_INTERVAL=900
EG_CULL_CONNECTED=False
KG_REQUEST_TIMEOUT=900
KG_CONNECT_TIMEOUT=900

kernel config:

{
  "language": "python",
  "display_name": "Spark - Python (YARN Cluster Mode)",
  "metadata": {
    "process_proxy": {
      "class_name": "enterprise_gateway.services.processproxies.yarn.YarnClusterProcessProxy"
    }
  },
  "env": {
    "SPARK_HOME": "/usr/local/spark2.4.0-cdh6.3.2",
    "PYSPARK_PYTHON": "./venv/bin/python3",
    "PYTHONPATH": "./venv/lib/python3.7/site-packages:./venv/bin/python3",
    "SPARK_OPTS": "--master yarn --deploy-mode cluster --name ${KERNEL_ID:-ERROR__NO__KERNEL_ID} --conf spark.yarn.submit.waitAppCompletion=false --conf spark.yarn.am.waitTime=1d --queue redacted --conf spark.yarn.appMasterEnv.PYTHONUSERBASE=/home/${KERNEL_USERNAME}/.local --conf spark.yarn.appMasterEnv.PYTHONPATH=./venv/lib/python3.7/site-packages:./venv/bin/python3 --conf spark.yarn.appMasterEnv.PATH=./venv/bin:$PATH --conf spark.yarn.archive=hdfs://--redacted--:8020/user/jovyan/spark-libs.jar --archives hdfs://--redacted--:8020/user/jovyan/venv.tar.gz#venv,hdfs://--redacted--:8020/user/jovyan/pyspark.zip,hdfs://--redacted--:8020/user/jovyan/py4j-0.10.7-src.zip ${KERNEL_EXTRA_SPARK_OPTS}",
    "EG_KERNEL_LAUNCH_TIMEOUT": "900",
    "EG_SOCKET_TIMEOUT": "900",
    "KG_CONNECT_TIMEOUT": "900",
    "KG_REQUEST_TIMEOUT": "900",
    "LAUNCH_OPTS": ""
  },
  "argv": [
    "/usr/local/share/jupyter/kernels/spark_python_yarn_cluster/bin/run.sh",
    "--RemoteProcessProxy.kernel-id",
    "{kernel_id}",
    "--RemoteProcessProxy.response-address",
    "--redacted--:8877",
    "--RemoteProcessProxy.port-range",
    "{port_range}",
    "--RemoteProcessProxy.spark-context-initialization-mode",
    "lazy"
  ]
}

When I start docker with user notebook as "docker run ... jupyter lab" then everything is perfect, remote kernel starts in 3-5 minutes respecting env's but writing to logs: [W ...... SingleUserLabApp handlers:622] Error attempting to connect to Gateway server url 'http://--redacted--:8807'. Ensure gateway url is valid and the Gateway instance is running. until notebook gets finally connected to enterprise-gateway.

But when I use JupyterHub to spawn user notebooks with DockerSpawner, the notebook always timeouts 504 gateway timeout in 60 seconds. Enterprise Gateway logs are the same.

When notebook is started with "jupyter labhub" it does not respect the timeout (KERNEL_LAUNCH_TIMEOUT) for connecting to the enterprise-gateway.

Kindly ask to advise how to increase gateway timeout when notebook is spawned as labhub.

kevin-bates commented 2 years ago

Hi @s3uzz - thanks for transferring this discussion from your gitter post.

Unfortunately, you did not include the portion of the EG logs that contain the very start of the start-kernel request. You should see information about which process proxy is in use along with all of the environment variables sent to the process launch. This will show the value (or absence) of KERNEL_LAUNCH_TIMEOUT. However, just as important are the EG_CONNECT_TIMEOUT and EG_REQUEST_TIMEOUT. These, along with KERNEL_LAUNCH_TIMEOUT should be configured in hub's configuration in its Lab environment settings when Hub starts Lab.

So, it might also be helpful to run docker exec -it <container-id-of-Lab-container> bash and capture its environment.

You may also find helpful information in this blog post: https://blog.jupyter.org/on-demand-notebooks-with-jupyterhub-jupyter-enterprise-gateway-and-kubernetes-e8e423695cbf Please keep in mind however, that this blog post was written before NB2KG was integrated with Notebook/Jupyter Server (via GatewayClient) but the env setting portion of things (using extraEnv) should still hold.

cc: @lresende

david1155 commented 2 years ago

Dear Kevin,

Thank you very much for prompt reply.

Logs from EG when I start jupyterlab spawned by jupyterhub:

# docker run --name enterprise-gw-test --rm -it --gpus all -p 8807:8807 -p 8877:8877 harbor.vmet.ro/bigdata/jupyter-team_profile:stable jupyter enterprisegateway --debug
[D 2022-02-07 15:53:55.218 EnterpriseGatewayApp] Searching ['/home/jovyan', '/home/jovyan/.jupyter', '/home/jovyan/.local/etc/jupyter', '/opt/conda/etc/jupyter', '/usr/local/etc/jupyter', '/etc/jupyter'] for config files
[D 2022-02-07 15:53:55.218 EnterpriseGatewayApp] Looking for jupyter_config in /etc/jupyter
[D 2022-02-07 15:53:55.218 EnterpriseGatewayApp] Looking for jupyter_config in /usr/local/etc/jupyter
[D 2022-02-07 15:53:55.219 EnterpriseGatewayApp] Looking for jupyter_config in /opt/conda/etc/jupyter
[D 2022-02-07 15:53:55.219 EnterpriseGatewayApp] Looking for jupyter_config in /home/jovyan/.local/etc/jupyter
[D 2022-02-07 15:53:55.219 EnterpriseGatewayApp] Looking for jupyter_config in /home/jovyan/.jupyter
[D 2022-02-07 15:53:55.219 EnterpriseGatewayApp] Looking for jupyter_config in /home/jovyan
[D 2022-02-07 15:53:55.220 EnterpriseGatewayApp] Looking for jupyter_enterprise_gateway_config in /etc/jupyter
[D 2022-02-07 15:53:55.220 EnterpriseGatewayApp] Looking for jupyter_enterprise_gateway_config in /usr/local/etc/jupyter
[D 2022-02-07 15:53:55.221 EnterpriseGatewayApp] Looking for jupyter_enterprise_gateway_config in /opt/conda/etc/jupyter
[D 2022-02-07 15:53:55.221 EnterpriseGatewayApp] Looking for jupyter_enterprise_gateway_config in /home/jovyan/.local/etc/jupyter
[D 2022-02-07 15:53:55.221 EnterpriseGatewayApp] Looking for jupyter_enterprise_gateway_config in /home/jovyan/.jupyter
[D 2022-02-07 15:53:55.221 EnterpriseGatewayApp] Looking for jupyter_enterprise_gateway_config in /home/jovyan
[D 220207 15:53:55 selector_events:58] Using selector: EpollSelector
[I 2022-02-07 15:53:55.235 EnterpriseGatewayApp] Jupyter Enterprise Gateway 2.6.0 is available at http://0.0.0.0:8807
[D 2022-02-07 15:54:07.660 EnterpriseGatewayApp] Found kernel python3 in /opt/conda/share/jupyter/kernels
[D 2022-02-07 15:54:07.661 EnterpriseGatewayApp] Found kernel spark_scala_yarn_cluster in /opt/conda/share/jupyter/kernels
[D 2022-02-07 15:54:07.661 EnterpriseGatewayApp] Found kernel spark_python_yarn_cluster in /opt/conda/share/jupyter/kernels
[D 2022-02-07 15:54:07.661 EnterpriseGatewayApp] Found kernel spylon-kernel in /opt/conda/share/jupyter/kernels
[I 220207 15:54:08 web:2243] 200 GET /api/kernelspecs (172.17.0.1) 707.44ms
[W 220207 15:54:08 web:1787] 403 GET /api/kernels (172.17.0.1): Forbidden
[W 220207 15:54:08 web:2243] 403 GET /api/kernels (172.17.0.1) 1.11ms
[D 2022-02-07 15:54:08.053 EnterpriseGatewayApp] Found kernel python3 in /opt/conda/share/jupyter/kernels
[D 2022-02-07 15:54:08.053 EnterpriseGatewayApp] Found kernel spark_scala_yarn_cluster in /opt/conda/share/jupyter/kernels
[D 2022-02-07 15:54:08.053 EnterpriseGatewayApp] Found kernel spark_python_yarn_cluster in /opt/conda/share/jupyter/kernels
[D 2022-02-07 15:54:08.053 EnterpriseGatewayApp] Found kernel spylon-kernel in /opt/conda/share/jupyter/kernels
[I 220207 15:54:08 web:2243] 200 GET /api/kernelspecs (172.17.0.1) 4.43ms
[W 220207 15:54:08 web:1787] 403 GET /api/kernels (172.17.0.1): Forbidden
[W 220207 15:54:08 web:2243] 403 GET /api/kernels (172.17.0.1) 0.87ms
[I 2022-02-07 15:54:09.523 EnterpriseGatewayApp] KernelSpecCache: adding/updating kernelspec: python3
[D 2022-02-07 15:54:09.523 EnterpriseGatewayApp] Serving kernel resource from: /opt/conda/share/jupyter/kernels/python3
[I 220207 15:54:09 web:2243] 200 GET /kernelspecs/python3/logo-64x64.png (172.17.0.1) 2.78ms
[I 2022-02-07 15:54:09.526 EnterpriseGatewayApp] KernelSpecCache: adding/updating kernelspec: spark_python_yarn_cluster
[D 2022-02-07 15:54:09.526 EnterpriseGatewayApp] Serving kernel resource from: /opt/conda/share/jupyter/kernels/spark_python_yarn_cluster
[I 220207 15:54:09 web:2243] 200 GET /kernelspecs/spark_python_yarn_cluster/logo-64x64.png (172.17.0.1) 1.82ms
[I 2022-02-07 15:54:09.528 EnterpriseGatewayApp] KernelSpecCache: adding/updating kernelspec: spark_scala_yarn_cluster
[D 2022-02-07 15:54:09.528 EnterpriseGatewayApp] Serving kernel resource from: /opt/conda/share/jupyter/kernels/spark_scala_yarn_cluster
[I 220207 15:54:09 web:2243] 200 GET /kernelspecs/spark_scala_yarn_cluster/logo-64x64.png (172.17.0.1) 2.02ms
[I 2022-02-07 15:54:09.539 EnterpriseGatewayApp] KernelSpecCache: adding/updating kernelspec: spylon-kernel
[D 2022-02-07 15:54:09.539 EnterpriseGatewayApp] Serving kernel resource from: /opt/conda/share/jupyter/kernels/spylon-kernel
[I 220207 15:54:09 web:2243] 200 GET /kernelspecs/spylon-kernel/logo-64x64.png (172.17.0.1) 1.79ms
[D 2022-02-07 15:54:12.617 EnterpriseGatewayApp] RemoteMappingKernelManager.start_kernel: spark_python_yarn_cluster, kernel_username: jovyan
[D 2022-02-07 15:54:12.631 EnterpriseGatewayApp] Instantiating kernel 'Spark - Python (YARN Cluster Mode)' with process proxy: enterprise_gateway.services.processproxies.yarn.YarnClusterProcessProxy
[I 2022-02-07 15:54:13.188 EnterpriseGatewayApp] Enterprise Gateway is bound to port 8877 for remote kernel connection information.
[D 2022-02-07 15:54:13.189 EnterpriseGatewayApp] YarnClusterProcessProxy shutdown wait time adjusted to 15.0 seconds.
[D 2022-02-07 15:54:13.189 EnterpriseGatewayApp] Starting kernel (async): ['/usr/local/share/jupyter/kernels/spark_python_yarn_cluster/bin/run.sh', '--RemoteProcessProxy.kernel-id', '30e4cd13-633e-4e1c-9200-3cb1dc3d44a6', '--RemoteProcessProxy.response-address', '--redacted--:8877', '--RemoteProcessProxy.port-range', '0..0', '--RemoteProcessProxy.spark-context-initialization-mode', 'lazy']
[D 2022-02-07 15:54:13.190 EnterpriseGatewayApp] Launching kernel: 'Spark - Python (YARN Cluster Mode)' with command: ['/usr/local/share/jupyter/kernels/spark_python_yarn_cluster/bin/run.sh', '--RemoteProcessProxy.kernel-id', '30e4cd13-633e-4e1c-9200-3cb1dc3d44a6', '--RemoteProcessProxy.response-address', '--redacted--:8877', '--RemoteProcessProxy.port-range', '0..0', '--RemoteProcessProxy.spark-context-initialization-mode', 'lazy']
[D 2022-02-07 15:54:13.876 EnterpriseGatewayApp] BaseProcessProxy.launch_process() env: {'PATH': '/opt/conda/bin:/usr/local/nvidia/bin:/usr/local/cuda/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/local/spark/bin', 'KERNEL_LAUNCH_TIMEOUT': '900', 'KERNEL_WORKING_DIR': '/home/s.ruzavin', 'KERNEL_USERNAME': 'jovyan', 'SPARK_HOME': '/usr/local/spark2.4.0-cdh6.3.2', 'PYSPARK_PYTHON': './venv/bin/python3', 'PYTHONPATH': './venv/lib/python3.7/site-packages:./venv/bin/python3', 'SPARK_OPTS': '--master yarn --deploy-mode cluster --name ${KERNEL_ID:-ERROR__NO__KERNEL_ID} --conf spark.yarn.submit.waitAppCompletion=false --conf spark.yarn.am.waitTime=1d --queue root.qvant.DS --conf spark.yarn.appMasterEnv.PYTHONUSERBASE=/home/jovyan/.local --conf spark.yarn.appMasterEnv.PYTHONPATH=./venv/lib/python3.7/site-packages:./venv/bin/python3 --conf spark.yarn.appMasterEnv.PATH=./venv/bin:/opt/conda/bin:/usr/local/nvidia/bin:/usr/local/cuda/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/local/spark/bin --conf spark.yarn.archive=hdfs://--redacted--:8020/user/jovyan/spark-libs.jar --archives hdfs://--redacted--:8020/user/jovyan/venv.tar.gz#venv,hdfs://--redacted--:8020/user/jovyan/pyspark.zip,hdfs://--redacted--:8020/user/jovyan/py4j-0.10.7-src.zip ${KERNEL_EXTRA_SPARK_OPTS}', 'EG_KERNEL_LAUNCH_TIMEOUT': '900', 'EG_SOCKET_TIMEOUT': '900', 'KG_CONNECT_TIMEOUT': '900', 'KG_REQUEST_TIMEOUT': '900', 'LAUNCH_OPTS': '', 'KERNEL_GATEWAY': '1', 'EG_MIN_PORT_RANGE_SIZE': '1000', 'EG_MAX_PORT_RANGE_RETRIES': '5', 'KERNEL_ID': '30e4cd13-633e-4e1c-9200-3cb1dc3d44a6', 'KERNEL_LANGUAGE': 'python', 'EG_IMPERSONATION_ENABLED': 'False'}
[D 2022-02-07 15:54:13.882 EnterpriseGatewayApp] Yarn cluster kernel launched using YARN RM address: http://--redacted--:8088, pid: 14, Kernel ID: 30e4cd13-633e-4e1c-9200-3cb1dc3d44a6, cmd: '['/usr/local/share/jupyter/kernels/spark_python_yarn_cluster/bin/run.sh', '--RemoteProcessProxy.kernel-id', '30e4cd13-633e-4e1c-9200-3cb1dc3d44a6', '--RemoteProcessProxy.response-address', '--redacted--:8877', '--RemoteProcessProxy.port-range', '0..0', '--RemoteProcessProxy.spark-context-initialization-mode', 'lazy']'

Starting IPython kernel for Spark in Yarn Cluster mode on behalf of user jovyan

+ eval exec /usr/local/spark2.4.0-cdh6.3.2/bin/spark-submit '--master yarn --deploy-mode cluster --name ${KERNEL_ID:-ERROR__NO__KERNEL_ID} --conf spark.yarn.submit.waitAppCompletion=false --conf spark.yarn.am.waitTime=1d --queue root.qvant.DS --conf spark.yarn.appMasterEnv.PYTHONUSERBASE=/home/jovyan/.local --conf spark.yarn.appMasterEnv.PYTHONPATH=./venv/lib/python3.7/site-packages:./venv/bin/python3 --conf spark.yarn.appMasterEnv.PATH=./venv/bin:/opt/conda/bin:/usr/local/nvidia/bin:/usr/local/cuda/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/local/spark/bin --conf spark.yarn.archive=hdfs://--redacted--:8020/user/jovyan/spark-libs.jar --archives hdfs://--redacted--:8020/user/jovyan/venv.tar.gz#venv,hdfs://--redacted--:8020/user/jovyan/pyspark.zip,hdfs://--redacted--:8020/user/jovyan/py4j-0.10.7-src.zip ${KERNEL_EXTRA_SPARK_OPTS}' '' /usr/local/share/jupyter/kernels/spark_python_yarn_cluster/scripts/launch_ipykernel.py '' --RemoteProcessProxy.kernel-id 30e4cd13-633e-4e1c-9200-3cb1dc3d44a6 --RemoteProcessProxy.response-address --redacted--:8877 --RemoteProcessProxy.port-range 0..0 --RemoteProcessProxy.spark-context-initialization-mode lazy
++ exec /usr/local/spark2.4.0-cdh6.3.2/bin/spark-submit --master yarn --deploy-mode cluster --name 30e4cd13-633e-4e1c-9200-3cb1dc3d44a6 --conf spark.yarn.submit.waitAppCompletion=false --conf spark.yarn.am.waitTime=1d --queue root.qvant.DS --conf spark.yarn.appMasterEnv.PYTHONUSERBASE=/home/jovyan/.local --conf spark.yarn.appMasterEnv.PYTHONPATH=./venv/lib/python3.7/site-packages:./venv/bin/python3 --conf spark.yarn.appMasterEnv.PATH=./venv/bin:/opt/conda/bin:/usr/local/nvidia/bin:/usr/local/cuda/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/local/spark/bin --conf spark.yarn.archive=hdfs://--redacted--:8020/user/jovyan/spark-libs.jar --archives hdfs://--redacted--:8020/user/jovyan/venv.tar.gz#venv,hdfs://--redacted--:8020/user/jovyan/pyspark.zip,hdfs://--redacted--:8020/user/jovyan/py4j-0.10.7-src.zip /usr/local/share/jupyter/kernels/spark_python_yarn_cluster/scripts/launch_ipykernel.py --RemoteProcessProxy.kernel-id 30e4cd13-633e-4e1c-9200-3cb1dc3d44a6 --RemoteProcessProxy.response-address --redacted--:8877 --RemoteProcessProxy.port-range 0..0 --RemoteProcessProxy.spark-context-initialization-mode lazy
[D 2022-02-07 15:54:14.401 EnterpriseGatewayApp] ApplicationID not yet assigned for KernelID: '30e4cd13-633e-4e1c-9200-3cb1dc3d44a6' - retrying...
[D 2022-02-07 15:54:14.923 EnterpriseGatewayApp] ApplicationID not yet assigned for KernelID: '30e4cd13-633e-4e1c-9200-3cb1dc3d44a6' - retrying...
[D 2022-02-07 15:54:15.444 EnterpriseGatewayApp] ApplicationID not yet assigned for KernelID: '30e4cd13-633e-4e1c-9200-3cb1dc3d44a6' - retrying...
2022-02-07 15:54:15 WARN  NativeCodeLoader:60 - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
[D 2022-02-07 15:54:15.963 EnterpriseGatewayApp] ApplicationID not yet assigned for KernelID: '30e4cd13-633e-4e1c-9200-3cb1dc3d44a6' - retrying...
[D 2022-02-07 15:54:16.484 EnterpriseGatewayApp] ApplicationID not yet assigned for KernelID: '30e4cd13-633e-4e1c-9200-3cb1dc3d44a6' - retrying...
2022-02-07 15:54:16 INFO  Client:57 - Requesting a new application from cluster with 13 NodeManagers
2022-02-07 15:54:17 INFO  Configuration:2662 - resource-types.xml not found
2022-02-07 15:54:17 INFO  ResourceUtils:419 - Unable to find 'resource-types.xml'.
2022-02-07 15:54:17 INFO  Client:57 - Verifying our application has not requested more than the maximum memory capability of the cluster (65536 MB per container)
2022-02-07 15:54:17 INFO  Client:57 - Will allocate AM container, with 1408 MB memory including 384 MB overhead
[D 2022-02-07 15:54:17.023 EnterpriseGatewayApp] ApplicationID not yet assigned for KernelID: '30e4cd13-633e-4e1c-9200-3cb1dc3d44a6' - retrying...
2022-02-07 15:54:17 INFO  Client:57 - Setting up container launch context for our AM
2022-02-07 15:54:17 INFO  Client:57 - Setting up the launch environment for our AM container
2022-02-07 15:54:17 INFO  Client:57 - Preparing resources for our AM container
2022-02-07 15:54:17 INFO  Client:57 - Source and destination file systems are the same. Not copying hdfs://--redacted--:8020/user/jovyan/spark-libs.jar
2022-02-07 15:54:17 INFO  Client:57 - Source and destination file systems are the same. Not copying hdfs://--redacted--:8020/user/jovyan/venv.tar.gz#venv
2022-02-07 15:54:17 INFO  Client:57 - Source and destination file systems are the same. Not copying hdfs://--redacted--:8020/user/jovyan/pyspark.zip
2022-02-07 15:54:17 INFO  Client:57 - Source and destination file systems are the same. Not copying hdfs://--redacted--:8020/user/jovyan/py4j-0.10.7-src.zip
2022-02-07 15:54:17 INFO  Client:57 - Uploading resource file:/usr/local/share/jupyter/kernels/spark_python_yarn_cluster/scripts/launch_ipykernel.py -> hdfs://--redacted--:8020/user/jovyan/.sparkStaging/application_1634600395396_89258/launch_ipykernel.py
2022-02-07 15:54:17 WARN  Client:69 - Same name resource file:/usr/local/spark2.4.0-cdh6.3.2/python/lib/pyspark.zip added multiple times to distributed cache
2022-02-07 15:54:17 WARN  Client:69 - Same name resource file:/usr/local/spark2.4.0-cdh6.3.2/python/lib/py4j-0.10.7-src.zip added multiple times to distributed cache
[D 2022-02-07 15:54:17.543 EnterpriseGatewayApp] ApplicationID not yet assigned for KernelID: '30e4cd13-633e-4e1c-9200-3cb1dc3d44a6' - retrying...
2022-02-07 15:54:17 INFO  Client:57 - Uploading resource file:/tmp/spark-1c6c145e-6d5b-4597-9c9e-90b3749eb4ac/__spark_conf__5338634888859863417.zip -> hdfs://--redacted--:8020/user/jovyan/.sparkStaging/application_1634600395396_89258/__spark_conf__.zip
[D 2022-02-07 15:54:18.064 EnterpriseGatewayApp] ApplicationID not yet assigned for KernelID: '30e4cd13-633e-4e1c-9200-3cb1dc3d44a6' - retrying...
[W 220207 15:54:18 web:1787] 403 GET /api/kernels (172.17.0.1): Forbidden
[W 220207 15:54:18 web:2243] 403 GET /api/kernels (172.17.0.1) 6.98ms
[D 2022-02-07 15:54:18.582 EnterpriseGatewayApp] ApplicationID not yet assigned for KernelID: '30e4cd13-633e-4e1c-9200-3cb1dc3d44a6' - retrying...
[D 2022-02-07 15:54:19.097 EnterpriseGatewayApp] ApplicationID not yet assigned for KernelID: '30e4cd13-633e-4e1c-9200-3cb1dc3d44a6' - retrying...
2022-02-07 15:54:19 INFO  SecurityManager:57 - Changing view acls to: jovyan
2022-02-07 15:54:19 INFO  SecurityManager:57 - Changing modify acls to: jovyan
2022-02-07 15:54:19 INFO  SecurityManager:57 - Changing view acls groups to: 
2022-02-07 15:54:19 INFO  SecurityManager:57 - Changing modify acls groups to: 
2022-02-07 15:54:19 INFO  SecurityManager:57 - SecurityManager: authentication disabled; ui acls disabled; users  with view permissions: Set(jovyan); groups with view permissions: Set(); users  with modify permissions: Set(jovyan); groups with modify permissions: Set()
2022-02-07 15:54:19 INFO  HiveConf:188 - Found configuration file null
2022-02-07 15:54:19 INFO  YARNHadoopDelegationTokenManager:57 - Attempting to load user's ticket cache.
[D 2022-02-07 15:54:19.619 EnterpriseGatewayApp] ApplicationID not yet assigned for KernelID: '30e4cd13-633e-4e1c-9200-3cb1dc3d44a6' - retrying...
2022-02-07 15:54:19 INFO  Client:57 - Submitting application application_1634600395396_89258 to ResourceManager
2022-02-07 15:54:19 INFO  YarnClientImpl:310 - Submitted application application_1634600395396_89258
2022-02-07 15:54:19 INFO  Client:57 - Application report for application_1634600395396_89258 (state: ACCEPTED)
2022-02-07 15:54:19 INFO  Client:57 - 
     client token: N/A
     diagnostics: N/A
     ApplicationMaster host: N/A
     ApplicationMaster RPC port: -1
     queue: root.qvant.DS
     start time: 1644249259671
     final status: UNDEFINED
     tracking URL: http://--redacted--:8088/proxy/application_1634600395396_89258/
     user: jovyan
2022-02-07 15:54:19 INFO  ShutdownHookManager:57 - Shutdown hook called
2022-02-07 15:54:19 INFO  ShutdownHookManager:57 - Deleting directory /tmp/spark-f8741313-61a5-4da6-9b72-97c125b76eaa
2022-02-07 15:54:19 INFO  ShutdownHookManager:57 - Deleting directory /tmp/spark-1c6c145e-6d5b-4597-9c9e-90b3749eb4ac
[I 2022-02-07 15:54:20.136 EnterpriseGatewayApp] ApplicationID: 'application_1634600395396_89258' assigned for KernelID: '30e4cd13-633e-4e1c-9200-3cb1dc3d44a6', state: ACCEPTED, 7.0 seconds after starting.
[D 2022-02-07 15:54:20.146 EnterpriseGatewayApp] 12: State: 'ACCEPTED', Host: 'ak1-st-data-03.msk.st.vmet.ro', 
.................................
[D 2022-02-07 15:55:53.645 EnterpriseGatewayApp] Waiting for KernelID '30e4cd13-633e-4e1c-9200-3cb1dc3d44a6' to send connection info from host '--redacted--' - retrying...
^C[I 2022-02-07 15:55:53.864 EnterpriseGatewayApp] Interrupted...

I also attached to the spawner docker labhub container:

[I 2022-02-07 17:35:32.247 SingleUserLabApp managers:76] Request start kernel: kernel_id=None, path=''
[W 2022-02-07 17:35:32.249 SingleUserLabApp configurable:193] Config option `kernel_info_timeout` not recognized by `GatewayKernelManager`.
[I 2022-02-07 17:35:32.494 SingleUserLabApp log:181] 204 PUT /user/--redacted--/lab/api/workspaces/auto-r?1644255332407 (--redacted--@::ffff:--redacted--) 1.75ms
[I 2022-02-07 17:35:33.186 SingleUserLabApp log:181] 200 GET /user/--redacted--/ipyparallel/clusters?1644255333099 (s.ruzavin@::ffff:--redacted--) 1.85ms
[W 2022-02-07 17:35:33.945 SingleUserLabApp web:1787] 403 GET /user/--redacted--/api/kernels?1644255333838 (::ffff:--redacted--): Error attempting to connect to Gateway server url 'http://--redacted--:8807'.  Ensure gateway url is valid and the Gateway instance is running.
[W 2022-02-07 17:35:33.946 SingleUserLabApp handlers:622] Error attempting to connect to Gateway server url 'http://--redacted--:8807'.  Ensure gateway url is valid and the Gateway instance is running.

The same errors when I launch jupyter lab manually, but it gets connected after several minutes.

Thank you very much for pointing to EG_CONNECT_TIMEOUT and EG_REQUEST_TIMEOUT in JupyterHub config. I added following code to jupyterhub config:

c.Spawner.environment = {
    'KERNEL_LAUNCH_TIMEOUT': 900,
    'EG_KERNEL_LAUNCH_TIMEOUT': 900,
    'KG_REQUEST_TIMEOUT': 900,
    'EG_CULL_IDLE_TIMEOUT': 900,
    'EG_CULL_INTERVAL': 900,
    'EG_CULL_CONNECTED': False,
    'KG_CONNECT_TIMEOUT': 900,
    'KG_REQUEST_TIMEOUT': 900,
    **'EG_CONNECT_TIMEOUT': 900, 
    'EG_REQUEST_TIMEOUT':900,**
}

Thank you for the link with configuration example with extraEnv for Kubernetes. We use Docker, and according to JupyterHub docs - Configuring user environments I have set following parameters in jupyter_server_config.py:

c.MappingKernelManager.kernel_info_timeout = 900
c.AsyncMappingKernelManager.kernel_info_timeout = 900
c.GatewayKernelManager.kernel_info_timeout = 900

But kernel timeouts again exactly in 60 seconds when spawned by JupyterHub.

kevin-bates commented 2 years ago

ok, this is odd, but I believe we might be setting the wrong envs in the lab instance since this is using the integrated gateway code and not the old NB2KG stuff. (I apologize for continuing that aspect of things as I forgot about the following.)

The env variables that must be set in the Lab process env are JUPYTER_GATEWAY_CONNECT_TIMEOUT and JUPYTER_GATEWAY_REQUEST_TIMEOUT. However, these variables default to 40.0, not 60 seconds, which leads me to wonder if there's another "proxy" kind of timeout happening here between Hub and Lab.

I'm not sure which is easier to configure within Hub, but the applicable configuration options for Lab are:

c.GatewayClient.connect_timeout = 900
c.GatewayClient.request_timeout = 900

I would recommend setting the env values if there's no difference, this way you can confirm the Lab container's env, which you did not provide. Showing the logs for Lab probably won't be helpful - we really need to see the container's env. Are you able to run docker exec -it against the Lab container and get to a shell prompt? Not sure if docker inspect would provide this kind of info, but it may (depending on how Hub starts Lab) and is another reason to prefer using the env route over config. What does docker inspect <lab-contaianer-id> provide?

btw, the connect/request timeout values only apply to Lab (when configured against a Gateway). They are used when establishing connections to the Gateway so have no merit with respect to the Gateway process itself.

When you do start the kernel from Lab after it has been launched from Hub, are you able to determine the kernel DOES start after its 3-5 minute window via the Hadoop/YARN side of things? I know that the 60 timeout interferes on the front-end, but I just want to confirm the back-end (EG and YARN) are behaving as if the front-end is still connected.

david1155 commented 2 years ago

Kevin, thank you for your reply and providing applicable configuration options.

I have updated docker image with JUPYTER_GATEWAY_CONNECT_TIMEOUT and JUPYTER_GATEWAY_REQUEST_TIMEOUT.

Here is environment of container with jupyter labhub spawned by JupyterHub:

NV_CUDA_COMPAT_PACKAGE=cuda-compat-11-4
CONDA_SHLVL=1
LC_ALL=en_US.UTF-8
LS_COLORS=rs=0:di=01;34:ln=01;36:mh=00:pi=40;33:so=01;35:do=01;35:bd=40;33;01:cd=40;33;01:or=40;31;01:mi=00:su=37;41:sg=30;43:ca=30;41:tw=30;42:ow=34;42:st=37;44:ex=01;32:*.tar=01;31:*.tgz=01;31:*.arc=01;31:*.arj=01;31:*.taz=01;31:*.lha=01;31:*.lz4=01;31:*.lzh=01;31:*.lzma=01;31:*.tlz=01;31:*.txz=01;31:*.tzo=01;31:*.t7z=01;31:*.zip=01;31:*.z=01;31:*.Z=01;31:*.dz=01;31:*.gz=01;31:*.lrz=01;31:*.lz=01;31:*.lzo=01;31:*.xz=01;31:*.zst=01;31:*.tzst=01;31:*.bz2=01;31:*.bz=01;31:*.tbz=01;31:*.tbz2=01;31:*.tz=01;31:*.deb=01;31:*.rpm=01;31:*.jar=01;31:*.war=01;31:*.ear=01;31:*.sar=01;31:*.rar=01;31:*.alz=01;31:*.ace=01;31:*.zoo=01;31:*.cpio=01;31:*.7z=01;31:*.rz=01;31:*.cab=01;31:*.wim=01;31:*.swm=01;31:*.dwm=01;31:*.esd=01;31:*.jpg=01;35:*.jpeg=01;35:*.mjpg=01;35:*.mjpeg=01;35:*.gif=01;35:*.bmp=01;35:*.pbm=01;35:*.pgm=01;35:*.ppm=01;35:*.tga=01;35:*.xbm=01;35:*.xpm=01;35:*.tif=01;35:*.tiff=01;35:*.png=01;35:*.svg=01;35:*.svgz=01;35:*.mng=01;35:*.pcx=01;35:*.mov=01;35:*.mpg=01;35:*.mpeg=01;35:*.m2v=01;35:*.mkv=01;35:*.webm=01;35:*.ogm=01;35:*.mp4=01;35:*.m4v=01;35:*.mp4v=01;35:*.vob=01;35:*.qt=01;35:*.nuv=01;35:*.wmv=01;35:*.asf=01;35:*.rm=01;35:*.rmvb=01;35:*.flc=01;35:*.avi=01;35:*.fli=01;35:*.flv=01;35:*.gl=01;35:*.dl=01;35:*.xcf=01;35:*.xwd=01;35:*.yuv=01;35:*.cgm=01;35:*.emf=01;35:*.ogv=01;35:*.ogx=01;35:*.aac=00;36:*.au=00;36:*.flac=00;36:*.m4a=00;36:*.mid=00;36:*.midi=00;36:*.mka=00;36:*.mp3=00;36:*.mpc=00;36:*.ogg=00;36:*.ra=00;36:*.wav=00;36:*.oga=00;36:*.opus=00;36:*.spx=00;36:*.xspf=00;36:
LD_LIBRARY_PATH=/usr/local/nvidia/lib:/usr/local/nvidia/lib64
JUPYTERHUB_CLIENT_ID=jupyterhub-user-s.ruzavin
CONDA_EXE=/opt/conda/bin/conda
MINIFORGE_VERSION=4.9.2-7
JUPYTERHUB_ADMIN_ACCESS=1
LANG=en_US.UTF-8
HOSTNAME=48b19b0602c8
KERNEL_LAUNCH_TIMEOUT=900
KERNELS_FOLDER=/opt/conda/share/jupyter/kernels
NB_UID=1000
GSETTINGS_SCHEMA_DIR_CONDA_BACKUP=
CONDA_PREFIX=/opt/conda
CONDA_DIR=/opt/conda
NV_LIBNPP_VERSION=11.4.0.110-1
CONDA_VERSION=4.9.2
EG_YARN_ENDPOINT=http://--redacted--:8088
NVIDIA_VISIBLE_DEVICES=all
NV_LIBCUSPARSE_VERSION=11.6.0.120-1
_CE_M=
JUPYTERHUB_ACTIVITY_URL=http://--redacted--:8081/hub/api/users/s.ruzavin/activity
EG_CONNECT_TIMEOUT=900
EG_CULL_IDLE_TIMEOUT=900
JUPYTERHUB_BASE_URL=/
EG_ALT_YARN_ENDPOINT=http://--redacted--:8088
NCCL_VERSION=2.11.4-1
PWD=/home/jovyan
NVARCH=x86_64
HOME=/home/jovyan
CONDA_PYTHON_EXE=/opt/conda/bin/python
NV_LIBNCCL_PACKAGE_VERSION=2.11.4-1
EG_MIRROR_WORKING_DIRS=True
JUPYTERHUB_USER=s.ruzavin
NV_LIBNCCL_PACKAGE=libnccl2=2.11.4-1+cuda11.4
DEBIAN_FRONTEND=noninteractive
CONDA_OVERRIDE_CUDA=11.2
NV_CUDA_LIB_VERSION=11.4.3-1
EG_KERNEL_LAUNCH_TIMEOUT=900
_CE_CONDA=
GSETTINGS_SCHEMA_DIR=/opt/conda/share/glib-2.0/schemas
SCALA_KERNEL_DIR=/opt/conda/share/jupyter/kernels/python3
HADOOP_CONF_DIR=/etc/hadoop/conf
KG_CONNECT_TIMEOUT=900
EG_CULL_INTERVAL=900
NV_LIBNPP_PACKAGE=libnpp-11-4=11.4.0.110-1
NV_LIBNCCL_PACKAGE_NAME=libnccl2
SPARK_HOME=/usr/local/spark
NB_USER=jovyan
EG_RESPONSE_PORT=8877
NV_NVTX_VERSION=11.4.120-1
NV_LIBCUBLAS_VERSION=11.6.5.2-1
NV_LIBCUBLAS_PACKAGE=libcublas-11-4=11.6.5.2-1
JUPYTERHUB_SERVICE_PREFIX=/user/--redacted--/
NV_CUDNN_VERSION=8.2.4.15
CONDA_PROMPT_MODIFIER=(base) 
EG_CULL_CONNECTED=False
JUPYTERHUB_SERVER_NAME=
SHELL=/bin/bash
TERM=xterm
JUPYTER_GATEWAY_REQUEST_TIMEOUT== 900
SPARK_OPTS=--driver-java-options=-Xms1024M --driver-java-options=-Xmx4096M --driver-java-options=-Dlog4j.logLevel=info --queue --redacted--
CUDA_VERSION=11.4.3
NV_LIBCUBLAS_PACKAGE_NAME=libcublas-11-4
EG_PORT=8807
EG_REQUEST_TIMEOUT=900
NVIDIA_DRIVER_CAPABILITIES=compute,utility
KG_REQUEST_TIMEOUT=900
JUPYTERHUB_API_URL=http://--redacted--:8081/hub/api
SHLVL=1
LANGUAGE=en_US.UTF-8
JUPYTERHUB_HOST=
NVIDIA_REQUIRE_CUDA=cuda>=11.4 brand=tesla,driver>=418,driver<419 brand=tesla,driver>=440,driver<441 brand=tesla,driver>=450,driver<451 brand=tesla,driver>=460,driver<461
JPY_API_TOKEN=85a4e7f480244d3e9e953757457bed50
NV_CUDA_CUDART_VERSION=11.4.148-1
XDG_CACHE_HOME=/home/jovyan/.cache/
JUPYTERHUB_OAUTH_CALLBACK_URL=/user/s.ruzavin/oauth_callback
NB_GID=100
JUPYTER_GATEWAY_CONNECT_TIMEOUT== 900
NV_CUDNN_PACKAGE_NAME=libcudnn8
PYSPARK_PYTHON=./venv/bin/python3
JUPYTERHUB_API_TOKEN=85a4e7f480244d3e9e953757457bed50
PATH=/opt/conda/bin:/opt/conda/condabin:/opt/conda/bin:/usr/local/nvidia/bin:/usr/local/cuda/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/local/spark/bin
EG_IP=0.0.0.0
JUPYTER_GATEWAY_URL=http://--redacted--:8807
CONDA_DEFAULT_ENV=base
NV_CUDNN_PACKAGE=libcudnn8=8.2.4.15-1+cuda11.4
YARN_CONF_DIR=/etc/hadoop/conf
_=/usr/bin/env

Also this parameters are set in jupyter_server_config.py:

c.MappingKernelManager.kernel_info_timeout = 900
c.AsyncMappingKernelManager.kernel_info_timeout = 900
c.GatewayKernelManager.kernel_info_timeout = 900
c.AsyncMultiKernelManager.use_pending_kernels = True
c.GatewayKernelManager.use_pending_kernels = True
c.GatewayClient.connect_timeout = 900
c.GatewayClient.request_timeout = 900

I do confirm that when I start the kernel from Lab after it has been launched from Hub, kernel does start in Hadoop/Yarn: Screenshot 2022-02-08 at 07 13 38

But still I experience 60 seconds timeout after launching the kernel.

Also I would like to draw your attention to the fact that I access JupyterHub through nginx frontend. Can it be the reason of kernel starting timeout?

kevin-bates commented 2 years ago

@s3uzz - thanks for the update. The EG machinery appears to be working fine - which is consistent with the behavior you see w/o Hub in the picture.

Also I would like to draw your attention to the fact that I access JupyterHub through nginx frontend. Can it be the reason of kernel starting timeout?

Yes, this is what I was referring to in the gitter post response:

If you did need to increase this value, I suspect it needs to happen in the Hub configuration or some reverse proxy configuration that is in use. Once the server has been started, things like KERNEL_LAUNCH_TIMEOUT are between the Notebook server and EG during a kernel's start request. Since you find that machinery working independent of Hub (correct?), I would recommend pursuing the 60 second timeout issue from the Hub side of things.

This https://github.com/jupyter-server/enterprise_gateway/issues/683#issuecomment-504295374 also implies that nginx uses a 60 timeout by default.

david1155 commented 2 years ago

@kevin-bates Thank you very much. I have changed configuration for nginx and upgraded jupyterhub to 2.1.1.

But now I experience timeout exactly in 120 seconds after YARN kernel launch with error 502 Bad Gateway

Following configuration is set in nginx for jupyterhub host:

location / {
proxy_pass              http://--redacted--:8000;
include                 snippets/proxy_headers;
proxy_http_version      1.1;
proxy_set_header        Upgrade    $http_upgrade;
proxy_set_header        Connection $connection_upgrade;
proxy_set_header        X-Scheme   $scheme;
proxy_connect_timeout   600s;
proxy_read_timeout      600s;
proxy_send_timeout      600s;
proxy_buffering         off;
}

I kindly ask for advice.

kevin-bates commented 2 years ago

Hi @s3uzz - I'm sorry this is an area in which I'm just not familiar. Perhaps others have some ideas. You might try asking on the Jupyter community forum or Hub gitter channel, both referenced here.

akhileshram commented 2 years ago

Hi @s3uzz, noticed a similar issue - any chance you were able to get around the it?

david1155 commented 2 years ago

Hi @akhileshram. Unfortunately the issue is still unresolved.

kevin-bates commented 2 years ago

I took another pass through the information in this issue and see this kernel.json entry in the argv stanza from https://github.com/jupyter-server/enterprise_gateway/issues/1038#issue-1125589925

    "--RemoteProcessProxy.response-address",
    "--redacted--:8877",

typically, this entry will look like...

    "--RemoteProcessProxy.response-address",
    "{response_address}",

where {response_address} is substituted by the EG server when the kernel is launched.

I suspect the reason you're doing this is that the IP generated by EG, is a private IP, but because EG resides within a docker image, needs to be a differently scoped IP address that is accessible by processes outside the container. Is that a correct assumption?

Based on the dump of environment variables (btw, it would be really helpful to maintainers if these were sorted), I see that EG_RESPONSE_IP is not set. Here are the various EG_ variables from above...

EG_ALT_YARN_ENDPOINT=http://--redacted--:8088
EG_CONNECT_TIMEOUT=900
EG_CULL_CONNECTED=False
EG_CULL_IDLE_TIMEOUT=900
EG_CULL_INTERVAL=900
EG_IP=0.0.0.0
EG_KERNEL_LAUNCH_TIMEOUT=900
EG_MIRROR_WORKING_DIRS=True
EG_PORT=8807
EG_REQUEST_TIMEOUT=900
EG_RESPONSE_PORT=8877
EG_YARN_ENDPOINT=http://--redacted--:8088

This results in EG binding the response socket to its local (container) IP at port 8877 (by virtue of EG_RESPONSE_PORT). However, your remote kernel is attempting to send responses to --redacted--:8877 and this might be where the disconnect is occurring.

Instead, please try setting EG_RESPONSE_IP to the redacting IP and restore the kernel.json response address value to its templated form: {response_address}.

I still believe the ultimate timeout is occurring at some network boundary and it's that "controller's" (e.g., nginx, nodejs, etc.) default timeout value (or configured to that value). I did run across this post which references a NodeJS default timeout of 120000ms.

At any rate, I suspect that if the kernel launch is properly working, the timeout issue may not be applicable.

Also, I believe you had confirmed previously that EG to YARN was working when EG is NOT hosted within a container, but this issue is only occurring when EG runs within a container - correct?

kevin-bates commented 2 years ago

Closing as this appears to be configuration-related. We can re-visit should there be concrete information otherwise. Thank you.