jupyterhub / zero-to-jupyterhub-k8s

Helm Chart & Documentation for deploying JupyterHub on Kubernetes
https://zero-to-jupyterhub.readthedocs.io
Other
1.52k stars 791 forks source link

Hub pod not restarting after cleaning up #1051

Closed dalonlobo closed 3 years ago

dalonlobo commented 5 years ago

Hi Team, the hub pod get hung, does not load sometimes after the following message in the log:

image

I need to delete the hub pod and it works again.

Thanks for your help in advance.

dalonlobo commented 5 years ago

Attaching bit more of Traceback from the hub:


[I 2018-12-06 08:58:32.517 JupyterHub log:158] 200 GET /hub/api/users (cull-idle@127.0.0.1) 59.68ms                                                                                                       [238/575]
[W 181206 08:58:32 cull_idle_servers:128] Not culling server gayathri.ramanathan1 with pending spawn                                                                                                               [I 2018-12-06 08:58:45.977 JupyterHub base:951] Admin dalon requesting spawn on behalf of gayathri.ramanathan1
[I 2018-12-06 08:58:45.978 JupyterHub base:1012] Pending spawn for gayathri.ramanathan1 didn't finish in 0.0 seconds
[I 2018-12-06 08:58:45.978 JupyterHub base:1018] gayathri.ramanathan1 is pending spawn
[I 2018-12-06 08:59:32.204 JupyterHub proxy:301] Checking routes
[I 2018-12-06 09:00:32.209 JupyterHub proxy:301] Checking routes
[I 2018-12-06 09:00:36.243 JupyterHub base:951] Admin dalon requesting spawn on behalf of gayathri.ramanathan1
[I 2018-12-06 09:00:36.245 JupyterHub base:1012] Pending spawn for gayathri.ramanathan1 didn't finish in 0.0 seconds
[I 2018-12-06 09:00:36.245 JupyterHub base:1018] gayathri.ramanathan1 is pending spawn

[W 2018-12-06 09:01:09.499 JupyterHub base:714] User rajeev.arora is slow to start (timeout=0)
[I 2018-12-06 09:01:09.501 JupyterHub log:158] 202 POST /hub/api/users/rajeev.arora/server (dalon@10.240.0.10) 1053.87ms
[W 2018-12-06 09:01:31.059 JupyterHub user:468] gayathri.ramanathan1's server failed to start in 240 seconds, giving up
[E 2018-12-06 09:01:31.149 JupyterHub gen:974] Exception in Future <Task finished coro=<BaseHandler.spawn_single_user.<locals>.finish_user_spawn() done, defined at /usr/local/lib/python3.6/dist-packages/jupyterh
ub/handlers/base.py:619> exception=TimeoutError('Timeout',)> after timeout
    Traceback (most recent call last):
      File "/usr/local/lib/python3.6/dist-packages/tornado/gen.py", line 970, in error_callback
        future.result()
      File "/usr/local/lib/python3.6/dist-packages/jupyterhub/handlers/base.py", line 626, in finish_user_spawn
        await spawn_future
      File "/usr/local/lib/python3.6/dist-packages/jupyterhub/user.py", line 486, in spawn
        raise e
      File "/usr/local/lib/python3.6/dist-packages/jupyterhub/user.py", line 406, in spawn
        url = await gen.with_timeout(timedelta(seconds=spawner.start_timeout), f)
    tornado.util.TimeoutError: Timeout

[E 2018-12-06 09:01:31.151 JupyterHub gen:974] Exception in Future <Task finished coro=<BaseHandler.spawn_single_user.<locals>.finish_user_spawn() done, defined at /usr/local/lib/python3.6/dist-packages/jupyterh
ub/handlers/base.py:619> exception=TimeoutError('Timeout',)> after timeout
    Traceback (most recent call last):
      File "/usr/local/lib/python3.6/dist-packages/tornado/gen.py", line 970, in error_callback
        future.result()
      File "/usr/local/lib/python3.6/dist-packages/tornado/gen.py", line 970, in error_callback
        future.result()
      File "/usr/local/lib/python3.6/dist-packages/jupyterhub/handlers/base.py", line 626, in finish_user_spawn
        await spawn_future
      File "/usr/local/lib/python3.6/dist-packages/jupyterhub/user.py", line 486, in spawn
        raise e
      File "/usr/local/lib/python3.6/dist-packages/jupyterhub/user.py", line 406, in spawn
        url = await gen.with_timeout(timedelta(seconds=spawner.start_timeout), f)
    tornado.util.TimeoutError: Timeout

[E 2018-12-06 09:01:31.151 JupyterHub gen:974] Exception in Future <Task finished coro=<BaseHandler.spawn_single_user.<locals>.finish_user_spawn() done, defined at /usr/local/lib/python3.6/dist-packages/jupyterh
ub/handlers/base.py:619> exception=TimeoutError('Timeout',)> after timeout
    Traceback (most recent call last):
      File "/usr/local/lib/python3.6/dist-packages/tornado/gen.py", line 970, in error_callback

.........

ub/handlers/base.py:619> exception=TimeoutError('Timeout',)> after timeout                                                                                                                                [192/575]    Traceback (most recent call last):
      File "/usr/local/lib/python3.6/dist-packages/tornado/gen.py", line 970, in error_callback
        future.result()
      File "/usr/local/lib/python3.6/dist-packages/tornado/gen.py", line 970, in error_callback
        future.result()
      File "/usr/local/lib/python3.6/dist-packages/tornado/gen.py", line 970, in error_callback
        future.result()
      File "/usr/local/lib/python3.6/dist-packages/jupyterhub/handlers/base.py", line 626, in finish_user_spawn
        await spawn_future
      File "/usr/local/lib/python3.6/dist-packages/jupyterhub/user.py", line 486, in spawn
        raise e
      File "/usr/local/lib/python3.6/dist-packages/jupyterhub/user.py", line 406, in spawn
        url = await gen.with_timeout(timedelta(seconds=spawner.start_timeout), f)
    tornado.util.TimeoutError: Timeout

[E 2018-12-06 09:01:31.152 JupyterHub gen:974] Exception in Future <Task finished coro=<BaseHandler.spawn_single_user.<locals>.finish_user_spawn() done, defined at /usr/local/lib/python3.6/dist-packages/jupyterh
ub/handlers/base.py:619> exception=TimeoutError('Timeout',)> after timeout
    Traceback (most recent call last):
      File "/usr/local/lib/python3.6/dist-packages/tornado/gen.py", line 970, in error_callback
        future.result()
      File "/usr/local/lib/python3.6/dist-packages/tornado/gen.py", line 970, in error_callback
        future.result()
      File "/usr/local/lib/python3.6/dist-packages/tornado/gen.py", line 970, in error_callback
        future.result()
      File "/usr/local/lib/python3.6/dist-packages/jupyterhub/handlers/base.py", line 626, in finish_user_spawn
        await spawn_future
      File "/usr/local/lib/python3.6/dist-packages/jupyterhub/user.py", line 486, in spawn
        raise e
      File "/usr/local/lib/python3.6/dist-packages/jupyterhub/user.py", line 406, in spawn
        url = await gen.with_timeout(timedelta(seconds=spawner.start_timeout), f)
    tornado.util.TimeoutError: Timeout

[W 2018-12-06 09:01:31.153 JupyterHub users:439] Stream closed while handling /hub/api/users/gayathri.ramanathan1/server/progress
[W 2018-12-06 09:01:31.153 JupyterHub users:439] Stream closed while handling /hub/api/users/gayathri.ramanathan1/server/progress
[I 2018-12-06 09:01:31.154 JupyterHub log:158] 200 GET /hub/api/users/gayathri.ramanathan1/server/progress (dalon@10.240.0.10) 237295.12ms
[I 2018-12-06 09:01:31.154 JupyterHub log:158] 200 GET /hub/api/users/gayathri.ramanathan1/server/progress (dalon@10.240.0.10) 164477.42ms
[I 2018-12-06 09:01:31.156 JupyterHub log:158] 200 GET /hub/api/users/gayathri.ramanathan1/server/progress (dalon@10.240.0.10) 54213.62ms
[I 2018-12-06 09:01:32.205 JupyterHub proxy:301] Checking routes

[I 2018-12-06 09:02:02.752 JupyterHub base:951] Admin dalon requesting spawn on behalf of gayathri.ramanathan1
[E 2018-12-06 09:02:02.753 JupyterHub base:1001] Preventing implicit spawn for gayathri.ramanathan1 because last spawn failed: Timeout

......

[I 2018-12-06 09:03:06.522 JupyterHub log:158] 200 GET /hub/admin (dalon@10.240.0.10) 43.51ms                                                                                                                      [W 2018-12-06 09:03:09.847 JupyterHub base:714] User satadru.kundu is slow to start (timeout=0)
[I 2018-12-06 09:03:09.851 JupyterHub log:158] 202 POST /hub/api/users/satadru.kundu/server (dalon@10.240.0.10) 142.19ms
[I 2018-12-06 09:03:11.395 JupyterHub log:158] 302 GET /user/satadru.kundu/ -> /hub/user/satadru.kundu/ (@10.240.0.10) 0.64ms
[I 2018-12-06 09:03:11.694 JupyterHub base:951] Admin dalon requesting spawn on behalf of satadru.kundu
[I 2018-12-06 09:03:11.695 JupyterHub base:1012] Pending spawn for satadru.kundu didn't finish in 0.0 seconds
[I 2018-12-06 09:03:11.695 JupyterHub base:1018] satadru.kundu is pending spawn
[I 2018-12-06 09:03:11.697 JupyterHub log:158] 200 GET /hub/user/satadru.kundu/ (dalon@10.240.0.10) 32.79ms
[I 2018-12-06 09:03:32.205 JupyterHub proxy:301] Checking routes
[I 2018-12-06 09:04:32.205 JupyterHub proxy:301] Checking routes
[W 2018-12-06 09:05:09.500 JupyterHub user:468] rajeev.arora's server failed to start in 240 seconds, giving up
[W 2018-12-06 09:05:09.573 JupyterHub base:689] 2 consecutive spawns failed.  Hub will exit if failure count reaches 3 before succeeding
[E 2018-12-06 09:05:09.573 JupyterHub gen:974] Exception in Future <Task finished coro=<BaseHandler.spawn_single_user.<locals>.finish_user_spawn() done, defined at /usr/local/lib/python3.6/dist-packages/jupyter$ub/handlers/base.py:619> exception=TimeoutError('Timeout',)> after timeout
    Traceback (most recent call last):
      File "/usr/local/lib/python3.6/dist-packages/tornado/gen.py", line 970, in error_callback
        future.result()
      File "/usr/local/lib/python3.6/dist-packages/jupyterhub/handlers/base.py", line 626, in finish_user_spawn
        await spawn_future
      File "/usr/local/lib/python3.6/dist-packages/jupyterhub/user.py", line 486, in spawn
        raise e
      File "/usr/local/lib/python3.6/dist-packages/jupyterhub/user.py", line 406, in spawn
        url = await gen.with_timeout(timedelta(seconds=spawner.start_timeout), f)
    tornado.util.TimeoutError: Timeout

[I 2018-12-06 09:05:32.205 JupyterHub proxy:301] Checking routes
[W 2018-12-06 09:06:28.994 JupyterHub user:468] gayathri.ramanathan1's server failed to start in 240 seconds, giving up
[C 2018-12-06 09:06:29.048 JupyterHub base:693] Aborting due to 3 consecutive spawn failures
[E 2018-12-06 09:06:29.048 JupyterHub gen:974] Exception in Future <Task finished coro=<BaseHandler.spawn_single_user.<locals>.finish_user_spawn() done, defined at /usr/local/lib/python3.6/dist-packages/jupyter$ub/handlers/base.py:619> exception=TimeoutError('Timeout',)> after timeout
    Traceback (most recent call last):
      File "/usr/local/lib/python3.6/dist-packages/tornado/gen.py", line 970, in error_callback
        future.result()
      File "/usr/local/lib/python3.6/dist-packages/jupyterhub/handlers/base.py", line 626, in finish_user_spawn
        await spawn_future
      File "/usr/local/lib/python3.6/dist-packages/jupyterhub/user.py", line 486, in spawn
        raise e
      File "/usr/local/lib/python3.6/dist-packages/jupyterhub/user.py", line 406, in spawn
        url = await gen.with_timeout(timedelta(seconds=spawner.start_timeout), f)
    tornado.util.TimeoutError: Timeout

[E 2018-12-06 09:06:29.049 JupyterHub gen:974] Exception in Future <Task finished coro=<BaseHandler.spawn_single_user.<locals>.finish_user_spawn() done, defined at /usr/local/lib/python3.6/dist-packages/jupyter$
ub/handlers/base.py:619> exception=TimeoutError('Timeout',)> after timeout
    Traceback (most recent call last):
      File "/usr/local/lib/python3.6/dist-packages/tornado/gen.py", line 970, in error_callback
        future.result()
      File "/usr/local/lib/python3.6/dist-packages/tornado/gen.py", line 970, in error_callback
        future.result()
      File "/usr/local/lib/python3.6/dist-packages/jupyterhub/handlers/base.py", line 626, in finish_user_spawn
        await spawn_future
      File "/usr/local/lib/python3.6/dist-packages/jupyterhub/user.py", line 486, in spawn
        raise e
      File "/usr/local/lib/python3.6/dist-packages/jupyterhub/user.py", line 406, in spawn
        url = await gen.with_timeout(timedelta(seconds=spawner.start_timeout), f)
    tornado.util.TimeoutError: Timeout

[W 2018-12-06 09:06:29.050 JupyterHub users:439] Stream closed while handling /hub/api/users/gayathri.ramanathan1/server/progress
[I 2018-12-06 09:06:29.050 JupyterHub log:158] 200 GET /hub/api/users/gayathri.ramanathan1/server/progress (dalon@10.240.0.10) 236791.70ms
[I 2018-12-06 09:06:31.049 JupyterHub app:1695] Cleaning up 1 services...
[I 2018-12-06 09:06:31.116 JupyterHub app:1707] Leaving single-user servers running
[I 2018-12-06 09:06:31.116 JupyterHub app:1715] I didn't start the proxy, I can't clean it up
[I 2018-12-06 09:06:31.116 JupyterHub app:1733] ...done```
dalonlobo commented 5 years ago

I'm using AKS, cluster version 1.11.5

eric-leblouch commented 5 years ago

Got the same issue also on AKS 1.11.5

Needed to make to trigger the server and wait for the timeout twice before it could be started again. It happened after the automatic culling of the notebook, the jupyter instance stays up for some days after the culling. Deleting the pod or asking the hub to start the pod and then waiting for the timeout twice always has allowed me to start the pod at the next try, but users do not like to loose 15 minutes every morning to start their session again.

dalonlobo commented 5 years ago

@eric-leblouch I used a lower cluster version, and it works like a charm no. AKS 1.8.14

eric-leblouch commented 5 years ago

@dalonlobo Thanks for the workaround, but since the JupyterHub is not the only chart installed and that some other charts needs to use a version 1.10 or more, I cannot revert to an older version right now. And I would rather not use two different AKS since I have made some integrations between the charts using internal services that I do not want to publish outside the cluster.

consideRatio commented 3 years ago

Ah, AKS stuff yikes, update to 0.9.0-n311.ha2c3ef80 or more recent then I think this is resolved with kubespawner 0.14.0 that improved some parts to be more reliable.