jupyterhub / mybinder.org-deploy

Deployment config files for mybinder.org
https://mybinder-sre.readthedocs.io/en/latest/index.html
BSD 3-Clause "New" or "Revised" License

OVH hub restarting ~2000 times in ~20 days - 5 consecutive failed startups #2642

Open consideRatio opened 1 year ago

consideRatio commented 1 year ago

[image attachment]

[W 2023-05-24 18:26:21.361 JupyterHub base:1030] 4 consecutive spawns failed.  Hub will exit if failure count reaches 5 before succeeding
[E 2023-05-24 18:26:21.361 JupyterHub gen:630] Exception in Future <Task finished name='Task-11612' coro=<BaseHandler.spawn_single_user.<locals>.finish_user_spawn() done, defined at /usr/local/lib/python3.9/site-packages/jupyterhub/handlers/base.py:954> exception=TimeoutError('Timeout')> after timeout
    Traceback (most recent call last):
      File "/usr/local/lib/python3.9/site-packages/tornado/gen.py", line 625, in error_callback
        future.result()
      File "/usr/local/lib/python3.9/site-packages/jupyterhub/handlers/base.py", line 961, in finish_user_spawn
        await spawn_future
      File "/usr/local/lib/python3.9/site-packages/jupyterhub/user.py", line 850, in spawn
        raise e
      File "/usr/local/lib/python3.9/site-packages/jupyterhub/user.py", line 747, in spawn
        url = await gen.with_timeout(timedelta(seconds=spawner.start_timeout), f)
    asyncio.exceptions.TimeoutError: Timeout

[I 2023-05-24 18:26:21.362 JupyterHub log:186] 200 GET /hub/api/users/eviljasi-arbeitspaket11-c0shamc0/server/progress (binder@141.94.214.128) 430943.94ms
[C 2023-05-24 18:26:21.508 JupyterHub base:1037] Aborting due to 5 consecutive spawn failures
[E 2023-05-24 18:26:21.508 JupyterHub gen:630] Exception in Future <Task finished name='Task-12697' coro=<BaseHandler.spawn_single_user.<locals>.finish_user_spawn() done, defined at /usr/local/lib/python3.9/site-packages/jupyterhub/handlers/base.py:954> exception=TimeoutError('Timeout')> after timeout
    Traceback (most recent call last):
      File "/usr/local/lib/python3.9/site-packages/tornado/gen.py", line 625, in error_callback
        future.result()
      File "/usr/local/lib/python3.9/site-packages/jupyterhub/handlers/base.py", line 961, in finish_user_spawn
        await spawn_future
      File "/usr/local/lib/python3.9/site-packages/jupyterhub/user.py", line 850, in spawn
        raise e
      File "/usr/local/lib/python3.9/site-packages/jupyterhub/user.py", line 747, in spawn
        url = await gen.with_timeout(timedelta(seconds=spawner.start_timeout), f)
    asyncio.exceptions.TimeoutError: Timeout

[I 2023-05-24 18:26:21.509 JupyterHub log:186] 200 GET /hub/api/users/jupyterlab-jupyterlab-demo-a6r1ksnc/server/progress (binder@141.94.214.128) 344035.05ms
[I 2023-05-24 18:26:22.266 JupyterHub roles:238] Adding role user for User: jupyterlab-jupyterlab-demo-zz2ecopn
[I 2023-05-24 18:26:22.295 JupyterHub log:186] 201 POST /hub/api/users/jupyterlab-jupyterlab-demo-zz2ecopn (binder@141.94.214.128) 43.89ms
[I 2023-05-24 18:26:22.349 JupyterHub provider:651] Creating oauth client jupyterhub-user-jupyterlab-jupyterlab-demo-zz2ecopn
[W 2023-05-24 18:26:22.391 JupyterHub spawner:3071] Ignoring unrecognized KubeSpawner user_options: binder_launch_host, binder_persistent_request, binder_ref_url, binder_request, image, repo_url, token
[W 2023-05-24 18:26:22.410 JupyterHub utils:77] 'pod.spec.restart_policy' current value: 'OnFailure' is overridden with 'Never', which is the value of 'extra_pod_config.restart_policy'.
[I 2023-05-24 18:26:22.411 JupyterHub log:186] 202 POST /hub/api/users/jupyterlab-jupyterlab-demo-zz2ecopn/servers/ (binder@141.94.214.128) 106.82ms
[I 2023-05-24 18:26:22.411 JupyterHub spawner:2469] Attempting to create pod jupyter-jupyterlab-2djupyterlab-2ddemo-2dzz2ecopn, with timeout 3
Task was destroyed but it is pending!
task: <Task pending name='Task-3' coro=<shared_client.<locals>.close_client_task() running at /usr/local/lib/python3.9/site-packages/kubespawner/clients.py:58> wait_for=<Future pending cb=[<TaskWakeupMethWrapper object at 0x7fac349b91f0>()]>>
Exception ignored in: <coroutine object shared_client.<locals>.close_client_task at 0x7fac35dab440>
RuntimeError: coroutine ignored GeneratorExit
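
The abort behavior in the log above is driven by two JupyterHub settings: Spawner.start_timeout (how long a single spawn may take before it is counted as a failure) and JupyterHub.consecutive_failure_limit (how many failed spawns in a row make the hub exit so it can be restarted in a clean state). A minimal sketch of tuning them in a plain jupyterhub_config.py; in a Helm-based deployment like this one they would be set through chart configuration instead, and the numbers below are illustrative only, not the mybinder.org values:

c = get_config()  # noqa

# Seconds a single spawn may take before it is treated as failed; the
# asyncio TimeoutError in the tracebacks above is raised when this is exceeded.
c.Spawner.start_timeout = 300  # illustrative value

# After this many consecutive failed spawns the hub exits on purpose
# ("Aborting due to 5 consecutive spawn failures"), relying on the
# orchestrator (here: Kubernetes) to restart it.
c.JupyterHub.consecutive_failure_limit = 5  # illustrative value
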
consideRatio commented 1 year ago

It seems that a lot of pods are stuck pulling the image, neither erroring nor succeeding. Even pods in a Terminating state aren't terminating, because they are still stuck pulling.

kubectl describe pod jupyter-binderhub-2dci-2dre-2dimal-2ddockerfile-2di190dym9

Events:
  Type    Reason     Age    From                 Message
  ----    ------     ----   ----                 -------
  Normal  Scheduled  2m38s  ovh2-user-scheduler  Successfully assigned ovh2/jupyter-binderhub-2dci-2dre-2dimal-2ddockerfile-2di190dym9 to user-202211a-node-6f699a
  Normal  Pulled     2m38s  kubelet              Container image "jupyterhub/mybinder.org-tc-init:2020.12.4-0.dev.git.4289.h140cef52" already present on machine
  Normal  Created    2m38s  kubelet              Created container tc-init
  Normal  Started    2m37s  kubelet              Started container tc-init
  Normal  Pulling    2m37s  kubelet              Pulling image "2lmrrh8f.gra7.container-registry.ovh.net/mybinder-builds/r2d-g5b5b759binderhub-2dci-2drepos-2dcached-2dminimal-2ddockerfile-c90b2b:596b52f10efb0c9befc0c4ae850cc5175297d71c"
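
A quick way to gauge how widespread the stuck pulls are is to list every pod in the namespace whose containers are still in a waiting state, together with the kubelet's waiting reason. A sketch using the official kubernetes Python client; the namespace name ovh2 is taken from the describe output above, and cluster credentials are assumed to be available locally:

# List pods whose containers are still waiting (e.g. stuck pulling an image).
# Sketch only: assumes `pip install kubernetes` and kubectl access to the cluster.
from kubernetes import client, config

config.load_kube_config()  # or config.load_incluster_config() when run in-cluster
v1 = client.CoreV1Api()

for pod in v1.list_namespaced_pod(namespace="ovh2").items:
    for status in pod.status.container_statuses or []:
        waiting = status.state.waiting
        if waiting is not None:
            print(pod.metadata.name, status.name, waiting.reason or "waiting")
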
minrk commented 1 year ago

OVH harbor registry appears to be having stability issues again, which I think is the ultimate cause. I've contacted OVH support about it.

I think we should consider moving OVH to an external registry, e.g. quay.io. Downside: the images would be public, so we would need to be more proactive about cleaning them up and better support deletion requests, because statements like "if you unpublish the ref, your files are inaccessible" are not true at all if the build cache is public.
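
For the "proactive cleaning" part, one option would be to script tag deletion against the standard Docker Registry HTTP API v2, which Harbor and quay.io both speak: resolve a tag to its manifest digest, then delete the manifest by digest. A rough sketch only; the registry URL, repository, tag, and credentials below are placeholders, token-based auth flows and the retention policy (what counts as "old") are not addressed:

# Delete one tag from a v2 registry by resolving it to a manifest digest.
# Rough sketch: REGISTRY/REPO/TAG and AUTH are placeholders, error handling is minimal.
import requests

REGISTRY = "https://registry.example.org"   # placeholder
REPO = "mybinder-builds/some-image"         # placeholder
TAG = "some-old-tag"                        # placeholder
AUTH = ("user", "token")                    # placeholder credentials

headers = {"Accept": "application/vnd.docker.distribution.manifest.v2+json"}

# HEAD the manifest to read its digest from the Docker-Content-Digest header.
resp = requests.head(f"{REGISTRY}/v2/{REPO}/manifests/{TAG}", headers=headers, auth=AUTH)
resp.raise_for_status()
digest = resp.headers["Docker-Content-Digest"]

# Manifests are deleted by digest, not by tag.
requests.delete(f"{REGISTRY}/v2/{REPO}/manifests/{digest}", auth=AUTH).raise_for_status()
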