jupyterhub / mybinder.org-deploy

Deployment config files for mybinder.org
https://mybinder-sre.readthedocs.io/en/latest/index.html

Incident - High number of builds on GESIS server on 25 January #2905

Closed · rgaiacs closed this 2 months ago

rgaiacs commented 9 months ago

Around 5:00am UTC+1 on 25 January 2024, the GESIS server started to accumulate build pods.

[Screenshot, 2024-01-25 14:14: Grafana "1 Overview" dashboard]

The number of build pods kept increasing from 5:00am UTC+1 until 7:30am UTC+1, when it abruptly dropped.

At the same time, the number of user pods on the GESIS server decreased.

[Screenshot, 2024-01-25 14:12: Grafana "1 Overview" dashboard]

The GESIS server was NOT cordoned during this period.

[Screenshot, 2024-01-25 14:24: Grafana "Node Activity" dashboard]

CPU load on the GESIS server was average during this period.

[Screenshot, 2024-01-25 14:24: Grafana "Node Activity" dashboard]

The popular repositories before the incident were:

[Screenshot, 2024-01-25 14:28: Grafana "Pod Activity" dashboard panel]

The GESIS server looks normal now. cc @arnim
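For future reference, a quick way to confirm such an accumulation directly against the cluster would be to count build pods per node with the Kubernetes Python client. The namespace and label selector below are assumptions for illustration, not values taken from this deployment's config:

from collections import Counter

from kubernetes import client, config

config.load_kube_config()  # or load_incluster_config() when run inside the cluster
v1 = client.CoreV1Api()

# Assumed namespace and build-pod label; adjust to the actual deployment values.
pods = v1.list_namespaced_pod(
    namespace="gesis",
    label_selector="component=binderhub-build",
)
per_node = Counter(pod.spec.node_name for pod in pods.items)
for node, count in per_node.most_common():
    print(f"{node}: {count} build pods")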

arnim commented 9 months ago

Thank you for reporting :)

rgaiacs commented 9 months ago

Binder Pod

Some builds failed with:

[E 240125 04:00:29 build:669] log event not json: 'Exception ignored in: <function Application.__del__ at 0x7f95c6954ca0>\n'
[E 240125 04:00:29 build:669] log event not json: 'Exception ignored in: <function Application.__del__ at 0x7f95c6954ca0>\n'
[E 240125 04:00:29 build:669] log event not json: 'Traceback (most recent call last):\n'
[E 240125 04:00:29 build:669] log event not json: 'Traceback (most recent call last):\n'
[E 240125 04:00:29 build:669] log event not json: '  File "/usr/lib/python3.10/site-packages/traitlets/config/application.py", line 1043, in __del__\n'
[E 240125 04:00:29 build:669] log event not json: '  File "/usr/lib/python3.10/site-packages/traitlets/config/application.py", line 1043, in __del__\n'
[E 240125 04:00:29 build:669] log event not json: '  File "/usr/lib/python3.10/site-packages/traitlets/config/application.py", line 1032, in close_handlers\n'
[E 240125 04:00:29 build:669] log event not json: '  File "/usr/lib/python3.10/site-packages/traitlets/traitlets.py", line 719, in __get__\n'
[E 240125 04:00:29 build:669] log event not json: '  File "/usr/lib/python3.10/site-packages/traitlets/config/application.py", line 1032, in close_handlers\n'
[E 240125 04:00:29 build:669] log event not json: '  File "/usr/lib/python3.10/site-packages/traitlets/traitlets.py", line 719, in __get__\n'
[E 240125 04:00:29 build:669] log event not json: "TypeError: 'NoneType' object is not callable\n"
[E 240125 04:00:29 build:669] log event not json: "TypeError: 'NoneType' object is not callable\n"
[I 240125 04:00:30 build:679] Finished streaming logs of build-binder-2dexamples-2dr-974603-02c-0e
[I 240125 04:00:30 build:679] Finished streaming logs of build-binder-2dexamples-2dr-974603-02c-0e
[W 240125 04:00:31 builder:161] Stream closed while handling /binder/build/gh/binder-examples/r/HEAD?build_token=eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJleHAiOjE3MDYxNTU0OTQsImF1ZCI6ImdoL2JpbmRlci1leGFtcGxlcy9yL0hFQUQiLCJvcmlnaW4iOiJteWJpbmRlci5vcmcifQ.apED-jm0GH25vqpofjwU9AEf7_zmHnBB1SiJ5h2Vy7E

but this happens all the time.
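For context, the "log event not json" message appears when BinderHub streams the build pod's output and a line is not a JSON progress event, for example the traitlets shutdown traceback quoted above. A rough sketch of that parsing pattern (illustrative only, not the actual build.py code):

import json
import logging

log = logging.getLogger(__name__)

def iter_build_events(lines):
    # Each streamed log line is expected to be a JSON progress event;
    # anything else is reported and skipped rather than failing the build.
    for line in lines:
        try:
            yield json.loads(line)
        except ValueError:
            log.error("log event not json: %r", line)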

There were some callback errors like:

[E 240125 04:14:52 ioloop:770] Exception in callback functools.partial(<bound method IOLoop._discard_future_result of <tornado.platform.asyncio.AsyncIOMainLoop object at 0x7f432af05310>>, <Task finished name='Task-1437908' coro=<Launcher.launch.<locals>.handle_chunk() done, defined at /usr/local/lib/python3.11/site-packages/binderhub/launcher.py:279> exception=Finish()>)
    Traceback (most recent call last):
      File "/usr/local/lib/python3.11/site-packages/binderhub/builder.py", line 159, in emit
        await self.flush()
    tornado.iostream.StreamClosedError: Stream is closed

    During handling of the above exception, another exception occurred:

    Traceback (most recent call last):
      File "/usr/local/lib/python3.11/site-packages/tornado/ioloop.py", line 750, in _run_callback
        ret = callback()
              ^^^^^^^^^^
      File "/usr/local/lib/python3.11/site-packages/tornado/ioloop.py", line 774, in _discard_future_result
        future.result()
      File "/usr/local/lib/python3.11/site-packages/binderhub/launcher.py", line 291, in handle_chunk
        await event_callback(event)
      File "/usr/local/lib/python3.11/site-packages/binderhub/builder.py", line 696, in handle_progress_event
        await self.emit(
      File "/usr/local/lib/python3.11/site-packages/binderhub/builder.py", line 163, in emit
        raise Finish()
    tornado.web.Finish

that also happens all the time.
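This traceback is the usual "client disconnected mid event-stream" pattern: emit() tries to flush the response, the stream is already closed, and the handler is terminated with tornado.web.Finish. A minimal sketch of that pattern (handler and method names are illustrative, not the actual binderhub code):

from tornado import web
from tornado.iostream import StreamClosedError

class EventSourceHandler(web.RequestHandler):
    async def emit(self, data):
        self.write(f"data: {data}\n\n")
        try:
            await self.flush()
        except StreamClosedError:
            # The client went away while build progress was being streamed;
            # there is nobody left to write to, so end the request. This
            # Finish() is what surfaces in the IOLoop callback log above.
            raise web.Finish()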

What caught my attention was:

[I 240125 04:16:00 build:640] Stopping watch of build-ruby-2dnumo-2dnumo-2dnarray-adfbde-8dd-4e
[W 240125 04:16:01 log:135] 405 HEAD /binder/ (anonymous@63.143.42.251) 2.68ms
[I 240125 04:16:03 health:152] Checking registry status
[I 240125 04:16:04 health:117] Health check check_docker_registry took 0.804s
[I 240125 04:16:10 build:748] Deleting Failed build build-binder-2dexamples-2dr-974603-02c-0e (repo=https://github.com/binder-examples/r)
[I 240125 04:16:10 build:748] Deleting Failed build build-hankcs-2dpyhanlp-d3e861-f09-8b (repo=https://github.com/hankcs/pyhanlp)
[I 240125 04:16:10 build:748] Deleting Failed build build-ruby-2dnumo-2dnumo-2dnarray-adfbde-8dd-4e (repo=https://github.com/ruby-numo/numo-narray)
[I 240125 04:16:10 build:782] Deleted 3/6 build pods
[I 240125 04:16:21 health:152] Checking registry status
[I 240125 04:16:22 health:117] Health check check_docker_registry took 0.923s
[I 240125 04:16:22 log:135] 200 GET /binder/ (anonymous@63.143.42.251) 3.71ms
[I 240125 04:16:38 health:152] Checking registry status
[I 240125 04:16:39 health:117] Health check check_docker_registry took 0.791s
[I 240125 04:16:54 health:152] Checking registry status
[I 240125 04:16:54 health:222] Getting pod statistics for gesis
[I 240125 04:16:55 health:117] Health check check_docker_registry took 0.808s
[I 240125 04:17:11 health:152] Checking registry status
[I 240125 04:17:12 health:117] Health check check_docker_registry took 0.839s
[I 240125 04:17:27 health:152] Checking registry status
[I 240125 04:17:28 health:117] Health check check_docker_registry took 0.901s
[I 240125 04:17:45 health:152] Checking registry status
[I 240125 04:17:46 health:117] Health check check_docker_registry took 0.803s
[I 240125 04:18:01 health:152] Checking registry status
[I 240125 04:18:01 health:222] Getting pod statistics for gesis
[I 240125 04:18:02 health:117] Health check check_docker_registry took 0.812s
[I 240125 04:18:18 health:152] Checking registry status
[I 240125 04:18:19 health:117] Health check check_docker_registry took 0.790s
[I 240125 04:18:36 health:152] Checking registry status
[I 240125 04:18:37 health:117] Health check check_docker_registry took 1.542s
[I 240125 04:18:37 log:135] 200 GET /binder/health (anonymous@34.71.121.177) 1543.64ms
[I 240125 04:18:37 log:135] 200 GET /binder/health (anonymous@34.71.121.177) 1527.70ms
[I 240125 04:18:37 log:135] 200 GET /binder/health (anonymous@35.239.35.132) 1449.17ms
[I 240125 04:18:53 health:152] Checking registry status
[I 240125 04:18:53 health:117] Health check check_docker_registry took 0.827s
[I 240125 04:19:10 health:152] Checking registry status
[I 240125 04:19:10 health:222] Getting pod statistics for gesis
[I 240125 04:19:11 health:117] Health check check_docker_registry took 0.838s
[I 240125 04:19:16 build:563] Started build build-gesiscss-2dorc2-2dtest-2dbuild-ced273-4a3-a6

For a couple of minutes, the only entries in the log are health checks. Later, the health checks start failing:

[I 240125 04:24:35 health:117] Health check check_jupyterhub_api took 3.011s
[E 240125 04:24:35 health:54] Error checking check_jupyterhub_api
    Traceback (most recent call last):
      File "/usr/local/lib/python3.11/site-packages/binderhub/health.py", line 52, in wrapper
        res = await f(*args, **kwargs)
              ^^^^^^^^^^^^^^^^^^^^^^^^
      File "/usr/local/lib/python3.11/site-packages/binderhub/health.py", line 27, in wrapper
        return await f(*args, **kwargs)
               ^^^^^^^^^^^^^^^^^^^^^^^^
      File "/usr/local/lib/python3.11/site-packages/binderhub/health.py", line 110, in wrapped
        return await f(*args, **kwargs)
               ^^^^^^^^^^^^^^^^^^^^^^^^
      File "/usr/local/lib/python3.11/site-packages/binderhub/health.py", line 143, in check_jupyterhub_api
        await AsyncHTTPClient().fetch(hub_url + "hub/api/health", request_timeout=3)
    tornado.curl_httpclient.CurlError: HTTP 599: Operation timed out after 3000 milliseconds with 0 bytes received
[W 240125 04:24:35 health:196] Unhealthy services: [{'service': 'JupyterHub API', 'ok': False}]
[E 240125 04:24:35 log:135] 503 GET /binder/health (anonymous@35.239.35.132) 3014.89ms
[W 240125 04:24:35 health:196] Unhealthy services: [{'service': 'JupyterHub API', 'ok': False}]
[E 240125 04:24:35 log:135] 503 GET /binder/health (anonymous@34.71.121.177) 2025.55ms
[W 240125 04:24:35 health:196] Unhealthy services: [{'service': 'JupyterHub API', 'ok': False}]
[E 240125 04:24:35 log:135] 503 GET /binder/health (anonymous@35.239.35.132) 1852.56ms
[W 240125 04:24:36 health:196] Unhealthy services: [{'service': 'JupyterHub API', 'ok': False}]
[E 240125 04:24:36 log:135] 503 GET /binder/health (anonymous@35.239.35.132) 2.84ms
[W 240125 04:24:36 health:196] Unhealthy services: [{'service': 'JupyterHub API', 'ok': False}]
[E 240125 04:24:36 log:135] 503 GET /binder/health (anonymous@35.239.35.132) 3.73ms
[W 240125 04:24:36 health:196] Unhealthy services: [{'service': 'JupyterHub API', 'ok': False}]
[E 240125 04:24:36 log:135] 503 GET /binder/health (anonymous@34.71.121.177) 2.63ms

But two minutes later, the health checks start working again:

[I 240125 04:26:08 health:152] Checking registry status
[I 240125 04:26:08 health:222] Getting pod statistics for gesis
[I 240125 04:26:09 health:117] Health check check_docker_registry took 0.865s
[I 240125 04:26:26 health:152] Checking registry status
[I 240125 04:26:27 health:117] Health check check_docker_registry took 0.790s
[W 240125 04:26:41 log:135] 405 HEAD /binder/ (anonymous@63.143.42.251) 2.82ms
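As an aside, stretches like the one above where only health checks appear are easier to spot after stripping the periodic health-check chatter from a saved pod log. A throwaway filter along these lines (patterns are illustrative, not exhaustive):

# Usage: python filter_health_noise.py binder.log
import re
import sys

HEALTH_NOISE = re.compile(
    r"health:\d+\] (Checking registry status"
    r"|Health check \S+ took"
    r"|Getting pod statistics)"
)

with open(sys.argv[1]) as f:
    for line in f:
        if not HEALTH_NOISE.search(line):
            sys.stdout.write(line)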

Later, the Hub API stops working:

[I 240125 04:39:27 health:222] Getting pod statistics for gesis
[E 240125 04:39:32 launcher:122] Error accessing Hub API (using https://notebooks.gesis.org/binder/jupyter/hub/api/users/hanlpbot-hanlp-binder-xiqbyxpi/server/progress): HTTP 599: HTTP/2 stream 0 was not closed cleanly: INTERNAL_ERROR (err 2)
[E 240125 04:39:32 launcher:122] Error accessing Hub API (using https://notebooks.gesis.org/binder/jupyter/hub/api/users/scikit-learn-scikit-learn-2eekxsjd/server/progress): HTTP 599: HTTP/2 stream 0 was not closed cleanly: INTERNAL_ERROR (err 2)
[E 240125 04:39:32 launcher:122] Error accessing Hub API (using https://notebooks.gesis.org/binder/jupyter/hub/api/users/ucb-ee290c-tape-chisel-bootcamp-ju49jgy7/server/progress): HTTP 599: HTTP/2 stream 0 was not closed cleanly: INTERNAL_ERROR (err 2)
[E 240125 04:39:32 launcher:122] Error accessing Hub API (using https://notebooks.gesis.org/binder/jupyter/hub/api/users/geo-python-site-9ojs8hgc/server/progress): HTTP 599: HTTP/2 stream 0 was not closed cleanly: INTERNAL_ERROR (err 2)
[E 240125 04:39:32 launcher:122] Error accessing Hub API (using https://notebooks.gesis.org/binder/jupyter/hub/api/users/hanlpbot-hanlp-binder-mca1cur4/server/progress): HTTP 599: HTTP/2 stream 0 was not closed cleanly: INTERNAL_ERROR (err 2)
[E 240125 04:39:32 launcher:122] Error accessing Hub API (using https://notebooks.gesis.org/binder/jupyter/hub/api/users/ucb-ee290c-tape-chisel-bootcamp-6bx9viqw/server/progress): HTTP 599: HTTP/2 stream 0 was not closed cleanly: INTERNAL_ERROR (err 2)
[E 240125 04:39:32 launcher:122] Error accessing Hub API (using https://notebooks.gesis.org/binder/jupyter/hub/api/users/ipython-ipython-in-depth-ll82qv4v/server/progress): HTTP 599: HTTP/2 stream 0 was not closed cleanly: INTERNAL_ERROR (err 2)
[E 240125 04:39:32 launcher:122] Error accessing Hub API (using https://notebooks.gesis.org/binder/jupyter/hub/api/users/ucb-ee290c-tape-chisel-bootcamp-i0bt3n4u/server/progress): HTTP 599: HTTP/2 stream 0 was not closed cleanly: INTERNAL_ERROR (err 2)
[E 240125 04:39:32 launcher:122] Error accessing Hub API (using https://notebooks.gesis.org/binder/jupyter/hub/api/users/binder-examples-requirements-9qfd4sec/server/progress): HTTP 599: HTTP/2 stream 0 was not closed cleanly: INTERNAL_ERROR (err 2)
[I 240125 04:39:32 health:117] Health check check_jupyterhub_api took 302.220s

@arnim I believe something on the GESIS network might have caused this. I'm ruling out the Kubernetes network because we didn't change anything there.
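One way to narrow this down next time would be to repeat the exact request the health check makes from both inside and outside the GESIS network and compare latencies. The URL below is pieced together from the log lines above and is an assumption; this is a sketch, not something that was run during the incident:

import asyncio

from tornado.httpclient import AsyncHTTPClient

# Assumed health endpoint, inferred from hub_url + "hub/api/health" in the traceback above.
HUB_HEALTH = "https://notebooks.gesis.org/binder/jupyter/hub/api/health"

async def probe(attempts=10, timeout=3):
    client = AsyncHTTPClient()
    for i in range(attempts):
        try:
            resp = await client.fetch(HUB_HEALTH, request_timeout=timeout)
            print(f"{i}: HTTP {resp.code} in {resp.request_time:.2f}s")
        except Exception as exc:  # e.g. HTTP 599 timeouts as seen in the logs
            print(f"{i}: {exc}")
        await asyncio.sleep(1)

asyncio.run(probe())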

Build Pod

I didn't find anything relevant.

rgaiacs commented 2 months ago

This is old and I can no longer look at the logs.