Closed rgaiacs closed 2 months ago
Thank you for reporting :)
Some builds failed with
[E 240125 04:00:29 build:669] log event not json: 'Exception ignored in: <function Application.__del__ at 0x7f95c6954ca0>\n'
[E 240125 04:00:29 build:669] log event not json: 'Exception ignored in: <function Application.__del__ at 0x7f95c6954ca0>\n'
[E 240125 04:00:29 build:669] log event not json: 'Traceback (most recent call last):\n'
[E 240125 04:00:29 build:669] log event not json: 'Traceback (most recent call last):\n'
[E 240125 04:00:29 build:669] log event not json: ' File "/usr/lib/python3.10/site-packages/traitlets/config/application.py", line 1043, in __del__\n'
[E 240125 04:00:29 build:669] log event not json: ' File "/usr/lib/python3.10/site-packages/traitlets/config/application.py", line 1043, in __del__\n'
[E 240125 04:00:29 build:669] log event not json: ' File "/usr/lib/python3.10/site-packages/traitlets/config/application.py", line 1032, in close_handlers\n'
[E 240125 04:00:29 build:669] log event not json: ' File "/usr/lib/python3.10/site-packages/traitlets/traitlets.py", line 719, in __get__\n'
[E 240125 04:00:29 build:669] log event not json: ' File "/usr/lib/python3.10/site-packages/traitlets/config/application.py", line 1032, in close_handlers\n'
[E 240125 04:00:29 build:669] log event not json: ' File "/usr/lib/python3.10/site-packages/traitlets/traitlets.py", line 719, in __get__\n'
[E 240125 04:00:29 build:669] log event not json: "TypeError: 'NoneType' object is not callable\n"
[E 240125 04:00:29 build:669] log event not json: "TypeError: 'NoneType' object is not callable\n"
[I 240125 04:00:30 build:679] Finished streaming logs of build-binder-2dexamples-2dr-974603-02c-0e
[I 240125 04:00:30 build:679] Finished streaming logs of build-binder-2dexamples-2dr-974603-02c-0e
[W 240125 04:00:31 builder:161] Stream closed while handling /binder/build/gh/binder-examples/r/HEAD?build_token=eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJleHAiOjE3MDYxNTU0OTQsImF1ZCI6ImdoL2JpbmRlci1leGFtcGxlcy9yL0hFQUQiLCJvcmlnaW4iOiJteWJpbmRlci5vcmcifQ.apED-jm0GH25vqpofjwU9AEf7_zmHnBB1SiJ5h2Vy7E
but this happens all the time.
Some callback error like
[E 240125 04:14:52 ioloop:770] Exception in callback functools.partial(<bound method IOLoop._discard_future_result of <tornado.platform.asyncio.AsyncIOMainLoop object at 0x7f432af05310>>, <Task finished name='Task-1437
908' coro=<Launcher.launch.<locals>.handle_chunk() done, defined at /usr/local/lib/python3.11/site-packages/binderhub/launcher.py:279> exception=Finish()>)
Traceback (most recent call last):
File "/usr/local/lib/python3.11/site-packages/binderhub/builder.py", line 159, in emit
await self.flush()
tornado.iostream.StreamClosedError: Stream is closed
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/usr/local/lib/python3.11/site-packages/tornado/ioloop.py", line 750, in _run_callback
ret = callback()
^^^^^^^^^^
File "/usr/local/lib/python3.11/site-packages/tornado/ioloop.py", line 774, in _discard_future_result
future.result()
File "/usr/local/lib/python3.11/site-packages/binderhub/launcher.py", line 291, in handle_chunk
await event_callback(event)
File "/usr/local/lib/python3.11/site-packages/binderhub/builder.py", line 696, in handle_progress_event
await self.emit(
File "/usr/local/lib/python3.11/site-packages/binderhub/builder.py", line 163, in emit
raise Finish()
tornado.web.Finish
that also happens all the time.
What catched my attention was
[I 240125 04:16:00 build:640] Stopping watch of build-ruby-2dnumo-2dnumo-2dnarray-adfbde-8dd-4e
[W 240125 04:16:01 log:135] 405 HEAD /binder/ (anonymous@63.143.42.251) 2.68ms
[I 240125 04:16:03 health:152] Checking registry status
[I 240125 04:16:04 health:117] Health check check_docker_registry took 0.804s
[I 240125 04:16:10 build:748] Deleting Failed build build-binder-2dexamples-2dr-974603-02c-0e (repo=https://github.com/binder-examples/r)
[I 240125 04:16:10 build:748] Deleting Failed build build-hankcs-2dpyhanlp-d3e861-f09-8b (repo=https://github.com/hankcs/pyhanlp)
[I 240125 04:16:10 build:748] Deleting Failed build build-ruby-2dnumo-2dnumo-2dnarray-adfbde-8dd-4e (repo=https://github.com/ruby-numo/numo-narray)
[I 240125 04:16:10 build:782] Deleted 3/6 build pods
[I 240125 04:16:21 health:152] Checking registry status
[I 240125 04:16:22 health:117] Health check check_docker_registry took 0.923s
[I 240125 04:16:22 log:135] 200 GET /binder/ (anonymous@63.143.42.251) 3.71ms
[I 240125 04:16:38 health:152] Checking registry status
[I 240125 04:16:39 health:117] Health check check_docker_registry took 0.791s
[I 240125 04:16:54 health:152] Checking registry status
[I 240125 04:16:54 health:222] Getting pod statistics for gesis
[I 240125 04:16:55 health:117] Health check check_docker_registry took 0.808s
[I 240125 04:17:11 health:152] Checking registry status
[I 240125 04:17:12 health:117] Health check check_docker_registry took 0.839s
[I 240125 04:17:27 health:152] Checking registry status
[I 240125 04:17:28 health:117] Health check check_docker_registry took 0.901s
[I 240125 04:17:45 health:152] Checking registry status
[I 240125 04:17:46 health:117] Health check check_docker_registry took 0.803s
[I 240125 04:18:01 health:152] Checking registry status
[I 240125 04:18:01 health:222] Getting pod statistics for gesis
[I 240125 04:18:02 health:117] Health check check_docker_registry took 0.812s
[I 240125 04:18:18 health:152] Checking registry status
[I 240125 04:18:19 health:117] Health check check_docker_registry took 0.790s
[I 240125 04:18:36 health:152] Checking registry status
[I 240125 04:18:37 health:117] Health check check_docker_registry took 1.542s
[I 240125 04:18:37 log:135] 200 GET /binder/health (anonymous@34.71.121.177) 1543.64ms
[I 240125 04:18:37 log:135] 200 GET /binder/health (anonymous@34.71.121.177) 1527.70ms
[I 240125 04:18:37 log:135] 200 GET /binder/health (anonymous@35.239.35.132) 1449.17ms
[I 240125 04:18:53 health:152] Checking registry status
[I 240125 04:18:53 health:117] Health check check_docker_registry took 0.827s
[I 240125 04:19:10 health:152] Checking registry status
[I 240125 04:19:10 health:222] Getting pod statistics for gesis
[I 240125 04:19:11 health:117] Health check check_docker_registry took 0.838s
[I 240125 04:19:16 build:563] Started build build-gesiscss-2dorc2-2dtest-2dbuild-ced273-4a3-a6
For a couple of minutes, the only entry in the log is health check. And later the health check start failing
[I 240125 04:24:35 health:117] Health check check_jupyterhub_api took 3.011s
[E 240125 04:24:35 health:54] Error checking check_jupyterhub_api
Traceback (most recent call last):
File "/usr/local/lib/python3.11/site-packages/binderhub/health.py", line 52, in wrapper
res = await f(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/site-packages/binderhub/health.py", line 27, in wrapper
return await f(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/site-packages/binderhub/health.py", line 110, in wrapped
return await f(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/site-packages/binderhub/health.py", line 143, in check_jupyterhub_api
await AsyncHTTPClient().fetch(hub_url + "hub/api/health", request_timeout=3)
tornado.curl_httpclient.CurlError: HTTP 599: Operation timed out after 3000 milliseconds with 0 bytes received
[W 240125 04:24:35 health:196] Unhealthy services: [{'service': 'JupyterHub API', 'ok': False}]
[E 240125 04:24:35 log:135] 503 GET /binder/health (anonymous@35.239.35.132) 3014.89ms
[W 240125 04:24:35 health:196] Unhealthy services: [{'service': 'JupyterHub API', 'ok': False}]
[E 240125 04:24:35 log:135] 503 GET /binder/health (anonymous@34.71.121.177) 2025.55ms
[W 240125 04:24:35 health:196] Unhealthy services: [{'service': 'JupyterHub API', 'ok': False}]
[E 240125 04:24:35 log:135] 503 GET /binder/health (anonymous@35.239.35.132) 1852.56ms
[W 240125 04:24:36 health:196] Unhealthy services: [{'service': 'JupyterHub API', 'ok': False}]
[E 240125 04:24:36 log:135] 503 GET /binder/health (anonymous@35.239.35.132) 2.84ms
[W 240125 04:24:36 health:196] Unhealthy services: [{'service': 'JupyterHub API', 'ok': False}]
[E 240125 04:24:36 log:135] 503 GET /binder/health (anonymous@35.239.35.132) 3.73ms
[W 240125 04:24:36 health:196] Unhealthy services: [{'service': 'JupyterHub API', 'ok': False}]
[E 240125 04:24:36 log:135] 503 GET /binder/health (anonymous@34.71.121.177) 2.63ms
But, 2 minutes later, the health check starts work again.
[I 240125 04:26:08 health:152] Checking registry status
[I 240125 04:26:08 health:222] Getting pod statistics for gesis
[I 240125 04:26:09 health:117] Health check check_docker_registry took 0.865s
[I 240125 04:26:26 health:152] Checking registry status
[I 240125 04:26:27 health:117] Health check check_docker_registry took 0.790s
[W 240125 04:26:41 log:135] 405 HEAD /binder/ (anonymous@63.143.42.251) 2.82ms
Later, Hub API stop working:
[I 240125 04:39:27 health:222] Getting pod statistics for gesis
[E 240125 04:39:32 launcher:122] Error accessing Hub API (using https://notebooks.gesis.org/binder/jupyter/hub/api/users/hanlpbot-hanlp-binder-xiqbyxpi/server/progress): HTTP 599: HTTP/2 stream 0 was not closed cleanly
: INTERNAL_ERROR (err 2)
[E 240125 04:39:32 launcher:122] Error accessing Hub API (using https://notebooks.gesis.org/binder/jupyter/hub/api/users/scikit-learn-scikit-learn-2eekxsjd/server/progress): HTTP 599: HTTP/2 stream 0 was not closed cle
anly: INTERNAL_ERROR (err 2)
[E 240125 04:39:32 launcher:122] Error accessing Hub API (using https://notebooks.gesis.org/binder/jupyter/hub/api/users/ucb-ee290c-tape-chisel-bootcamp-ju49jgy7/server/progress): HTTP 599: HTTP/2 stream 0 was not clos
ed cleanly: INTERNAL_ERROR (err 2)
[E 240125 04:39:32 launcher:122] Error accessing Hub API (using https://notebooks.gesis.org/binder/jupyter/hub/api/users/geo-python-site-9ojs8hgc/server/progress): HTTP 599: HTTP/2 stream 0 was not closed cleanly: INTE
RNAL_ERROR (err 2)
[E 240125 04:39:32 launcher:122] Error accessing Hub API (using https://notebooks.gesis.org/binder/jupyter/hub/api/users/hanlpbot-hanlp-binder-mca1cur4/server/progress): HTTP 599: HTTP/2 stream 0 was not closed cleanly
: INTERNAL_ERROR (err 2)
[E 240125 04:39:32 launcher:122] Error accessing Hub API (using https://notebooks.gesis.org/binder/jupyter/hub/api/users/ucb-ee290c-tape-chisel-bootcamp-6bx9viqw/server/progress): HTTP 599: HTTP/2 stream 0 was not clos
ed cleanly: INTERNAL_ERROR (err 2)
[E 240125 04:39:32 launcher:122] Error accessing Hub API (using https://notebooks.gesis.org/binder/jupyter/hub/api/users/ipython-ipython-in-depth-ll82qv4v/server/progress): HTTP 599: HTTP/2 stream 0 was not closed clea
nly: INTERNAL_ERROR (err 2)
[E 240125 04:39:32 launcher:122] Error accessing Hub API (using https://notebooks.gesis.org/binder/jupyter/hub/api/users/ucb-ee290c-tape-chisel-bootcamp-i0bt3n4u/server/progress): HTTP 599: HTTP/2 stream 0 was not clos
ed cleanly: INTERNAL_ERROR (err 2)
[E 240125 04:39:32 launcher:122] Error accessing Hub API (using https://notebooks.gesis.org/binder/jupyter/hub/api/users/binder-examples-requirements-9qfd4sec/server/progress): HTTP 599: HTTP/2 stream 0 was not closed
cleanly: INTERNAL_ERROR (err 2)
[I 240125 04:39:32 health:117] Health check check_jupyterhub_api took 302.220s
@arnim I believe that something on the GESIS network might have caused this. I'm excluding something on the Kubernetes network because we didn't change anything.
I didn't found anything relevant.
This is old and I cannot look at the logs.
Around 5:00am UTC+1 of 25 January 2025, GESIS server started to accumulate build pods.
The increase number of build pods continuers from 5:00am UTC+1 until 7:30am UTC+1 when it abruptly dropped.
At the same time, number of user pods on GESIS server decreased.
GESIS server was NOT cordoned during this period.
And CPU load on GESIS server was average during this period.
The popular repositories before the incident were
GESIS server looks normal now. cc @arnim