googlefonts / fontbakery-dashboard

A library-scale web dashboard for Font Bakery, no longer developed
Apache License 2.0

Stability issues after cluster was completely restarted by the provider. #142

Open graphicore opened 4 years ago

graphicore commented 4 years ago

A lot of pods didn't communicate well after the restart: there were database connection issues, among other things. The Kubernetes CrashLoopBackOff mechanism didn't bring the cluster into a stable state, so I guess some of the services need to run a proper self-test after starting up and die when they are not well connected.
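A minimal sketch of such a startup self-test, not code from this repository. `selfTest` and `withTimeout` are hypothetical names, and the probes passed in (e.g. one that pings rabbitmq, one that queries rethinkdb) are assumed to be supplied by each service:

```javascript
'use strict';

// Reject if `promise` does not settle within `ms` milliseconds.
function withTimeout(promise, ms, label) {
  let timer;
  const timeout = new Promise((_, reject) => {
    timer = setTimeout(
      () => reject(new Error(`${label} timed out after ${ms}ms`)), ms);
  });
  return Promise.race([promise, timeout]).finally(() => clearTimeout(timer));
}

// `checks` maps a dependency name to an async probe function (hypothetical).
// Resolves to a list of failure messages; an empty list means healthy.
async function selfTest(checks, ms = 5000) {
  const names = Object.keys(checks);
  const results = await Promise.allSettled(names.map(
    // Promise.resolve().then(...) also turns a synchronous throw into a rejection.
    name => withTimeout(Promise.resolve().then(checks[name]), ms, name)));
  return results
    .map((r, i) => r.status === 'rejected'
      ? `${names[i]}: ${r.reason.message}` : null)
    .filter(Boolean);
}
```

A service could call `selfTest(...)` right after initialization and `process.exit(1)` when the returned list is not empty, so that the restart loop keeps retrying until all dependencies are actually reachable.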

Also, if a connection goes bad after a node has initialized successfully, the node should shut itself down as well, gracefully waiting for all running tasks to fail or finish before doing so.
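The graceful shutdown could look roughly like this. It is a sketch under assumptions: `GracefulShutdown` is a hypothetical helper, and the connection is only assumed to be an event emitter that signals `'close'` and `'error'`, as amqplib connections and channels do:

```javascript
'use strict';
const { EventEmitter } = require('events');

// Hypothetical helper: tracks in-flight tasks and exits once a watched
// connection reports a failure and all tracked tasks have settled.
class GracefulShutdown {
  constructor(exit = code => process.exit(code)) {
    this._running = new Set();
    this._exit = exit;
    this.shuttingDown = false;
  }

  // Register a task promise; refuse new work once shutdown has begun.
  track(taskPromise) {
    if (this.shuttingDown)
      throw new Error('Shutting down, not accepting new tasks');
    // Swallow the outcome here: we only care that the task has settled.
    const settled = taskPromise.then(() => {}, () => {});
    this._running.add(settled);
    settled.then(() => this._running.delete(settled));
    return taskPromise;
  }

  // Start shutting down when the connection closes or errors.
  watch(connection) {
    connection.on('close', () => this.shutdown(1));
    connection.on('error', () => this.shutdown(1));
  }

  // Wait for all tracked tasks to fail or finish, then exit.
  async shutdown(code) {
    if (this.shuttingDown) return;
    this.shuttingDown = true;
    await Promise.all([...this._running]);
    this._exit(code);
  }
}
```

Exiting with a non-zero code leaves the restart to Kubernetes, which fits the idea that a badly connected service should die rather than linger.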

Here's a log of the fontbakery-manifest-csvupstream service struggling to provide file packages after this incident:

fbd_csvupstream_crashes.txt

Another annoyance was the Font Bakery reports cache, which uses the ENVIRONMENT_VERSION variable. I had to reset that variable and restart the fontbakery-init-workers service to get Font Bakery workers executing for some processes. Caching is not a bad idea here, but this was a good example of how it can go wrong! (See the example process and its stuck Font Bakery report.) Timeouts for workers would be nice here too, especially if we could free the stuck Font Bakery reports as well.
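One way to free stuck reports would be a periodic sweep that finds reports which were started but never finished. This is only a sketch: the report shape (`id`, `started`, `finished`, `exception`) and both function names are assumptions for illustration, not the actual schema:

```javascript
'use strict';

// Hypothetical report shape: { id, started: Date|null, finished: Date|null }.
// Returns the reports that started more than `maxAgeMs` ago and never finished.
function findStuckReports(reports, maxAgeMs, now = Date.now()) {
  return reports.filter(report =>
    report.started && !report.finished
    && now - new Date(report.started).getTime() > maxAgeMs);
}

// Returns a copy marked as finished with a timeout error, so that whatever
// waits for the report can proceed. The input object is left untouched.
function releaseStuckReport(report, now = Date.now()) {
  return {
    ...report,
    finished: new Date(now),
    exception: 'Timed out waiting for worker'
  };
}
```

The sweep interval and `maxAgeMs` would have to be chosen well above the longest legitimate Font Bakery run, otherwise healthy reports get cut off.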

graphicore commented 4 years ago

Here's another related error, this time in the fontbakery-github-operations service. The job itself was fine and was just reporting back the result (_p._sendDispatchResult) via the rabbitmq queue. I guess a big problem we had with this restart is that the rabbitmq server was down and the clients lost their connection, but they only realize that when they are about to send a result ...

DEBUG _push DONE
DEBUG _gitHubPR: graphicore:Font_Bakery_Dispatcher_2019_10_11_ofl_abeezee => graphicore/googleFonts master
(node:1) UnhandledPromiseRejectionWarning: IllegalOperationError: Channel closed
    at Channel.<anonymous> (/var/javascript/node_modules/amqplib/lib/channel.js:160:11)
    at Channel.C._rpc (/var/javascript/node_modules/amqplib/lib/channel.js:142:8)
    at /var/javascript/node_modules/amqplib/lib/channel_model.js:59:17
    at tryCatcher (/var/javascript/node_modules/bluebird/js/release/util.js:16:23)
    at Function.Promise.fromNode.Promise.fromCallback (/var/javascript/node_modules/bluebird/js/release/promise.js:185:30)
    at Channel.C.rpc (/var/javascript/node_modules/amqplib/lib/channel_model.js:58:18)
    at Channel.C.assertQueue (/var/javascript/node_modules/amqplib/lib/channel_model.js:86:15)
    at IOOperations._p.sendQueueMessage (/var/javascript/node/util/IOOperations.js:123:30)
    at GitHubOperationsServer._p._sendDispatchResult (/var/javascript/node/GitHubOperationsServer.js:449:21)
    at Promise.all.then.then.then.report (/var/javascript/node/GitHubOperationsServer.js:431:24)
    at processTicksAndRejections (internal/process/task_queues.js:86:5)
(node:1) UnhandledPromiseRejectionWarning: Unhandled promise rejection. This error originated either by throwing inside of an async function without a catch block, or by rejecting a promise which was not handled with .catch(). (rejection id: 1)

the stuck process
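To notice the lost channel when it happens, rather than at send time, a client could listen for the channel's `'close'` and `'error'` events. A rough sketch, assuming an amqplib-style channel object; `WatchedChannel`, `onLost` and the queue options shown are illustrative assumptions, not what IOOperations.js does:

```javascript
'use strict';
const { EventEmitter } = require('events');

// Hypothetical wrapper around an amqplib-style channel (an event emitter
// with assertQueue and sendToQueue methods).
class WatchedChannel {
  constructor(channel, onLost) {
    this._channel = channel;
    this.alive = true;
    const lost = err => {
      if (!this.alive) return; // report the loss only once
      this.alive = false;
      onLost(err || new Error('Channel closed'));
    };
    channel.on('close', () => lost());
    channel.on('error', lost);
  }

  // Fail fast with a descriptive error instead of "IllegalOperationError".
  async send(queueName, buffer) {
    if (!this.alive)
      throw new Error(`Cannot send to "${queueName}": channel is closed`);
    // Options are assumptions; use whatever the queue was declared with.
    await this._channel.assertQueue(queueName, { durable: true });
    return this._channel.sendToQueue(queueName, buffer, { persistent: true });
  }
}

// Make an overlooked rejection fatal, so the pod restarts instead of
// continuing in a half-connected state.
process.on('unhandledRejection', err => {
  console.error(err);
  process.exit(1);
});
```

`onLost` is the place to trigger a reconnect or a graceful shutdown; which of the two is appropriate depends on whether the service can safely hold its result until the queue is back.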

graphicore commented 4 years ago

If the rabbitmq service is the cause, we should be able to provoke this problem by restarting it from a running, stable state, and we should then see similar problems. Usually that service is rock solid and never needs a restart, partly because it's hardly ever updated.

graphicore commented 4 years ago

And indeed, the rabbitmq pod age is 5h33m and github-operations is 5h47m, so rabbitmq was restarted after github-operations was online.

(The much younger pods in the example below were re-initialized by me with kubectl delete pod ...)

> kubectl get pods
NAME                                               READY   STATUS    RESTARTS   AGE
fontbakery-api-6bb4b7789-r5cxl                     1/1     Running   3          5h30m
fontbakery-dispatcher-6d4c66cc94-r4gdj             1/1     Running   0          63m
fontbakery-github-auth-558594dc5c-p2mjb            1/1     Running   0          5h30m
fontbakery-github-operations-68f5f75d8c-s2hbd      1/1     Running   0          5h47m
fontbakery-init-workers-78c7578866-fbcd9           1/1     Running   0          49m
fontbakery-manifest-csvupstream-7bfc84f4fb-wtn5c   1/1     Running   0          93m
fontbakery-manifest-gfapi-7d9c898fcb-lf6xz         1/1     Running   0          5h47m
fontbakery-manifest-githubgf-6795f6c9df-7tqn7      1/1     Running   0          5h30m
fontbakery-manifest-master-5c69ccccb-8ml9f         1/1     Running   0          53m
fontbakery-reports-c89b49674-j7wtp                 1/1     Running   3          5h30m
fontbakery-storage-cache-fb54d56cf-r4pm9           1/1     Running   0          5h47m
fontbakery-storage-persistence-66d87689df-7qj84    1/1     Running   0          5h51m
fontbakery-worker-dfc84586b-9cpfm                  1/1     Running   0          72m
fontbakery-worker-dfc84586b-kzqtk                  1/1     Running   0          71m
rabbitmq-54365647-gn4bz                            1/1     Running   0          5h33m
rethinkdb-0                                        1/1     Running   0          5h43m
rethinkdb-1                                        1/1     Running   0          6h
rethinkdb-2                                        1/1     Running   0          5h56m
rethinkdb-3                                        1/1     Running   0          5h36m
rethinkdb-proxy-64b674759b-2qhnh                   1/1     Running   0          5h30m