Open graphicore opened 4 years ago
Here's another related error in the fontbakery-github-operations service. The job itself was fine and was just reporting back its result (_p._sendDispatchResult) via the rabbitmq queue. I guess a big problem with this restart is that the rabbitmq server was down and the clients lost their connection, but they only notice once they are about to send a result ...
DEBUG _push DONE
DEBUG _gitHubPR: graphicore:Font_Bakery_Dispatcher_2019_10_11_ofl_abeezee => graphicore/googleFonts master
(node:1) UnhandledPromiseRejectionWarning: IllegalOperationError: Channel closed
at Channel.<anonymous> (/var/javascript/node_modules/amqplib/lib/channel.js:160:11)
at Channel.C._rpc (/var/javascript/node_modules/amqplib/lib/channel.js:142:8)
at /var/javascript/node_modules/amqplib/lib/channel_model.js:59:17
at tryCatcher (/var/javascript/node_modules/bluebird/js/release/util.js:16:23)
at Function.Promise.fromNode.Promise.fromCallback (/var/javascript/node_modules/bluebird/js/release/promise.js:185:30)
at Channel.C.rpc (/var/javascript/node_modules/amqplib/lib/channel_model.js:58:18)
at Channel.C.assertQueue (/var/javascript/node_modules/amqplib/lib/channel_model.js:86:15)
at IOOperations._p.sendQueueMessage (/var/javascript/node/util/IOOperations.js:123:30)
at GitHubOperationsServer._p._sendDispatchResult (/var/javascript/node/GitHubOperationsServer.js:449:21)
at Promise.all.then.then.then.report (/var/javascript/node/GitHubOperationsServer.js:431:24)
at processTicksAndRejections (internal/process/task_queues.js:86:5)
(node:1) UnhandledPromiseRejectionWarning: Unhandled promise rejection. This error originated either by throwing inside of an async function without a catch block, or by rejecting a promise which was not handled with .catch(). (rejection id: 1)
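One way to notice the lost connection earlier, and to keep the send path from producing an unhandled rejection, could look roughly like the sketch below. It assumes an amqplib-style channel, i.e. an EventEmitter that emits 'close' and 'error' and offers assertQueue/sendToQueue. guardChannel and onLost are hypothetical names and not part of IOOperations.js.

```javascript
// Sketch only: `channel` is assumed to behave like an amqplib channel
// (emits 'close' and 'error', has assertQueue() and sendToQueue()).
// `guardChannel` and `onLost` are hypothetical names for illustration.
function guardChannel(channel, onLost) {
    let lost = false;
    const markLost = (reason) => {
        if (lost) return; // report the loss only once
        lost = true;
        onLost(reason);
    };
    // Learn about a dead channel right away, not on the next send.
    channel.on('close', () => markLost(new Error('channel closed')));
    channel.on('error', (err) => markLost(err));

    return {
        isLost: () => lost,
        // Resolves to true when the message was handed to the channel and
        // to false otherwise; it never rejects, so a closed channel cannot
        // cause an UnhandledPromiseRejectionWarning here.
        async send(queueName, payload) {
            if (lost) return false;
            try {
                await channel.assertQueue(queueName);
                channel.sendToQueue(queueName, Buffer.from(payload));
                return true;
            } catch (err) {
                markLost(err);
                return false;
            }
        }
    };
}
```

Note that a false return means the result was not delivered; the caller still has to decide whether to retry after reconnecting or to give up and let the process exit.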
If it's the rabbitmq service, we can provoke this problem by restarting it from a running, stable state; we should then see similar problems. Usually that service is rock solid and never needs a restart, also because it's hardly ever updated.
And indeed, the rabbitmq pod age is 5h33m and github-operations is 5h47m, so rabbitmq was restarted after github-operations came online.
(The much younger pods in the listing below were re-initialized by me with kubectl delete pod ...)
> kubectl get pods
NAME READY STATUS RESTARTS AGE
fontbakery-api-6bb4b7789-r5cxl 1/1 Running 3 5h30m
fontbakery-dispatcher-6d4c66cc94-r4gdj 1/1 Running 0 63m
fontbakery-github-auth-558594dc5c-p2mjb 1/1 Running 0 5h30m
fontbakery-github-operations-68f5f75d8c-s2hbd 1/1 Running 0 5h47m
fontbakery-init-workers-78c7578866-fbcd9 1/1 Running 0 49m
fontbakery-manifest-csvupstream-7bfc84f4fb-wtn5c 1/1 Running 0 93m
fontbakery-manifest-gfapi-7d9c898fcb-lf6xz 1/1 Running 0 5h47m
fontbakery-manifest-githubgf-6795f6c9df-7tqn7 1/1 Running 0 5h30m
fontbakery-manifest-master-5c69ccccb-8ml9f 1/1 Running 0 53m
fontbakery-reports-c89b49674-j7wtp 1/1 Running 3 5h30m
fontbakery-storage-cache-fb54d56cf-r4pm9 1/1 Running 0 5h47m
fontbakery-storage-persistence-66d87689df-7qj84 1/1 Running 0 5h51m
fontbakery-worker-dfc84586b-9cpfm 1/1 Running 0 72m
fontbakery-worker-dfc84586b-kzqtk 1/1 Running 0 71m
rabbitmq-54365647-gn4bz 1/1 Running 0 5h33m
rethinkdb-0 1/1 Running 0 5h43m
rethinkdb-1 1/1 Running 0 6h
rethinkdb-2 1/1 Running 0 5h56m
rethinkdb-3 1/1 Running 0 5h36m
rethinkdb-proxy-64b674759b-2qhnh 1/1 Running 0 5h30m
A lot of pods didn't communicate well after the restart. There were database connection issues etc. The Kubernetes CrashLoopBackOff mechanism didn't bring the cluster into a stable state, so I guess some of the services need to do a proper self-test after starting up and die when they are not well connected. Also, if a connection goes bad after a node has initialized successfully, it should shut itself down as well, gracefully waiting for all running tasks to fail or finish first.
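The "wait for all running tasks to fail or finish, then shut down" part could be sketched like this. TaskTracker, track and drain are hypothetical names; nothing like this exists in the services yet as far as this issue is concerned. The exit callback would typically call process.exit with a non-zero code so that Kubernetes restarts the pod.

```javascript
// Sketch only: TaskTracker and its methods are hypothetical.
class TaskTracker {
    constructor() {
        this._running = new Set();
        this._draining = false;
    }

    // Register a task promise. Returns false once shutdown has started,
    // so the caller can refuse new work.
    track(promise) {
        if (this._draining) return false;
        // Only "settled or not" matters here; the task's owner still
        // sees the real result or error on the original promise.
        const settled = promise.then(() => {}, () => {});
        this._running.add(settled);
        settled.then(() => this._running.delete(settled));
        return true;
    }

    // Stop accepting work, wait until every tracked task has finished
    // or failed, then call `exit`, e.g. () => process.exit(1).
    async drain(exit) {
        this._draining = true;
        await Promise.all([...this._running]);
        exit();
    }
}
```

A connection-loss handler would then call something like tracker.drain(() => process.exit(1)) instead of exiting immediately.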
Here's a log of the fontbakery-manifest-csvupstream service struggling to provide file packages after this incident: fbd_csvupstream_crashes.txt
Another annoyance was the Font Bakery reports cache, which uses the ENVIRONMENT_VERSION variable. I had to reset that variable and restart the fontbakery-init-workers service to get Font Bakery workers executing for some processes. Caching is not a bad idea here, but this was a good example of how it can go wrong (example process and its stuck Font Bakery report). Timeouts for workers would be nice here too, especially if we could free the stuck Font Bakery reports as well.