postalserver / postal

📮 A fully featured open source mail delivery platform for incoming & outgoing e-mail
https://postalserver.io
MIT License

If one rabbit pod crashes, postal apps freeze #2067

Closed: chrism417 closed this issue 2 years ago

chrism417 commented 2 years ago

Describe the bug

We're running rabbitmq-ha and if any of our three pods crash due to OOMKilling or just being moved to a new node, every app connecting to rabbit freezes and doesn't restart. For example, the worker will stop with the following logs but will never restart or reconnect:

W, [2022-07-11T20:21:51.437658 #1] WARN -- #<Bunny::Session:0x55cdcfb158f8 postal@postal-rabbit.default:5672, vhost=postal, addresses=[postal-rabbit.default:5672]>: Recovering from connection.close (CONNECTION_FORCED - broker forced connection closure with reason 'shutdown')
W, [2022-07-11T20:21:51.438122 #1] WARN -- #<Bunny::Session:0x55cdcfb158f8 postal@postal-rabbit.default:5672, vhost=postal, addresses=[postal-rabbit.default:5672]>: Will recover from a network failure (no retry limit)...
W, [2022-07-11T20:22:01.438561 #1] WARN -- #<Bunny::Session:0x55cdcfb158f8 postal@postal-rabbit.default:5672, vhost=postal, addresses=[postal-rabbit.default:5672]>: Retrying connection on next host in line: postal-rabbit.default:5672
W, [2022-07-11T20:22:16.449911 #1] WARN -- #<Bunny::Session:0x55cdcfb158f8 postal@postal-rabbit.default:5672, vhost=postal, addresses=[postal-rabbit.default:5672]>: TCP connection failed, reconnecting in 5.0 seconds
W, [2022-07-11T20:22:16.450312 #1] WARN -- #<Bunny::Session:0x55cdcfb158f8 postal@postal-rabbit.default:5672, vhost=postal, addresses=[postal-rabbit.default:5672]>: Will recover from a network failure (no retry limit)...
W, [2022-07-11T20:22:26.450770 #1] WARN -- #<Bunny::Session:0x55cdcfb158f8 postal@postal-rabbit.default:5672, vhost=postal, addresses=[postal-rabbit.default:5672]>: Retrying connection on next host in line: postal-rabbit.default:5672
E, [2022-07-11T20:24:28.548876 #1] ERROR -- #<Bunny::Session:0x55cdcfb158f8 postal@postal-rabbit.default:5672, vhost=postal, addresses=[postal-rabbit.default:5672]>: Got an exception when receiving data: IO timeout when reading 7 bytes (Timeout::Error)
W, [2022-07-11T20:24:28.549027 #1] WARN -- #<Bunny::Session:0x55cdcfb158f8 postal@postal-rabbit.default:5672, vhost=postal, addresses=[postal-rabbit.default:5672]>: Exception in the reader loop: Timeout::Error: IO timeout when reading 7 bytes
W, [2022-07-11T20:24:28.549077 #1] WARN -- #<Bunny::Session:0x55cdcfb158f8 postal@postal-rabbit.default:5672, vhost=postal, addresses=[postal-rabbit.default:5672]>: Backtrace:
W, [2022-07-11T20:24:28.549126 #1] WARN -- #<Bunny::Session:0x55cdcfb158f8 postal@postal-rabbit.default:5672, vhost=postal, addresses=[postal-rabbit.default:5672]>: /usr/local/bundle/gems/bunny-2.14.4/lib/bunny/cruby/socket.rb:68:in `rescue in read_fully'
W, [2022-07-11T20:24:28.549164 #1] WARN -- #<Bunny::Session:0x55cdcfb158f8 postal@postal-rabbit.default:5672, vhost=postal, addresses=[postal-rabbit.default:5672]>: /usr/local/bundle/gems/bunny-2.14.4/lib/bunny/cruby/socket.rb:56:in `read_fully'
W, [2022-07-11T20:24:28.549309 #1] WARN -- #<Bunny::Session:0x55cdcfb158f8 postal@postal-rabbit.default:5672, vhost=postal, addresses=[postal-rabbit.default:5672]>: /usr/local/bundle/gems/bunny-2.14.4/lib/bunny/transport.rb:239:in `read_fully'
W, [2022-07-11T20:24:28.549332 #1] WARN -- #<Bunny::Session:0x55cdcfb158f8 postal@postal-rabbit.default:5672, vhost=postal, addresses=[postal-rabbit.default:5672]>: /usr/local/bundle/gems/bunny-2.14.4/lib/bunny/transport.rb:261:in `read_next_frame'
W, [2022-07-11T20:24:28.549347 #1] WARN -- #<Bunny::Session:0x55cdcfb158f8 postal@postal-rabbit.default:5672, vhost=postal, addresses=[postal-rabbit.default:5672]>: /usr/local/bundle/gems/bunny-2.14.4/lib/bunny/reader_loop.rb:74:in `run_once'
W, [2022-07-11T20:24:28.549361 #1] WARN -- #<Bunny::Session:0x55cdcfb158f8 postal@postal-rabbit.default:5672, vhost=postal, addresses=[postal-rabbit.default:5672]>: /usr/local/bundle/gems/bunny-2.14.4/lib/bunny/reader_loop.rb:39:in `block in run_loop'
W, [2022-07-11T20:24:28.549375 #1] WARN -- #<Bunny::Session:0x55cdcfb158f8 postal@postal-rabbit.default:5672, vhost=postal, addresses=[postal-rabbit.default:5672]>: /usr/local/bundle/gems/bunny-2.14.4/lib/bunny/reader_loop.rb:36:in `loop'
W, [2022-07-11T20:24:28.549390 #1] WARN -- #<Bunny::Session:0x55cdcfb158f8 postal@postal-rabbit.default:5672, vhost=postal, addresses=[postal-rabbit.default:5672]>: /usr/local/bundle/gems/bunny-2.14.4/lib/bunny/reader_loop.rb:36:in `run_loop'
W, [2022-07-11T20:24:28.549412 #1] WARN -- #<Bunny::Session:0x55cdcfb158f8 postal@postal-rabbit.default:5672, vhost=postal, addresses=[postal-rabbit.default:5672]>: Will recover from a network failure (no retry limit)...
W, [2022-07-11T20:24:38.549794 #1] WARN -- #<Bunny::Session:0x55cdcfb158f8 postal@postal-rabbit.default:5672, vhost=postal, addresses=[postal-rabbit.default:5672]>: Retrying connection on next host in line: postal-rabbit.default:5672
E, [2022-07-11T20:26:38.561127 #1] ERROR -- #<Bunny::Session:0x55cdcfb158f8 postal@postal-rabbit.default:5672, vhost=postal, addresses=[postal-rabbit.default:5672]>: Got an exception when receiving data: IO timeout when reading 7 bytes (Timeout::Error)
W, [2022-07-11T20:28:38.547883 #1] WARN -- #<Bunny::Session:0x55cdcfb158f8 postal@postal-rabbit.default:5672, vhost=postal, addresses=[postal-rabbit.default:5672]>: Recovering from connection.close (CONNECTION_FORCED - broker forced connection closure with reason 'shutdown')
W, [2022-07-11T20:28:38.548052 #1] WARN -- #<Bunny::Session:0x55cdcfb158f8 postal@postal-rabbit.default:5672, vhost=postal, addresses=[postal-rabbit.default:5672]>: Will recover from a network failure (no retry limit)...
W, [2022-07-11T20:28:48.548464 #1] WARN -- #<Bunny::Session:0x55cdcfb158f8 postal@postal-rabbit.default:5672, vhost=postal, addresses=[postal-rabbit.default:5672]>: Retrying connection on next host in line: postal-rabbit.default:5672

To Reproduce

1. Run postal cron or postal worker
2. Run rabbitmq-ha
3. Delete one rabbitmq pod

Expected behaviour

If any app connecting to rabbit fails, restart the app or reconnect

Environment details

Deployed in k8s

willpower232 commented 2 years ago

The stack trace implies that it already retried and ran out of attempts. Unfortunately, Postal is no longer responsible for keeping itself running; that is down to Docker (or Kubernetes), so the onus is on your monitoring, I'm afraid.

chrism417 commented 2 years ago

> The stack trace implies that it already retried and ran out of attempts. Unfortunately Postal is no longer responsible for keeping itself running, that is down to docker (or kubernetes) so the onus is on your monitoring I'm afraid.

I understand; however, if the onus is on us, then Postal should be monitoring the apps for the crash.

chrism417 commented 2 years ago

@willpower232 also, the requeuer restarts the app when it loses connection to rabbit, but none of the other apps do. Can this same restart be applied to cron/worker/etc?
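
For illustration only, the restart-on-lost-connection behaviour asked for here could be sketched as a small watchdog that exits the process once the broker connection has been unhealthy for several consecutive checks, so that Docker or Kubernetes restarts the container. This is a minimal sketch and not how Postal or its requeuer actually works: `ConnectionWatchdog`, `max_failures`, `healthy` and `on_give_up` are hypothetical names made up for this example, and the health check is passed in as a lambda so the class does not depend on Bunny.

```ruby
# Hypothetical watchdog; not part of Postal or Bunny.
class ConnectionWatchdog
  attr_reader :failures

  # max_failures: consecutive failed checks tolerated before giving up
  # healthy:      callable returning true while the connection is usable
  # on_give_up:   callable invoked once the failure limit is reached
  def initialize(max_failures:, healthy:, on_give_up:)
    @max_failures = max_failures
    @healthy = healthy
    @on_give_up = on_give_up
    @failures = 0
  end

  # Runs a single health check. Returns true if the check passed.
  def tick
    if @healthy.call
      @failures = 0 # any successful check resets the counter
      return true
    end

    @failures += 1
    @on_give_up.call if @failures >= @max_failures
    false
  end

  # Polls in a background thread, one check every `interval` seconds.
  def start(interval: 5)
    Thread.new do
      loop do
        tick
        sleep interval
      end
    end
  end
end

# Possible wiring, assuming `connection` is an already started Bunny::Session:
#
#   watchdog = ConnectionWatchdog.new(
#     max_failures: 6,
#     healthy: -> { connection.open? },
#     on_give_up: -> { Process.exit!(1) } # let the orchestrator restart the pod
#   )
#   watchdog.start(interval: 10)
```

With an interval of 10 seconds and `max_failures: 6`, the process would exit after roughly a minute without a usable connection; both values are arbitrary and would need tuning against how long a RabbitMQ pod normally takes to come back.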