open-contracting / deploy

Deployment configuration and scripts
https://ocdsdeploy.readthedocs.io/en/latest/
Apache License 2.0

Investigate RabbitMQ restarts #481

Closed jpmckinney closed 5 months ago

jpmckinney commented 5 months ago

The info messages in /var/log/rabbitmq/rabbit@ocp##.log don't seem relevant. To find the other levels (notice, warning, error), search with grep -v info, or with zgrep -v info for compressed logs.
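For a slightly stricter filter than grep -v info (which drops any line containing "info" anywhere), here is a minimal Python sketch. It assumes each log line carries its level in square brackets, e.g. [warning]; that format, and the function names non_info_lines and read_log, are assumptions for illustration, not something taken from the servers.

```python
import gzip
import re
import sys
from pathlib import Path

# Assumption: each log line contains its level in brackets, e.g. "[warning]".
LEVEL = re.compile(r"\[(debug|info|notice|warning|error|critical)\]")


def non_info_lines(lines):
    """Yield lines that have a bracketed level other than "info"."""
    for line in lines:
        match = LEVEL.search(line)
        if match and match.group(1) != "info":
            yield line.rstrip("\n")


def read_log(path):
    """Open a plain or gzip-compressed log file as text."""
    path = Path(path)
    opener = gzip.open if path.suffix == ".gz" else open
    return opener(path, "rt", encoding="utf-8", errors="replace")


if __name__ == "__main__":
    # Usage: python filter_log.py LOGFILE [LOGFILE.gz ...]
    for name in sys.argv[1:]:
        with read_log(name) as f:
            for line in non_info_lines(f):
                print(line)
```

Unlike grep -v, this also skips lines that have no recognizable level at all, so adjust the pattern if the log contains such lines and they matter.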

The registry server (ocp13) got "RabbitMQ is asked to stop..." on 2024-01-18 10:10:46, and it had stopped by 2024-01-18 10:10:51. It then started again at 2024-01-18 10:10:54.

Looking in Prometheus, the only signals are that memory usage and swap dropped after the restart (not surprising), but usage was not high before the restart (40%, 175MB).

Looking at /var/log/syslog, I see messages relating to apt around the same time, so I assume RabbitMQ was upgraded and therefore restarted.

This generated messages in Kingfisher Collect, because it uses a blocking connection and not an async client (only the latter can handle connection close events). To resolve that, we need to close https://github.com/open-contracting/kingfisher-collect/issues/1033


I'll keep this issue open to investigate any other restarts. #238 explains another restart scenario.

dogsbody-ashley commented 5 months ago

This was due to a rabbitmq-server patch.

jpmckinney commented 5 months ago

I closed https://github.com/open-contracting/kingfisher-collect/issues/1033, so I'll close this issue.

If there are any new RabbitMQ-related messages in Sentry, I can use this issue in future.

jpmckinney commented 5 months ago

RabbitMQ restarts might still cause errors to be reported. If so, I think the solution is here: https://github.com/open-contracting/yapw/issues/2#issuecomment-1911046356

Kingfisher Collect has had issues with restarts because it only publishes messages, and does so over a long period of time. The others only ack, nack or publish messages after consuming a message. Since RabbitMQ cancels consumers when restarting, there may be only a narrow window in which a consumer can attempt a method on a closing or closed connection.
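To illustrate the failure mode for a long-lived publisher, here is a minimal sketch of reconnect-and-retry around a publish call. It is not the approach from the linked yapw comment and not Kingfisher Collect's code: ConnectionClosed, ReconnectingPublisher and the connect factory are hypothetical stand-ins for whatever the client library provides.

```python
import time


class ConnectionClosed(Exception):
    """Hypothetical stand-in for a client library's "connection closed" error."""


class ReconnectingPublisher:
    """Publish messages, reconnecting if the broker closed the connection.

    `connect` is a hypothetical factory returning an object with a
    `publish(message)` method that raises ConnectionClosed on failure.
    """

    def __init__(self, connect, attempts=3, delay=1.0, sleep=time.sleep):
        self._connect = connect
        self._attempts = attempts
        self._delay = delay
        self._sleep = sleep
        self._client = None  # connect lazily, on first publish

    def publish(self, message):
        """Return the attempt number on which the publish succeeded."""
        for attempt in range(1, self._attempts + 1):
            try:
                if self._client is None:
                    self._client = self._connect()
                self._client.publish(message)
                return attempt
            except ConnectionClosed:
                # Drop the dead connection so the next attempt reconnects.
                self._client = None
                if attempt == self._attempts:
                    raise
                self._sleep(self._delay)
```

In this log, the broker was down for only a few seconds, so a small number of attempts with a short delay would have covered the restart. A retried publish can in principle deliver a message twice, so this only suits messages where that is acceptable.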