sleeyax closed this issue 2 years ago
Logs in prod are cluttered with the following message, repeated many times per second. The volume causes enough disk pressure that pods get evicted:
2022/05/06 12:09:00 worker 9688xxxxxx: Message channel closed. Trying to reconnect...
2022/05/06 12:09:00 worker 9688xxxxxx: Message channel closed. Trying to reconnect...
2022/05/06 12:09:00 worker 9688xxxxxx: Message channel closed. Trying to reconnect...
2022/05/06 12:09:00 worker 9688xxxxxx: Message channel closed. Trying to reconnect...
2022/05/06 12:09:00 worker 9688xxxxxx: Message channel closed. Trying to reconnect...
2022/05/06 12:09:00 worker 9688xxxxxx: Message channel closed. Trying to reconnect...
2022/05/06 12:09:00 worker 9688xxxxxx: Message channel closed. Trying to reconnect...
2022/05/06 12:09:00 worker 9688xxxxxx: Message channel closed. Trying to reconnect...
2022/05/06 12:09:00 worker 9688xxxxxx: Message channel closed. Trying to reconnect...
2022/05/06 12:09:00 worker 9688xxxxxx: Message channel closed. Trying to reconnect...
2022/05/06 12:09:00 worker 9688xxxxxx: Message channel closed. Trying to reconnect...
2022/05/06 12:09:00 worker 9688xxxxxx: Message channel closed. Trying to reconnect...
2022/05/06 12:09:00 worker 9688xxxxxx: Message channel closed. Trying to reconnect...
2022/05/06 12:09:00 worker 9688xxxxxx: Message channel closed. Trying to reconnect...
2022/05/06 12:09:00 worker 9688xxxxxx: Message channel closed. Trying to reconnect...
2022/05/06 12:09:00 worker 9688xxxxxx: Message channel closed. Trying to reconnect...
2022/05/06 12:09:00 worker 9688xxxxxx: Message channel closed. Trying to reconnect...
I should either reduce logging or figure out why the same worker keeps trying to reconnect over and over.
The problem is that we keep retrying the connection on every error (e.g. when Aternos is temporarily down), with no limit. We should store a max reconnects value and return an error once it is reached instead.