sleeyax / aternos-discord-bot

Discord bot to start & stop a Minecraft server automatically
https://hub.docker.com/r/sleeyax/aternos-discord-bot/tags
MIT License
75 stars 93 forks

Resource starvation due to reconnect spam #23

Closed sleeyax closed 2 years ago

sleeyax commented 2 years ago

Logs in prod are cluttered with the following message. The volume causes enough disk pressure that pods get evicted:

2022/05/06 12:09:00 worker 9688xxxxxx: Message channel closed. Trying to reconnect...
2022/05/06 12:09:00 worker 9688xxxxxx: Message channel closed. Trying to reconnect...
2022/05/06 12:09:00 worker 9688xxxxxx: Message channel closed. Trying to reconnect...
2022/05/06 12:09:00 worker 9688xxxxxx: Message channel closed. Trying to reconnect...
2022/05/06 12:09:00 worker 9688xxxxxx: Message channel closed. Trying to reconnect...
2022/05/06 12:09:00 worker 9688xxxxxx: Message channel closed. Trying to reconnect...
2022/05/06 12:09:00 worker 9688xxxxxx: Message channel closed. Trying to reconnect...
2022/05/06 12:09:00 worker 9688xxxxxx: Message channel closed. Trying to reconnect...
2022/05/06 12:09:00 worker 9688xxxxxx: Message channel closed. Trying to reconnect...
2022/05/06 12:09:00 worker 9688xxxxxx: Message channel closed. Trying to reconnect...
2022/05/06 12:09:00 worker 9688xxxxxx: Message channel closed. Trying to reconnect...
2022/05/06 12:09:00 worker 9688xxxxxx: Message channel closed. Trying to reconnect...
2022/05/06 12:09:00 worker 9688xxxxxx: Message channel closed. Trying to reconnect...
2022/05/06 12:09:00 worker 9688xxxxxx: Message channel closed. Trying to reconnect...
2022/05/06 12:09:00 worker 9688xxxxxx: Message channel closed. Trying to reconnect...
2022/05/06 12:09:00 worker 9688xxxxxx: Message channel closed. Trying to reconnect...
2022/05/06 12:09:00 worker 9688xxxxxx: Message channel closed. Trying to reconnect...

I should either reduce logging or figure out why the same worker keeps trying to reconnect many times within the same second.

sleeyax commented 2 years ago

The problem is that we keep retrying the connection on every error (e.g. when Aternos is temporarily down), with no limit on the number of attempts. We should store a max reconnects value and return an error once it is reached.