the way we manage the bots is a bit fucked - Githubissues

gwillen / herring

Modern Django and ReactJS based web application for puzzlehunt team management!

MIT License

7 stars 4 forks source link

the way we manage the bots is a bit fucked #32

Open gwillen opened 3 years ago

gwillen commented 3 years ago

This is just a general tracking issue / reminder about this problem. I don't have a specific approach for fixing this set in stone.

Setting ERRORS_TO_DISCORD doesn't cause this problem, but it does aggravate it, by sending a bunch of messages to discord immediately on startup, and a bunch in general, which seems to trigger the discord client to misbehave / get slow / get ratelimited / ???.

The way we manage our bots (discord listener and slack) is to run three Tasks concurrently, absorbing a celery process on its main thread, with one for each bot, plus one to maintain our hold on a Redis mutex that says we got this.

Previously we were using wait(), which was a bit janky in a few ways. Now we're using gather(), which has slightly different behavior -- if one of the tasks throws, it leaves the others running but returns (which causes us to release the mutex.) This is kind of janky. However the wait() behavior also seems janky -- if something throws in the discord listener, it seemingly should stay dead forever, while the redis-mutex-maintainer thread stays alive.

HOWEVER, our timeout in do_in_loop is 20, whereas our redis mutex has a 10s timeout. So in practice what happens seems to be this: the discord announcer bot loses its server connection / gets throttled / gets sad / ???. This hangs our thread for 20 seconds. Then it throws an exception. Because we hung for 20s, the mutex times out and is released.

Then with gather(), the thrown exception causes us to try to release the mutex, which fails because it already timed out; meanwhile the mutex-keeping thread tries to reacquire the mutex, which is expected to fail since we no longer hold it, and should also throw and die.

Whereas with wait(), we will keep wait()ing forever (guaranteed since the slack Task will never terminate if Slack is disabled.) But the mutex-keeping thread will still die, so in practice with wait() we leak a celery worker here, and with gather() we don't, but either way some other worker will notice the mutex is free, grab it, and start things up again.

gwillen commented 3 years ago

One thing we should probably do to simplify this: mutual exclusion of discord and slack, so there's only ever one bot-management task at most.

gwillen commented 3 years ago

Another thing we could do is disable ERRORS_TO_DISCORD and hope for the best, since at least nothing will be excessively provoking the issues that will still be in there.

It does seem to be the case that things basically work this way, they are just super spammy. So maybe it's fine?

None of this is actually touching on the real problem I ran into, which should maybe be its own separate issue, which is that the announcer bot sometimes fails to create, I suspect maybe because @lazy_object is not threadsafe so trying to do it twice in rapid succession causes disaster.

Failure to create the announcer bot seems to be a disaster since lots of stuff will then just never work, and we have no real way to recreate it.

gwillen commented 3 years ago

Oh, and the logs_to_discord stuff makes it really a lot worse when the announcer bot dies or can't be created, because right now we don't have any way to tell it's dead, and python log handling is synchronous, so anything that logs a WARNING or ERROR just gets kind of fucked. At least the first thing that does (if the bot fails to get started) just kind of hangs forever. (I have experimentally made it fail after a timeout instead but I have no idea what else this will break.)

gwillen commented 3 years ago

I think the fundamental problem with the announcer bot MAY be non-threadsafe behavior by lazy_object -- I think if you try to first-grab the object twice in rapid succession you get a disaster.

Zahariel commented 3 years ago

HOWEVER, our timeout in do_in_loop is 20, whereas our redis mutex has a 10s timeout. So in practice what happens seems to be this: the discord announcer bot loses its server connection / gets throttled / gets sad / ???. This hangs our thread for 20 seconds. Then it throws an exception. Because we hung for 20s, the mutex times out and is released.

The log_to_discord feature is actually provoking this a LOT more than it would normally be, because without that nothing in the gather() would actually be calling into the announcer bot and hitting that 20s non-async hang (because certainly the listener bot has no need to make a cross-thread call to the announcer bot to do things in Discord; it can just.... do them). Adding do_in_loop_nonblocking() should have helped, and indeed seemed to, but I think you're right that this is just an insane way to manage the whole thing.