Open lemonsaurus opened 5 years ago
I don't think we should implement some kind of advanced error detection mechanism here because the case that this happens will be exceptionally rare. I think the simplest approach would be reordering the current execution sequence for mod commands, such that the actual moderative action is performed before the infraction is created in the backend. From my own code (not Python, but you get the idea):
{:ok} <- Api.create_guild_ban(msg.guild_id, user_id, 7),
infraction <- %{
type: "ban",
guild_id: msg.guild_id,
user_id: user_id,
actor_id: msg.author.id,
reason: if(reason != "", do: reason, else: nil)
},
changeset <- Infraction.changeset(%Infraction{}, infraction),
{:ok, _created_infraction} <- Repo.insert(changeset) do
We could then handle errors caused by the site being down (connection refused etc.) in a command error handler for mod commands and simply output something saying that the command was executed but the site could not be contacted there.
The "deferring" part also just sounds like a load of work to me for something that is rather niche. How will you contact the site periodically for new infractions? How will you notify the user once the site is reachable again? What happens if the bot restarts? That said, from a moderation perspective that also just seems unacceptable à la "I got your command but I'm choosing to ignore it until the site blobbers back up". Or did you just mean the infractions themselves? In that case, we still have the issues mentioned above.
First of all, yes, we should reorder the execution sequence so that the moderative actions happen immediately even if no infraction can be created.
The other part is to try to deliver the infractions to the API later.I was thinking something along the lines of keeping the infractions stored somewhere very simple (in memory? sqllite? something like that.) and then scheduling a new attempt at delivering them to the API every 30 minutes or something. Whenever this fails, the bot would post some sort of mod-log letting us know that the API is still down, and the bot currently has n undelivered infractions.
If the bot restarts while the API is down, then tough titties, we'll lose the infractions. Best effort is good enough, given how edge casey this is already.
If the bot restarts while the API is down, then tough titties, we'll lose the infractions. Best effort is good enough, given how edge casey this is already.
Would only be the case if they will be stored in memory rather than on disk.
yeah, that's why I mentioned SQLite or some simple file format like that. we could very easily create a (persistant) docker volume that we could store the file on. but, to be clear, it's probably overkill to do so.
I don't know if this issue is still relevant, when was the last time the API went down without the entire server going down? Regardless, the best solution today would probably be something to do with redis, Maybe a redis queue that we try to keep empty and add to if API requests fail. I don't think bypassing the API and going straight for the DB is very wise.
What happens when Redis goes down? Well first contact DevOps because something has gone horribly wrong if both site and redis are down, then probably perform the action, and ask the user to repeat it later when the API is up.
The problem with asking the user to repeat it is that, well, you have to trust the user will repeat it. I think having the bot perform the action, at the risk of not adding it to the DB, is better than no action at all. If a user tries to ban someone, but the bot gives up completely, they'll just ban manually, and now we don't even have anything to ask them to repeat.
Interesting to see that this issue has been open since August 2019 haha
In terms of today's bot, the plans are to import the idea we had for rattlesnake. Yes, it will most likely make use of redis.
I don't know if this issue is still relevant, when was the last time the API went down without the entire server going down? Regardless, the best solution today would probably be something to do with redis, Maybe a redis queue that we try to keep empty and add to if API requests fail. I don't think bypassing the API and going straight for the DB is very wise.
What happens when Redis goes down? Well first contact DevOps because something has gone horribly wrong if both site and redis are down, then probably perform the action, and ask the user to repeat it later when the API is up.
The problem with asking the user to repeat it is that, well, you have to trust the user will repeat it. I think having the bot perform the action, at the risk of not adding it to the DB, is better than no action at all. If a user tries to ban someone, but the bot gives up completely, they'll just ban manually, and now we don't even have anything to ask them to repeat.
The cases that often happen are the site is down for postgres update, or something else similar.
In that case, I feel like the bot sending what infractions its taken would be the most reliable solution, to ensure users don't get perma banned for a tempban.
Usually those updates happen for short durations, usually during times that aren't very high activity.
I'm not sure if the solution you're suggesting is different from the one I am or not. I'm not sure what "send what infractions it's taken" means in this context.
Usually those updates happen for short durations, usually during times that aren't very high activity.
I'm not sure if the solution you're suggesting is different from the one I am or not. I'm not sure what "send what infractions it's taken" means in this context.
Ah, I was thinking about an edge case where the bot is rebooted while the database is unreachable, meaning that if a moderator used a tempban, the user would not have it scheduled anywhere to unban, for example.
That's what redis is for. We keep the task in redis till we get confirmation from site.
Currently, if the API goes down, moderation goes down.
In an ideal world, we'd like to allow moderation to continue even if the API is down. This could be achieved either by giving the bot direct access to our database, or by deferring the moderative actions until such a time as a connection can be re-established.
I believe the latter option may be the better option, because it would work even if the database was down, too. The database is hosted on the same server as the API and the bot (PyDis Alpha). If that server goes down, we plan to spin up another bot instance on PyDis Bravo. If this bot could simply send over all the moderation actions once the server comes back up, then we wouldn't need database replication to PyDis Bravo (as long as we have off-site backups).
Feel free to use this issue to discuss the best approach. I don't think this is urgent (because it's a fairly unlikely edge case and we can still ban people manually), but it's something to consider for the future.