umm-maybe / fediautomod

Content moderation support bot for the Fediverse (Mastodon etc.)
MIT License

Consider creating a reporting queue for manual review prior to sending off to server admins #7

Open · stedn opened this issue 1 year ago

stedn commented 1 year ago

https://github.com/umm-maybe/fediautomod/blob/7c6bbe3b42a7acc0b53e9d81baf3b5d459ae4d85/automod.py#L64

I would personally much prefer that this method include a manual filtering step for the user. My biggest fear is that the Detoxify model parameters could be chosen poorly, leading to over-reporting of language that really isn't abusive at all.

On the one hand, it may seem fine to submit a lot of over-cautious reports, but that adds work for potentially overburdened server admins who haven't consented to this (which could lead to bans). More importantly, it could end up targeting harassed users who are responding to intentional provocation. If a harassed person gets reported because they've been provoked, and the admin responds too quickly to the auto-report, this system could actually increase harm.

Just putting down more thoughts contrasting this system with what we're used to on Reddit: since this automod's design would be instigated by regular users rather than the official server mod team, it seems like an overstep to have it submit reports automatically. Comparing with my experience moderating a bit on Reddit, the automod should report up to some human user who is aware of it and consents to being responsible for responding to potential errors. On Reddit, the team I mod with now all know about our automod and understand what to expect.
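For illustration, here is a minimal sketch of that manual filtering step, assuming the bot scores posts with Detoxify the way automod.py does; the queue file name, threshold value, and helper are placeholders rather than anything in the repo today:

```python
# Hypothetical sketch: instead of calling mastodon.report() directly (as the
# linked automod.py line does), append flagged statuses to a local queue file
# for a human to review first. REVIEW_QUEUE and TOXICITY_THRESHOLD are
# illustrative names only.
import json
from detoxify import Detoxify

REVIEW_QUEUE = "review_queue.jsonl"
TOXICITY_THRESHOLD = 0.8  # deliberately conservative; very much a judgment call

model = Detoxify("original")

def enqueue_if_toxic(status, text):
    """Score plain text (HTML already stripped) and queue it instead of reporting it."""
    scores = model.predict(text)
    if scores["toxicity"] >= TOXICITY_THRESHOLD:
        with open(REVIEW_QUEUE, "a") as fh:
            fh.write(json.dumps({
                "status_id": str(status["id"]),
                "account": status["account"]["acct"],
                "toxicity": float(scores["toxicity"]),
                "text": text,
            }) + "\n")
```

A human would then decide which queued entries actually become reports.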

umm-maybe commented 1 year ago

I should clarify that my original thought was that this bot would most likely be run by an actual instance admin on their own local feed, so the reporting queue and interface you describe would be redundant with what Mastodon already provides, and admins supply the human layer that I agree is needed. But the bot doesn't do anything that requires admin rights, so it could in fact be run by anyone, and if it isn't being run by an instance admin, then what you propose would be better than automatically submitting reports, since Detoxify does get things wrong and the report threshold is a very subjective setting.

Actually, though, the prospect of regular users running different copies of the bot pointed at the same instance is worrisome, TBH, because even with human curation/filtering the admins could wind up receiving a lot of duplicate reports, and I don't know how well that would go. This is where I got the idea of the bot making a private status update in reply to any post it flags as toxic. Though not otherwise public, those updates would be visible to anyone following the bot, including other copies of the bot, so each copy could check whether a post has already been flagged (though how to get them all to follow each other is another matter).
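As a rough sketch of that private-flag idea (assuming Mastodon.py; the marker hashtag and helper names are made up for illustration):

```python
# Sketch of the "private flag post" idea using Mastodon.py. With
# visibility="private", only accounts following the bot see the flag, and
# other bot copies following it can check for an existing flag before posting.
from mastodon import Mastodon

BOT_FLAG_MARKER = "#fediautomod-flag"  # hypothetical marker

def already_flagged(api: Mastodon, status_id) -> bool:
    """Look through replies to the status for a flag left by any copy of the bot."""
    context = api.status_context(status_id)
    return any(BOT_FLAG_MARKER in reply["content"] for reply in context["descendants"])

def flag_privately(api: Mastodon, status):
    if already_flagged(api, status["id"]):
        return  # another copy of the bot got there first
    api.status_post(
        f"{BOT_FLAG_MARKER} possible toxic post by @{status['account']['acct']}",
        in_reply_to_id=status["id"],
        visibility="private",
    )
```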

That being said, I can see how, with your manual review queue, the prospect of many people evaluating whether a post is really toxic could be a feature rather than a bug, in the sense that it crowd-sources that judgment call (so long as the people voting are doing so in good faith and with an informed understanding of how to respond sensitively to such situations). What if, instead of a dashboard/queue for each bot operator, there were some sort of shared database to which potential reports were written (no more than once, i.e. each bot would check for duplicate records), and each flagged status had to be reviewed and acknowledged as toxic by at least N humans before being reported to mods?
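A minimal sketch of that dedupe-plus-votes flow, using SQLite as a stand-in for whatever shared store this would really need; the table, columns, and N value are illustrative, not part of fediautomod today:

```python
# Sketch of the shared-database idea: one row per flagged post (dedupe), plus
# a counter of human acknowledgements before anything is actually reported.
import sqlite3

MIN_HUMAN_ACKS = 3  # the "at least N humans" threshold; purely illustrative

def init_db(path="shared_flags.db"):
    db = sqlite3.connect(path)
    db.execute("""CREATE TABLE IF NOT EXISTS flags (
        status_id TEXT PRIMARY KEY,   -- one row per flagged post
        toxicity  REAL,
        acks      INTEGER DEFAULT 0,  -- human confirmations so far
        reported  INTEGER DEFAULT 0   -- set once a report has gone out
    )""")
    return db

def record_flag(db, status_id, toxicity):
    # INSERT OR IGNORE makes duplicate flags from other bot copies a no-op.
    db.execute("INSERT OR IGNORE INTO flags (status_id, toxicity) VALUES (?, ?)",
               (status_id, toxicity))
    db.commit()

def acknowledge(db, status_id):
    """A human reviewer confirms the post is toxic; returns True once N agree."""
    db.execute("UPDATE flags SET acks = acks + 1 WHERE status_id = ?", (status_id,))
    db.commit()
    row = db.execute("SELECT acks, reported FROM flags WHERE status_id = ?",
                     (status_id,)).fetchone()
    return row is not None and row[0] >= MIN_HUMAN_ACKS and row[1] == 0
```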

So, in the "regular user" case, this could quickly evolve from an automod-like tool for admins to use in making their moderation job easier, to a network of bots tied together with a network of humans, perhaps along with some verification/vetting/training steps humans must take to log in and participate... more complicated but perhaps more transformative, if it can be made to work.

stedn commented 1 year ago

My thought would be that if this is meant to be used by server admins, I'd suggest building it as an actual plugin on the server side. That would both make it something that admins control and allow it to prescreen content before it even shows up in someone's feed and causes harm. (This might be the best idea, really. I think Twitter was already doing something like that; I seem to remember a button at the bottom of reply threads that said something like "show replies that contain offensive content.")

> some sort of shared database

Yes, I was thinking of a shared queue, but I never thought of requiring multiple votes. I'm still sort of leaning toward the idea in #4, where one person volunteers to take an item from the queue.
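For comparison, the #4-style flow might just be a claim step on top of the flags table sketched above, assuming it also had a claimed_by column (again, purely illustrative):

```python
# Sketch of "one volunteer takes one item": hand the oldest unclaimed flag to
# a reviewer. Assumes the flags table above also has a claimed_by TEXT column.
import sqlite3

def claim_next(db: sqlite3.Connection, reviewer: str):
    row = db.execute(
        "SELECT status_id FROM flags WHERE claimed_by IS NULL AND reported = 0 "
        "ORDER BY rowid LIMIT 1").fetchone()
    if row is None:
        return None  # nothing waiting for review
    db.execute("UPDATE flags SET claimed_by = ? WHERE status_id = ? "
               "AND claimed_by IS NULL", (reviewer, row[0]))
    db.commit()
    return row[0]
```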

> so long as the people voting are doing so in good faith and with an informed understanding of how to respond sensitively to such situations

The more I think about it, the more it feels like this is 97% of the problem and the tech stack is 3%. The automation could be abused so quickly that, if this became a common approach, it would take no time before someone used the same techniques to automate harm and/or impersonate the bots to drown out any signal. That being said, I can't figure out why Twitter wouldn't hire someone to code this up right now to bring down Mastodon. Maybe they have?

umm-maybe commented 1 year ago

Re: the Mastodon server extension... the idea seems solid, but I have no idea how to code one. Searching for examples led me to this blog post, which shares a whole bunch of relevant ideas, although it's older (pre-Detoxify). Maybe it's worth getting in touch with the author to see if they're still interested?

https://dustycloud.org/blog/possible-distributed-anti-abuse/

I should also mention that I've found that it's not possible to watch a server unless you have an account and access key on it; i.e., I can't directly monitor mastodon.social from sigmoid.social, although I can forward reports to mastodon.social admins if a post from there shows up in the sigmoid.social federated feed.
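In Mastodon.py terms, a rough sketch of that setup: watch the local server's federated timeline and forward any report to the remote admins (the instance names and token file are placeholders, and I'm assuming the installed Mastodon.py version exposes the API's forward flag on report()):

```python
# Sketch: the bot authenticates against its home instance (sigmoid.social here,
# purely as an example) and watches the federated timeline seen from there,
# since it cannot read mastodon.social directly without an account there.
from mastodon import Mastodon

api = Mastodon(access_token="bot_token.secret",        # placeholder token file
               api_base_url="https://sigmoid.social")  # placeholder instance

def scan_federated_feed(handle_status):
    for status in api.timeline_public(limit=40):  # federated feed of the home instance
        handle_status(status)

def report_and_forward(status, reason):
    api.report(
        status["account"]["id"],
        status_ids=[status["id"]],
        comment=reason,
        forward=True,  # ask the home instance to pass the report to the remote admins
    )
```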

Finally, what about a daily email digest to the moderator or a mailing list of helpers? This seems like a simple, if old-school, approach that is easier to protect against users acting in bad faith. I agree 100% about the human element outweighing code here; while coding a simple script to flag toots seems well within my reach, the organizing aspect is much more daunting.
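A minimal sketch of the digest idea, reading from the same kind of queue file as above and assuming a local mail server; the addresses and file name are placeholders:

```python
# Sketch: collect the queued flags once a day (e.g. from cron) and mail them
# to a moderator address or helpers' mailing list.
import json
import smtplib
from email.message import EmailMessage

REVIEW_QUEUE = "review_queue.jsonl"          # placeholder queue file
DIGEST_TO = "moderators@example.social"      # placeholder recipient

def send_daily_digest():
    with open(REVIEW_QUEUE) as fh:
        entries = [json.loads(line) for line in fh if line.strip()]
    if not entries:
        return  # nothing was flagged
    body = "\n\n".join(
        f"{e['account']} (toxicity {e['toxicity']:.2f}):\n{e['text']}"
        for e in entries
    )
    msg = EmailMessage()
    msg["Subject"] = f"fediautomod digest: {len(entries)} flagged posts"
    msg["From"] = "fediautomod@example.social"
    msg["To"] = DIGEST_TO
    msg.set_content(body)
    with smtplib.SMTP("localhost") as smtp:  # assumes a local MTA is available
        smtp.send_message(msg)
```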