Closed dsent closed 1 month ago
What is your OpenAI-related configuration for tg-spam? Actually, it would help to see the whole configuration.
I have checked the code and don't really see how this is possible unless you run the system in dry or training mode. As long as the OpenAI check is invoked and spam is detected, it should treat the overall result as spam. There are multiple tests covering all the cases with OpenAI detection, and I can't see anything wrong here.
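To make the expected behaviour concrete, here is a minimal Go sketch of "any positive check makes the overall result spam". It is an illustration only, not tg-spam's actual code: the `CheckResult` type and `overallSpam` function are hypothetical names, with fields chosen to mirror the name/spam/details triples that appear in the debug log quoted later in this thread.

```go
package main

import "fmt"

// CheckResult is a hypothetical type for this sketch; its fields mirror the
// name/spam/details triples printed in the bot's debug log.
type CheckResult struct {
	Name    string
	Spam    bool
	Details string
}

// overallSpam reports whether at least one check flagged the message.
func overallSpam(results []CheckResult) bool {
	for _, r := range results {
		if r.Spam {
			return true
		}
	}
	return false
}

func main() {
	results := []CheckResult{
		{Name: "stopword", Spam: false, Details: "not found"},
		{Name: "classifier", Spam: false, Details: "probability of ham: 90.98%"},
		{Name: "openai", Spam: true, Details: "confidence: 95%"},
	}
	fmt.Println(overallSpam(results))
}
```

Under this assumption, a single positive OpenAI result is enough to flag the message, which is why a ham verdict suggests the OpenAI result was never part of the list.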
The system was in the training mode. Still, the spam messages should've been marked, right? Without actually banning the users who posted them.
The full config (docker-compose.yml) was:
version: '3.8'
services:
  tg-spam:
    image: umputun/tg-spam:latest
    hostname: tg-spam
    restart: always
    container_name: tg-spam
    user: <redacted>
    environment:
      - TZ=Europe/Belgrade
      - TELEGRAM_TOKEN=<redacted>
      - TELEGRAM_GROUP=<redacted>
      - ADMIN_GROUP=<redacted>
      - LOGGER_ENABLED=true
      - LOGGER_FILE=/srv/log/tg-spam.log
      - LOGGER_MAX_SIZE=5M
      - FILES_DYNAMIC=/srv/var
      - NO_SPAM_REPLY=true
      - OPENAI_TOKEN=<redacted>
      - OPENAI_VETO=true
      - OPENAI_MODEL=gpt-4o
      - MAX_EMOJI=-1
      - MESSAGE_WARN=""
      - TRAINING=true
      - HISTORY_DURATION=24h
      - HISTORY_MIN_SIZE=5000
    volumes:
      - ./log:/srv/log
      - ./var:/srv/var
      - ./data:/srv/data
    command: --super=<redacted> --super=<redacted>
Now I changed it from training mode to production mode with softbans and OpenAI veto disabled (because of #138). Will probably see if it worked in a few hours when all the spam bots are back to work 😄
There was one more thing that shouldn't have affected the outcome in any way, but still. We started to get those missed spam messages (I mean, spam messages not being classified as spam) after a couple more admins had joined the admin group. In fact, we haven't received a single message initiated by the bot since that happened, despite the bot's docker container being restarted a couple of times. I've removed the bot from the admin group and added it again, just out of superstition. Probably just a coincidence, but I'm mentioning it here for the sake of completeness.
Another thing worth mentioning is how I found out about OpenAI's classification: 1) A spam message is posted and is not detected as spam. 2) I mark the message as spam by replying "spam" to it in the main group. 3) I then receive a message from the bot in the admin group that says "openai result is positive".
So there's no evidence that the OpenAI check was even performed when the message first reached the bot. For all I know (without understanding much of the code), the check could've happened only after I marked the message as spam manually. Or at some point in between, but after the bot had reached its initial "ham" conclusion.
The full message was this (after replying "spam" to the spammer):
original detection results for (6176285046)
👑Сᴧиᴛыᴇ ᴏбнᴀжᴇнᴏчᴋи дᴇʙуɯᴇᴋ, ᴨᴩᴏбᴇй ᴧюбую ɯᴋуᴩу ᴄʙᴏᴇᴦᴏ ᴦᴏᴩᴏдᴀ.
- stopword: ham, not found
- cas: ham, record not found
- similarity: ham, 0.00/0.50
- classifier: ham, probability of ham: 90.98%
- openai: spam, Message contains solicitation for adult content and services, often associated with spam, confidence: 95%
the user banned by "dsent_zen" and message deleted
I assumed that "original detection results for" are the cached results from the first, automated run of the detection (in that case they are strange as the message would've been detected because of the openai's positive result). But are they really?
> The system was in the training mode. Still, the spam messages should've been marked, right? Without actually banning the users who posted them.
According to the docs: "--training - if set, the bot will not ban users and delete messages but will learn from them. This is useful for training purposes.". Detected spam messages should still be forwarded to your admin group. I'm not sure what you meant by "marked," but the bot won't remove the message by itself in this mode.
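As a rough illustration of the training-mode behaviour described above, the following Go sketch maps a verdict and the training flag to actions. The `Actions` type and `decide` function are hypothetical and written only from the quoted docs text; they are not taken from tg-spam's code.

```go
package main

import "fmt"

// Actions is a hypothetical summary of what the bot would do with a message.
type Actions struct {
	ForwardToAdmins bool
	BanUser         bool
	DeleteMessage   bool
}

// decide follows the quoted --training description: detected spam is always
// reported to the admin group, but banning and deletion are skipped in training mode.
func decide(isSpam, training bool) Actions {
	if !isSpam {
		return Actions{}
	}
	return Actions{
		ForwardToAdmins: true,
		BanUser:         !training,
		DeleteMessage:   !training,
	}
}

func main() {
	fmt.Printf("training:   %+v\n", decide(true, true))
	fmt.Printf("production: %+v\n", decide(true, false))
}
```

In this reading, training mode only changes what happens after a spam verdict; a message judged ham produces no actions in either mode.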
> I assumed that "original detection results for" are the cached results from the first, automated run of the detection (in that case they are strange as the message would've been detected because of the openai's positive result). But are they really?
No, this is not cached in any way. The moment you send the /spam reply, the full spam detection is invoked and the result is posted to the admin group. It also adds the user to the list of spammers and stores the message and the results of the checks in the internal db.
You can try setting DEBUG=true in the compose environment and check the container's log at the moment the missing spam occurred. This may give us some clues.
So the problem is not that messages are marked as spam by the OpenAI check but then ignored, as I thought initially. The likely problem is that the OpenAI check wasn't invoked at all, or returned an error, so the message was marked as ham. And when I mark the message as spam manually, it goes through all the checks again, and this time OpenAI's classification gets invoked properly (but doesn't affect anything at this point). I'll enable debug and see if it happens again.
Yeah, this could be correct. Retrying the OpenAI request may help to minimize the issue: https://github.com/umputun/tg-spam/pull/140
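For context, the general idea of retrying a flaky remote call can be sketched in Go as below. This is a generic illustration with exponential backoff, not the code from the linked pull request; `retry`, `errTemporary` and `flakyCheck` are hypothetical names, and `flakyCheck` merely stands in for a remote classification request.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// errTemporary is a placeholder error used by the simulated call below.
var errTemporary = errors.New("temporary failure")

// retry calls fn up to attempts times (at least once), doubling the delay
// after each failure. It returns nil on the first success, otherwise the last error.
func retry(attempts int, delay time.Duration, fn func() error) error {
	if attempts < 1 {
		attempts = 1
	}
	var err error
	for i := 0; i < attempts; i++ {
		if err = fn(); err == nil {
			return nil
		}
		if i < attempts-1 {
			time.Sleep(delay)
			delay *= 2
		}
	}
	return err
}

func main() {
	calls := 0
	// flakyCheck fails twice and then succeeds, simulating a transient API error.
	flakyCheck := func() error {
		calls++
		if calls < 3 {
			return errTemporary
		}
		return nil
	}
	err := retry(5, 10*time.Millisecond, flakyCheck)
	fmt.Println("error:", err, "calls:", calls)
}
```

If every attempt fails, the caller still gets an error back, so a failed check should be logged rather than silently treated as ham.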
@umputun By the way, I can't find the results of the OpenAI check for messages not detected as spam. Here is the entirety of what I see in the logs for a message deemed ham:
2024/10/26 15:04:01,stdout,"2024/10/26 15:04:01.706 [DEBUG] {bot/spam.go:104 bot.(*SpamFilter).OnMessage} user __ is not a spammer, {name: stopword, spam: false, details: not found}, {name: cas, spam: false, details: record not found}, {name: similarity, spam: false, details: 0.12/0.50}, {name: classifier, spam: false, details: probability of ham: 99.44%}
OpenAI should have been invoked, but the results are not logged. Should I open a separate issue for this?
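One way to scan a larger log for this pattern is to extract the check names from each debug line and see whether `openai` is among them. The Go sketch below assumes the `{name: ..., spam: ..., details: ...}` format shown in the line above; `checksIn` and `hasCheck` are hypothetical helpers, not part of tg-spam.

```go
package main

import (
	"fmt"
	"regexp"
)

// checkName matches the "{name: <check>," prefix of each result in a debug line.
var checkName = regexp.MustCompile(`\{name: ([a-z]+),`)

// checksIn returns the names of the checks reported in one debug log line.
func checksIn(line string) []string {
	var names []string
	for _, m := range checkName.FindAllStringSubmatch(line, -1) {
		names = append(names, m[1])
	}
	return names
}

// hasCheck reports whether the line includes a result for the named check.
func hasCheck(line, name string) bool {
	for _, n := range checksIn(line) {
		if n == name {
			return true
		}
	}
	return false
}

func main() {
	line := `user __ is not a spammer, {name: stopword, spam: false, details: not found}, ` +
		`{name: cas, spam: false, details: record not found}, ` +
		`{name: similarity, spam: false, details: 0.12/0.50}, ` +
		`{name: classifier, spam: false, details: probability of ham: 99.44%}`
	fmt.Println(checksIn(line))
	fmt.Println(hasCheck(line, "openai"))
}
```

For the line quoted above, this lists stopword, cas, similarity and classifier, and reports that no openai result is present.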
I'm seeing a consistent pattern of messages that the OpenAI check classifies as spam not being flagged automatically. After marking the message as spam manually, I see things like this:
So it should've been marked by the bot automatically, but it wasn't. I'm not sure how to debug this further.