umputun / tg-spam

Anti-Spam bot for Telegram
https://tg-spam.umputun.dev
MIT License
165 stars 29 forks source link

Error processing forwarded messages #107

Open coperius opened 1 month ago

coperius commented 1 month ago
  1. Messages are stored using a hash of their content.
  2. This causes overwriting when different messages have the same content hash.
  3. When forwarding a message that has content duplicates:
    • The system processes the most recent message with that content.
    • This leads to deletion and banning of the sender of the most recent message.
    • The original forwarded message is not processed as intended.
  4. This creates issues with subsequent forwarding of messages.

This seems to be a significant problem in the message handling system. It could lead to incorrect message attribution, unintended bans, and loss of original message context.

Version: master-68dcc02-20240805T10:42:41 paranoid mode on

How to reproduce the problem

  1. Send message from first user

    Screenshot 2024-08-05 at 18 33 03
  2. Send message from second user

    Screenshot 2024-08-05 at 18 33 30

    the record in the database is overwritten Messages in chat

    Screenshot 2024-08-05 at 18 34 00
  3. Forward first message to admin chat

    Screenshot 2024-08-05 at 18 37 36

    But in the main chat, the second message was deleted and the user Coperius was banned

    Screenshot 2024-08-05 at 18 37 17
  4. Try to forward the first message again

    Screenshot 2024-08-05 at 18 38 27

And it is not processed

Is this functionality needed at all if it's fully implemented through reply message with a spam tag?

umputun commented 1 month ago

I understand your point, but I'm not entirely certain about the significance of the problem. If two identical spam messages are not detected and are forwarded as spam by the admin, we won't be able to process the first one. However, in practice, I have not encountered a situation like this yet.

As you may already know, hash matching is a component of the locator. It involves records added by TelegramListener for both spam and non-spam messages, which are used by the admin to locate matched messages as part of MsgHandler (for handling admin forwards) in order to find the original message. To the best of my recollection, this is the only purpose hash serves, unless I am overlooking something.

If this scenario occurs, the issue primarily impacts an uncommon and somewhat unnecessary feature related to forwarding support for admin bans. In the past, this method was the sole means to manually ban undetected spam. However, nowadays, responding with /spam accomplishes this task more efficiently and directly. Therefore, the hash lookup doesn't concern me greatly, as the potential problem is relatively minor. Nevertheless, another more significant consequence of using the hash as a primary key is the possibility of generating incorrect results for UserNameByID and UserIDByName. Although I haven't investigated the implications of this mismatch, it doesn't instill confidence in me.

One potential solution could involve changing the primary key to UUID, which would help prevent unintended overwrites. The Message method still poses challenges, as it may restrict us to processing only one message. Yet, as mentioned earlier, this aspect doesn't trouble me too much, and I can accept that admin forwards will ban the most recent matched spam instead of the initial one. However, if you have a better suggestion, please share it. Additionally, if you believe the fix I outlined would be effective, feel free to submit a pull request with the proposed solution.

coperius commented 1 month ago

The pair chatID/MessageID is a unique key within a chat. We can use it. And if we're not concerned about deleting the wrong spammer, then by using this pair we can delete all spam messages with the text from the forwarded message and ban all spammers who sent them

umputun commented 1 month ago

The reason why we have a hash of the message in the first place is because for forwarded messages, I have not found any way to retrieve the MessageID of the original message. I think this is intentional in Telegram for privacy reasons.

coperius commented 1 month ago

We will use the chatID/MessageID pair only for message uniqueness, to avoid introducing a UUID. It's already present in the table and will suit us just fine. We'll continue to search for messages to delete by hash. But now instead of one, we'll get a list and process them all (delete and ban the senders).