jjj333-p / spam-police

A Matrix bot to monitor and respond to investment scam spam across the Matrix platform, for example in rooms with a permanently offline admin.
GNU Affero General Public License v3.0
21 stars, 8 forks

string distance / fuzzy matching instead of hard substring keyword searching #9

Open bkil opened 1 year ago

bkil commented 1 year ago

It could be worthwhile to also implement some simple edit-distance-based typo allowance, and fuzzy keyword matching could be set up as well. Also, if a message contains too many characters that do not belong to valid words of the sentence, that would be a red flag.
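A minimal sketch of what edit-distance keyword matching could look like, in Python for illustration only (not the bot's own code). The keyword list, the `max_ratio` threshold and the function names are all assumptions made up for this example; real values would need tuning against actual spam.

```python
import re

def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance, keeping only one row of the DP table at a time."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute (or match)
        prev = curr
    return prev[-1]

# Hypothetical keyword list, for illustration only.
KEYWORDS = ["investment", "bitcoin", "profit"]

def fuzzy_keyword_hits(message, keywords=KEYWORDS, max_ratio=0.25):
    """Return (word, keyword) pairs where the word is a near-miss of a keyword.

    The allowed number of edits scales with keyword length, so short
    keywords tolerate fewer typos than long ones.
    """
    words = re.findall(r"\w+", message.lower())
    hits = []
    for word in words:
        for kw in keywords:
            allowed = int(len(kw) * max_ratio)
            if edit_distance(word, kw) <= allowed:
                hits.append((word, kw))
    return hits
```

With these assumed settings, `fuzzy_keyword_hits("Great 1nvestment with b1tcoin pr0fit")` flags all three obfuscated words, which a plain substring search would miss. Scaling the allowance by keyword length is one possible design choice; a fixed maximum distance would also work but tends to over-match short keywords.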

In 99% of cases a room is limited to a single language, so posting spam in a foreign language is already a red flag. This matters in the dozens of local-language rooms that the indiscriminate English spammer sometimes joins as well. Dictionaries also exist (see your package manager, Wiktionary, Wikipedia, etc.). Or you could just go through the chat log to collect words and sentences used by non-troll members in the past (= ham) and use them to tell normal content apart from unusual content (spam).
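The chat-log idea could be sketched roughly as follows, again in Python and only as an illustration. It builds a vocabulary from past ham messages and then scores a new message by the share of its characters that do not belong to any known word, which also covers the "too many characters outside valid words" red flag. The function names, the `min_count` parameter and the sample messages are assumptions invented for this example.

```python
import re
from collections import Counter

def tokenize(text):
    """Split text into lowercase runs of letters (Unicode-aware)."""
    return re.findall(r"[^\W\d_]+", text.lower())

def build_vocabulary(ham_messages, min_count=1):
    """Collect words seen at least min_count times in known-good messages."""
    counts = Counter(w for msg in ham_messages for w in tokenize(msg))
    return {w for w, c in counts.items() if c >= min_count}

def unknown_char_ratio(message, vocabulary):
    """Fraction of non-whitespace characters not covered by a known word.

    0.0 means every character belongs to a word from the vocabulary;
    1.0 means nothing in the message was recognised.
    """
    text = message.lower()
    total = sum(1 for ch in text if not ch.isspace())
    if total == 0:
        return 0.0
    known = sum(len(w) for w in tokenize(text) if w in vocabulary)
    return 1.0 - known / total

# Made-up ham history for a hypothetical Spanish-language room.
ham = ["hola a todos", "como estas hoy", "hola como va todo"]
vocab = build_vocabulary(ham)
```

Here `unknown_char_ratio("hola como estas", vocab)` comes out as 0.0, while an English message such as `"earn profit now"` scores 1.0. Where to put the cut-off between ham and spam is left open; it would depend on how much history a room has and how many new words ordinary members introduce.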

jjj333-p commented 1 year ago

This is an interesting issue, but it is far beyond my skillset. I would however love to see something like this come through, and if someone else knows how to do this, I would love for them to contribute.

jjj333-p commented 3 months ago

Update: this might be doable in some manner, perhaps using string distance. Still on the back burner, but this might be the solution, or at least lead to something.