feature request: add multi-language antispam support

Lopa10ko commented 1 day ago

Problem

The detector does not identify the dominant language of a message. As a result, a spam message written in English is automatically converted to Cyrillic and is not flagged as spam.
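To make the failure mode concrete, here is a minimal sketch of what a Latin-to-Cyrillic look-alike substitution does to plain English text. The mapping table and the `normalize_to_cyrillic` helper are hypothetical and only illustrate the effect; the bot's actual conversion step may work differently.

```python
# Hypothetical Latin -> Cyrillic look-alike table (illustration only;
# the bot's real mapping may differ).
LATIN_TO_CYRILLIC = str.maketrans({
    "a": "а", "c": "с", "e": "е", "o": "о", "p": "р", "x": "х", "y": "у",
})


def normalize_to_cyrillic(text: str) -> str:
    """Replace Latin letters that have Cyrillic look-alikes."""
    return text.translate(LATIN_TO_CYRILLIC)


original = "free crypto bonus"
converted = normalize_to_cyrillic(original)

# The strings look the same on screen but are different code points,
# so a model trained on Russian text sees a mix of scripts, not English.
print(converted == original)
print([hex(ord(ch)) for ch in converted[:4]])
```

After such a substitution the English words are no longer valid English tokens, and they are not Russian words either, which could explain why the classifier misses them.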

[!NOTE] Homoglyph generation could be more informative if it relied on an existing library (e.g. https://github.com/life4/homoglyphs) and started by identifying the primary locale of the message.
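As a rough idea of what "identifying the primary locale" could look like before any homoglyph handling, here is a standard-library sketch that counts letters per script. The `dominant_script` function is a hypothetical helper, not part of the bot or of the homoglyphs package, and it only distinguishes scripts, not languages.

```python
import unicodedata
from collections import Counter


def dominant_script(text: str) -> str:
    """Return 'cyrillic', 'latin', 'other', or 'unknown' for a message.

    Hypothetical helper: classifies each letter by the prefix of its
    Unicode character name and returns the most frequent script.
    """
    counts = Counter()
    for ch in text:
        if not ch.isalpha():
            continue  # skip digits, punctuation, whitespace
        name = unicodedata.name(ch, "")
        if name.startswith("CYRILLIC"):
            counts["cyrillic"] += 1
        elif name.startswith("LATIN"):
            counts["latin"] += 1
        else:
            counts["other"] += 1
    if not counts:
        return "unknown"
    return counts.most_common(1)[0][0]


print(dominant_script("Hello guys, what is the minimum amount of data?"))
print(dominant_script("Раскройте секреты заработка миллионов!"))
```

With something like this in place, the Cyrillic conversion could be skipped for messages whose dominant script is Latin, although that decision is left to the maintainers.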

Reproduction

For example, the following test currently fails:

import pytest

from itmo_antispam_bot.rubert_bot import SpamDetector

@pytest.mark.parametrize('message, expected', [
    ('''Hello guys, Consider we a have time series with frequency of daily data.
    What is the minimum amount data required for fedot to forecast well?''', False),
    ('''Unlock the secrets to making millions with our exclusive Crypto Masterclass!
    Learn how to turn a small investment into life-changing wealth.
    Plus, get a FREE $500 bonus just for signing up today!
    Don't miss out on this limited-time opportunity—start your journey to financial freedom now!''', True),
    ('''Раскройте секреты заработка миллионов с нашим эксклюзивным курсом по криптовалюте!
    Узнайте, как превратить небольшие инвестиции в жизнеопределяющее богатство.
    А еще получите БОНУС $500 абсолютно бесплатно при регистрации сегодня!
    Не упустите шанс начать путь к финансовой свободе прямо сейчас!''', True)
])
def test_spam_classifier(message, expected):
    classifier = SpamDetector('NeuroSpaceX/ruSpamNS_v1')
    assert classifier.classify_message(message) == expected

jrzkaminski commented 1 day ago

The bot is built for the Russian language. However, I'll consider that improvement.

Lopa10ko commented 1 day ago

> The bot is built for the Russian language. However, I'll consider that improvement.

I was thinking about how spammers could use the "write spam messages in English in a Russian-speaking chat" strategy :)

jrzkaminski / itmo-os-antispam-bot

feature request: add multi-language antispam support #1
