mkdryden / telegram-stats-bot

A simple bot that lives in your Telegram group, logging messages to a Postgresql database and serving statistical tables and plots to users as Telegram messages.
GNU General Public License v3.0

Support for non-ASCII names #31

Open Rafagd opened 8 months ago

Rafagd commented 8 months ago

Hi, I have started using this bot in a personal group chat with friends, and we've noticed it completely mangles the name of one of them. His name happens to contain a ç, and that character was removed entirely from his logged entry.

I have investigated the code and I've stumbled upon the following line:

df['User'] = df['User'].str.replace(r'[^\x00-\x7F]|[@]', "", regex=True)  # Drop emoji and @

The comment says it drops emoji and the @ symbol. I'm not sure why that's even necessary, but the pattern does far more than drop emoji: it drops every character outside the ASCII range. That means no Latin-alphabet extensions like é ü ø æ, and no support at all for non-Latin scripts such as Cyrillic, Greek, Arabic, Chinese, etc.
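To make the effect concrete, here is a minimal sketch that applies the same pattern to a few made-up names. It uses the standard `re` module rather than pandas so it runs without extra dependencies; the helper name `strip_like_bot` and the sample names are hypothetical and not part of the bot's code.

```python
import re

# Same pattern as the quoted line, compiled with the standard library.
# Assumption: pandas' str.replace(..., regex=True) treats it the same way.
NON_ASCII_OR_AT = re.compile(r'[^\x00-\x7F]|[@]')


def strip_like_bot(name: str) -> str:
    """Remove every non-ASCII character and every '@' from name."""
    return NON_ASCII_OR_AT.sub("", name)


# Hypothetical sample names, chosen to cover the cases discussed above.
samples = ["François", "@alice", "Ömer", "Иван", "Bob 🎉"]
for name in samples:
    print(repr(name), "->", repr(strip_like_bot(name)))
```

With these inputs, the ç and Ö disappear from the middle or start of the names, and the Cyrillic name is reduced to an empty string, which matches the behaviour described in this report.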

Is there a particular reason for this line to exist? Python and Postgres should both handle UTF-8 just fine...

mkdryden commented 8 months ago

This was causing some kind of issue with the output text, I remember, but not exactly what it was. There have been some changes to pandas and telegram since, so it may be fine now, or I may have intended to only filter emoji or something. When I have a chance, I will check the behaviour, but if you're in a rush, it should be safe to remove that line, though the output might be broken.
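If the intent was only to filter emoji and @, one possible narrower filter is sketched below. This is an assumption about what such a filter could look like, not the bot's actual code: the name `strip_emoji_and_at` is hypothetical, and the listed Unicode ranges are an approximation that covers common emoji but is not a complete emoji definition.

```python
import re

# Approximate emoji ranges plus '@'. Not exhaustive: a sketch only.
EMOJI_OR_AT = re.compile(
    "["
    "\U0001F1E6-\U0001F1FF"  # regional indicator symbols (used for flags)
    "\U0001F300-\U0001FAFF"  # pictographs, emoticons, transport and related blocks
    "\u2600-\u27BF"          # miscellaneous symbols and dingbats
    "\uFE0F"                 # variation selector-16 (emoji presentation)
    "\u200D"                 # zero-width joiner; caution: also used by some scripts
    "@"
    "]"
)


def strip_emoji_and_at(name: str) -> str:
    """Drop approximate emoji and '@', keep all other Unicode text."""
    # strip() trims whitespace left behind by a removed leading/trailing emoji.
    return EMOJI_OR_AT.sub("", name).strip()


for name in ["François", "@alice", "Иван", "李雷", "Bob 🎉"]:
    print(repr(name), "->", repr(strip_emoji_and_at(name)))
```

Accented Latin, Cyrillic, and Chinese names pass through unchanged here, while the emoji and the @ are removed. A compiled pattern like this could presumably be passed to `Series.str.replace` with `regex=True` in place of the ASCII-only one, but that has not been tested against the bot's output.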

Rafagd commented 8 months ago

I have already commented it out in our clone of the repo, and it seems to work fine so far. I haven't tested the emoji case, though.
