Add full UTF-8 support for cooler & more internationally-inclusive user/clan naming 💝

ciraben commented 1 year ago

    # TODO: clan name of 🐱 breaks due to incompatible unicode collation or something

TNNT's MySQL database currently encodes usernames & clan names in the outdated utf8mb3 format, limiting names to UTF characters of 3 bytes or less. This excludes emoji (😢), as well as a few Indian & indigenous alphabets and some other fun stuff like math.

The default MySQL charset is utf8, which actually isn't UTF-8. MySQL still aliases utf8mb3 as utf8 (quite the misnomer!) and currently recommends manually switching over to utf8mb4, while they toy 🧸 with updating their utf8 alias & defaults to utf8mb4 (source).

While the TNNT MySQL database only supports utfmb3, the TNNT backend is more than 😋 happy to accept 4-🧛🏻‍♀️ byte UTF-8 values from users via form 📝 fields and feed them 🩸 directly as MySQL q❔ries, leading to 🍉 juicy & convoluted 🥨 errors:

OperationalError at /clanmgmt
(1267, "Illegal mix of collations (utf8mb4_general_ci,IMPLICIT) and (utf8mb3_general_ci,COERCIBLE) for operation '='")

tl;dr - update backend MySQL database to utf8mb4 so i can have rly 🌈 clan name, pretty pretty plz 🌸

k21971 commented 1 year ago

Hi! Emoji use was disabled on purpose due to various errors cropping up with displaying them on the front-end when the new python-based backend was being developed. We could maybe take another look at this, but not for the 2023 tournament. Not keen on changing the dbase structure with less than 7 hours to go before start time.

Thanks for the detailed report, very helpful.

ciraben commented 1 year ago

heck ;-;

entrez commented 1 year ago

This really is very helpful, thank you.

Previously, the db was apparently using utf8mb4 and people were still getting errors like this when they tried using an emoji as a clan name:

Trying to create a clan with an emoji name broke stuff, I tried '🐱' (1267, "Illegal mix of collations (utf8mb4_general_ci,IMPLICIT) and (utf8_general_ci,COERCIBLE) for operation '='")

The db encoding was changed to utf8 as part of trying to debug that, I think. At the same time, there had also been some abuse of unicode characters, e.g. to create zalgo text clan names that filled the screen, and the change to utf8 prevented that, so I guess it stayed that way.

Like @K21971 said, we will have to revisit this after this year's tournament, since it would involve messing with the database and then also figuring out a way to properly block the sort of spam/abuse that was seen previously, but in the meantime, do you have any insight into what might have been happening earlier? Maybe when the database was set to use utf8mb4, an existing table wasn't altered and was still on utf8mb3, so some equality comparison wasn't working? I don't have much SQL experience, so if that error rings any bells it would probably help us figure it out during the inter-tournament period.

ciraben commented 1 year ago

Happy to share what I learned! I'm new to MySQL too - just saw an interesting error & followed it down the rabbit-hole. (And as much as it 💔s me, I totally get not fiddling today haha. You can probably guess why I noticed this now and not a month ago😅)

Anyway, if I'm not mistaken, switching from utf8mb4 to utf8mb3 (aka utf8) doesn't actually prevent Zalgo text. Zalgo wins by appending diacritics (combining characters) to a normal char. Each diacritic is stored as its own "code point", and most of these are individually utf8mb3 compatible. (utf8mb3 includes all BMP chars).

By my reading, TNNT's current anti-Zalgo fix is actually database-independent. Instead, the text_field_clean function in forms.py (here) sanitizes inputs before even making a database call. To prevent Zalgo fun, it tests for characters with more than one diacritic applied. Then it nicely tells the end-user the form can't be submitted and why.

We could do this for emoji too, but currently, emoji aren't sanitized away. Form submissions are accepted, and then break during the uniqueness check instead, with a crazy-ugly error page for the end-user.

So if the intention is to reject submissions containing emoji, we should add a check to text_field_clean along with a nice forms.ValidationError message.

Or easier - if you just want to suppress the nasty error page, I think you can just set DEBUG = False in settings.py here.

entrez commented 1 year ago

Ah, great catch -- the whole situation predates my involvement so I was just going by some internal discussions and older IRC log snippets. I guess the change back to utf8 was just part of an attempt to debug the emoji issue, then, and it just never got changed back (maybe because nobody could figure out what caused the original errors so didn't think it made a big difference, or maybe because it was forgotten about -- that change happened at a really work-intensive part of launching the new backend, I think, so it may have just slipped by the wayside since nobody could think of an easy fix).

We will definitely circle back on this after this year's tournament and try to track down the root cause of the original errors, and in the meantime we'll discuss whether it's feasible to deploy a band-aid to prevent the verbose error messages now even though the tournament has already begun. That may have to wait until later, too, but we'll talk it over. Thanks for bringing this to our attention and your help with it! Sorry that you're still going to be deprived of emoji clan names this year.

entrez commented 1 year ago

We now have a temporary fix (based on your suggestion) that should prevent the huge backtraces and actually inform people about what is being disallowed. And hopefully it won't be too bad figuring out those "illegal mix of collocations" errors when we try actually changing the db to utf8mb4 later. We'll come back to that at some point after the tournament. Thanks again for the help you've given us with this already!

ciraben commented 1 year ago

Looks good!

tnnt-devteam / python-backend

Add full UTF-8 support for cooler & more internationally-inclusive user/clan naming 💝 #25