Open ciraben opened 1 year ago
Hi! Emoji use was disabled on purpose due to various errors cropping up with displaying them on the front-end when the new python-based backend was being developed. We could maybe take another look at this, but not for the 2023 tournament. Not keen on changing the dbase structure with less than 7 hours to go before start time.
Thanks for the detailed report, very helpful.
heck ;-;
This really is very helpful, thank you.
Previously, the db was apparently using utf8mb4 and people were still getting errors like this when they tried using an emoji as a clan name:
Trying to create a clan with an emoji name broke stuff, I tried 'π±' (1267, "Illegal mix of collations (utf8mb4_general_ci,IMPLICIT) and (utf8_general_ci,COERCIBLE) for operation '='")
The db encoding was changed to utf8 as part of trying to debug that, I think. At the same time, there had also been some abuse of unicode characters, e.g. to create zalgo text clan names that filled the screen, and the change to utf8 prevented that, so I guess it stayed that way.
Like @K21971 said, we will have to revisit this after this year's tournament, since it would involve messing with the database and then also figuring out a way to properly block the sort of spam/abuse that was seen previously, but in the meantime, do you have any insight into what might have been happening earlier? Maybe when the database was set to use utf8mb4, an existing table wasn't altered and was still on utf8mb3, so some equality comparison wasn't working? I don't have much SQL experience, so if that error rings any bells it would probably help us figure it out during the inter-tournament period.
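One way to test the "leftover utf8mb3 table" theory is to list the character set of every text column and flag any that differ from the expected one. The sketch below assumes the rows come from MySQL's `information_schema.COLUMNS` view; the table names in the sample data are made up, and the `%s` placeholder assumes a MySQLdb/PyMySQL-style driver. It may also be worth checking the connection charset: MySQL reports string literals as COERCIBLE, so the `utf8_general_ci` side of the error could be the value sent by the client rather than a stale table.

```python
# Query listing the charset and collation of every text column in one schema.
# information_schema.COLUMNS and these column names are standard MySQL.
COLUMN_CHARSET_QUERY = """
SELECT TABLE_NAME, COLUMN_NAME, CHARACTER_SET_NAME, COLLATION_NAME
FROM information_schema.COLUMNS
WHERE TABLE_SCHEMA = %s AND CHARACTER_SET_NAME IS NOT NULL
"""


def mismatched_columns(rows, expected_charset="utf8mb4"):
    """Return the rows whose character set differs from the expected one."""
    return [row for row in rows if row[2] != expected_charset]


# Made-up sample rows standing in for the query result (hypothetical tables).
rows = [
    ("clan", "name", "utf8mb4", "utf8mb4_general_ci"),
    ("player", "name", "utf8", "utf8_general_ci"),
]

for table, column, charset, collation in mismatched_columns(rows):
    print(f"{table}.{column} is still {charset} ({collation})")
```

Any column this reports could then be converted with `ALTER TABLE ... CONVERT TO CHARACTER SET utf8mb4`, ideally on a copy of the database first.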
Happy to share what I learned! I'm new to MySQL too - just saw an interesting error & followed it down the rabbit-hole. (And as much as it πs me, I totally get not fiddling today haha. You can probably guess why I noticed this now and not a month agoπ )
Anyway, if I'm not mistaken, switching from utf8mb4 to utf8mb3 (aka utf8) doesn't actually prevent Zalgo text. Zalgo wins by appending diacritics (combining characters) to a normal char. Each diacritic is stored as its own "code point", and most of these are individually utf8mb3 compatible. (utf8mb3 includes all BMP chars).
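A quick way to see this is to compare UTF-8 byte lengths per code point. In this minimal sketch (the helper name `fits_utf8mb3` is made up for illustration), a Zalgo-style string built from combining marks stays within 3 bytes per code point, while a single emoji needs 4:

```python
def fits_utf8mb3(text: str) -> bool:
    """True if every code point encodes to at most 3 UTF-8 bytes (i.e. is in the BMP)."""
    return all(len(ch.encode("utf-8")) <= 3 for ch in text)


zalgo = "a" + "\u0301\u0302\u0303\u0304"  # 'a' with four combining marks stacked on it
emoji = "\U0001F9F8"  # teddy bear, a code point outside the BMP

for label, sample in (("zalgo", zalgo), ("emoji", emoji)):
    sizes = [len(ch.encode("utf-8")) for ch in sample]
    print(label, sizes, fits_utf8mb3(sample))
```

So a utf8mb3 column happily stores the stacked diacritics and only rejects the emoji.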
By my reading, TNNT's current anti-Zalgo fix is actually database-independent. Instead, the text_field_clean function in forms.py (here) sanitizes inputs before even making a database call. To prevent Zalgo fun, it tests for characters with more than one diacritic applied. Then it nicely tells the end-user the form can't be submitted and why.
We could do this for emoji too, but currently, emoji aren't sanitized away. Form submissions are accepted, and then break during the uniqueness check instead, with a crazy-ugly error page for the end-user.
So if the intention is to reject submissions containing emoji, we should add a check to text_field_clean along with a nice forms.ValidationError message.
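Roughly, such a check could look like the sketch below. This is not the actual body of `text_field_clean`, which isn't shown in this thread: the function name `check_name`, the one-diacritic limit, and the messages are all assumptions, and it raises a plain `ValueError` so it runs without Django (inside a form it would raise `forms.ValidationError` instead).

```python
import unicodedata

# Assumed limit: at most one combining mark per base character.
MAX_COMBINING_RUN = 1


def check_name(value: str) -> str:
    """Reject 4-byte characters (emoji etc.) and stacked diacritics; return value if OK."""
    run = 0
    for ch in value:
        # Code points above U+FFFF need 4 UTF-8 bytes, which utf8mb3 cannot store.
        if ord(ch) > 0xFFFF:
            raise ValueError(
                "Names cannot contain emoji or other characters outside the "
                "Basic Multilingual Plane."
            )
        # A nonzero canonical combining class marks most diacritics.
        if unicodedata.combining(ch):
            run += 1
            if run > MAX_COMBINING_RUN:
                raise ValueError(
                    "Names cannot stack more than one diacritic on a character."
                )
        else:
            run = 0
    return value
```

Note that `unicodedata.combining` returns 0 for a few kinds of combining mark (enclosing marks, for instance), so a stricter version might test the Unicode category instead.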
Or easier - if you just want to suppress the nasty error page, I think you can just set DEBUG = False in settings.py here.
Ah, great catch -- the whole situation predates my involvement so I was just going by some internal discussions and older IRC log snippets. I guess the change back to utf8 was just part of an attempt to debug the emoji issue, then, and it just never got changed back (maybe because nobody could figure out what caused the original errors so didn't think it made a big difference, or maybe because it was forgotten about -- that change happened at a really work-intensive part of launching the new backend, I think, so it may have just slipped by the wayside since nobody could think of an easy fix).
We will definitely circle back on this after this year's tournament and try to track down the root cause of the original errors, and in the meantime we'll discuss whether it's feasible to deploy a band-aid to prevent the verbose error messages now even though the tournament has already begun. That may have to wait until later, too, but we'll talk it over. Thanks for bringing this to our attention and your help with it! Sorry that you're still going to be deprived of emoji clan names this year.
We now have a temporary fix (based on your suggestion) that should prevent the huge backtraces and actually inform people about what is being disallowed. And hopefully it won't be too bad figuring out those "illegal mix of collations" errors when we try actually changing the db to utf8mb4 later. We'll come back to that at some point after the tournament. Thanks again for the help you've given us with this already!
Looks good!
See views.py, line 841:
TNNT's MySQL database currently encodes usernames & clan names in the outdated utf8mb3 format, limiting names to UTF characters of 3 bytes or less. This excludes emoji, as well as a few Indian & indigenous alphabets and some other fun stuff like math.

The default MySQL charset is utf8, which actually isn't UTF-8. MySQL still aliases utf8mb3 as utf8 (quite the misnomer!) and currently recommends manually switching over to utf8mb4, while they toy 🧸 with updating their utf8 alias & defaults to utf8mb4 (source).

While the TNNT MySQL database only supports utf8mb3, the TNNT backend is more than happy to accept 4-🧛🏻‍♀️ byte UTF-8 values from users via form fields and feed them 🩸 directly as MySQL queries, leading to juicy & convoluted 🥨 errors:

tl;dr - update backend MySQL database to utf8mb4
so i can have rly π clan name, pretty pretty plz πΈ