[feature] Allow for more diverse usernames

daenney commented 1 year ago

Is your feature request related to a problem?

Currently, we have some strict limits in GtS on usernames. Specifically, a username must be at least 2 characters long and may only contain lowercase ASCII a-z and the digits 0-9. Our validation error says "lowercase letters", which covers a lot more than just lowercase ASCII letters.

Though this is not enforced on the Account model itself, creation will fail due to our Username validation function:

https://github.com/superseriousbusiness/gotosocial/blob/ea1bbacf4b51628f55bc831f511ce60ddb72d71c/internal/validate/formvalidation.go#L74-L84
https://github.com/superseriousbusiness/gotosocial/blob/ea1bbacf4b51628f55bc831f511ce60ddb72d71c/internal/regexes/regexes.go#L103-L104
https://github.com/superseriousbusiness/gotosocial/blob/ea1bbacf4b51628f55bc831f511ce60ddb72d71c/internal/regexes/regexes.go#L50

Describe the solution you'd like.

I'd like to propose we change this in order to allow a more diverse set of usernames and allow people using different writing systems to register an account without the obligation to ASCIIfy/Romanise their username.

To that end, I'd like us to:

- Lift the restriction on 2 characters as in some writing systems a single character can express a whole lot. I'm not quite sure why 2 was deemed OK but 1 was not, but I'm curious as to any historical context here. That would also resolve #1691
- Allow Unicode letters and numerals in the username when signing up, much like how we allow them in incoming account names, you can search for them etc. (See for example #1743)

One thing I would like to avoid is that by supporting unicode letters and not just ASCII it becomes possible to construct a username that's visually identical to another, but leverages a different writing system. For example having the username adam but swapping the Latin a out for the Cyrillic a. This is typically done for the purpose of spam or phishing and something I'd like to ensure GtS doesn't enable by default. On the bright side, doing this will make your user harder to find when searching so it does partially defeat itself. To that end I would propose that any implementation we do here has a configurable threshold for how many different scripts may occur in a username, and default that to 1 (excluding digits).
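A rough sketch of what such a script threshold could look like in Go, using only the standard library's `unicode.Scripts` tables. The names `scriptsIn` and `withinScriptLimit` are made up for illustration and are not existing GtS functions. Non-letters such as digits are ignored, and letters in the `Common` and `Inherited` pseudo-scripts are assumed not to count towards the limit.

```go
package main

import (
	"fmt"
	"unicode"
)

// scriptsIn returns the set of Unicode script names used by the letters in s.
// Non-letters (digits, marks, punctuation) are skipped, as are letters that
// belong to the "Common" or "Inherited" pseudo-scripts.
// Hypothetical helper, not part of GtS.
func scriptsIn(s string) map[string]struct{} {
	found := make(map[string]struct{})
	for _, r := range s {
		if !unicode.IsLetter(r) {
			continue
		}
		for name, table := range unicode.Scripts {
			if name == "Common" || name == "Inherited" {
				continue
			}
			if unicode.Is(table, r) {
				found[name] = struct{}{}
				break
			}
		}
	}
	return found
}

// withinScriptLimit reports whether s uses at most maxScripts distinct scripts.
// maxScripts stands in for the configurable threshold (assumed default: 1).
func withinScriptLimit(s string, maxScripts int) bool {
	return len(scriptsIn(s)) <= maxScripts
}

func main() {
	fmt.Println(withinScriptLimit("adam", 1))      // Latin only
	fmt.Println(withinScriptLimit("\u0430dam", 1)) // Cyrillic "а" followed by Latin "dam"
	fmt.Println(withinScriptLimit("adam42", 1))    // digits do not count as a script
}
```

This only catches mixed-script names. A name written entirely in one script that happens to look like a name in another script passes the check, so it is a partial mitigation at best.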

On account creation we should also be careful to ensure you can't sign up with a username that's already taken just because you happened to get a different Unicode representation of it. For that to work, we need to normalise to NFC wherever we take the username in (sign up, login etc.). Normalising to NFC also tends to make the string more compact, so it's better suited for storage. Most of the client API only receives a token and then looks you up, so we don't have to worry about it there.

Describe alternatives you've considered.

Not do any of this.

Additional context.

No response

decentral1se commented 1 year ago

Thanks for laying it all out @daenney :tada:

> Lift the restriction on 2 characters as in some writing systems a single character can express a whole lot. I'm not quite sure why 2 was deemed OK but 1 was not, but I'm curious as to any historical context here. That would also resolve https://github.com/superseriousbusiness/gotosocial/issues/1691

I opened https://github.com/superseriousbusiness/gotosocial/pull/1823 for this part.

daenney commented 1 year ago

One other thing I just realised; in parts of the code base we do explicit LOWER(?) for values in the SQL when running certain comparisons on account names amongst other things.

This will be a problem, as LOWER in SQLite is only guaranteed to do what we hope when used with ASCII. LIKE is case-insensitive in SQLite, but by default only for ASCII: it doesn't handle Unicode case folding unless you have the ICU extension. In PG, LIKE is case-sensitive. There's ILIKE in PG, but that has performance issues at times, so you'll often see ~* instead. But that means writing separate queries for SQLite vs. PG, which is tedious and bound to go wrong eventually.

(A slight aside is that using LOWER can cause indexes to not be used unless the index is also defined with lower() for that value)

Since solving this at the DB level is all a bit of a mess as long as we support both SQLite and PG, it's probably best to do the lower-casing ahead of time in Go and ensure we only store account names and usernames in lower case in the DB.
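As a minimal sketch of lower-casing in Go before anything reaches the DB: `strings.ToLower` maps Unicode letters, not just ASCII, so the same code path would behave identically for SQLite and PG. The helper name `canonicalUsername` is hypothetical and not an existing GtS function.

```go
package main

import (
	"fmt"
	"strings"
)

// canonicalUsername lower-cases a username in Go so that storage and lookups
// never depend on the database's LOWER() or LIKE behaviour.
// Hypothetical helper, not part of GtS.
//
// Caveat: strings.ToLower maps rune by rune and is not full Unicode case
// folding. For example "ß" stays "ß" while "SS" becomes "ss".
func canonicalUsername(raw string) string {
	return strings.ToLower(raw)
}

func main() {
	fmt.Println(canonicalUsername("Adam"))                     // ASCII
	fmt.Println(canonicalUsername("\u00c5dam"))                // Latin capital A with ring above
	fmt.Println(canonicalUsername("\u0391\u0394\u0391\u039c")) // Greek capitals
}
```

This does not perform NFC normalisation. That would need something outside the standard library, for instance the `golang.org/x/text/unicode/norm` package, and would presumably happen in the same place.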

NyaaaWhatsUpDoc commented 1 year ago

This does mean we'll either need to only support lowercase (of all Unicode), or add a separate username column for the actual un-lowered version of the username. The latter feels kind of hacky, but also i don't want to make any sweeping claims, as i don't know whether enforcing lower-cased Unicode in, for example, non-Latin alphabets would give unexpected results :thinking:

daenney commented 1 year ago

That's a good question. Does anyone know which implementations may already support Unicode usernames? Might be helpful to take a peek at how they handle it.

daenney commented 1 year ago

Interestingly enough, I ran into https://social.treehouse.systems/@marcan/110524048419426592 recently which raises a good point about usernames.

There's a reason no native Japanese SNS/platform uses kanji/kana usernames as identifiers. Nobody wants to have to spell out exactly how to type/convert someone's name in order to add them as a friend, with the zillion extra dimensions you get when you allow CJK characters. Spelling out Japanese text to be bit-identical is ridiculously harder than doing the same for ASCII, and completely impractical for less technically minded people who might not be able to tell apart certain fullwidth/halfwidth variations (especially with proportional fonts). And don't get me started on Unicode normalization!

Quick quiz: Are these the same username?

ふたばふたばふたはﾞふたは゛

Would you design your username system to normalize them to the same username? Would that normalization work well in all cases? If not, would you know how to tell them apart? How would you spell them out individually such that someone on the other end knows how to type them without just copy+paste? (Hint: it's not even possible for all variations without a character picker)

This raises a good point and one I had kinda forgotten about: in many places with other writing systems, usernames tend to still be ASCII. The Japanese case described here illustrates that well.

Both in French and in Swedish I don't think I've ever encountered a username using something like a C-cedilla or an å. Similarly in Greek, usernames are typically ASCII and I imagine that's the same in most Slavic languages too for the same reason. The example in #1735 showed the case of punycode domain names and we do need to handle those properly. But it didn't include an example of a username using an extended character set and I'm wondering if those actually exist in other implementations.

Obviously people can set their display name to whatever they like using whatever character set, but maybe it actually makes sense to keep our usernames constrained to ASCII?

igalic commented 1 year ago

the other issue when allowing all of Unicode is that you then need to check for fakes (as soon as we allow unattended sign ups, anyway)

meena vs мееиа

daenney commented 1 year ago

> the other issue when allowing all of Unicode is that you then need to check for fakes (as soon as we allow unattended sign ups, anyway)
>
> meena vs мееиа

Indeed. In the mixed-alphabet case we can detect that pretty easily because it's a dead giveaway. But full meena in Latin vs full мееиа in Cyrillic becomes a lot harder. I'm not really sure at which point you'd consider them different enough. This may not be a problem in the end if we never allow unattended sign-ups, i.e. without a moderator/admin approving them, but it may be a can of worms we're better off not opening to begin with.

Ember-ruby commented 3 months ago

tbh this should be up to the instance operator (config variable maybe)

i know that at least, being able to use your preferred capitalization for your username is pretty common

not seeing any usernames using other characters than ASCII may well be because practically everything denies it

impersonation, well yeah that is a risk, but i don't see it being that much higher than your regular open register server, or federation in general (creating a fake account on the same server as the user is very likely to be quickly spotted, while finding a badly moderated large open reg instance is easy)

Fastidious commented 3 months ago

> Specifically, it must be at least 2 characters

This would be nice to change, why this limitation? For example, I follow at least one person whose username is d@wafer.baby.
