Allow Unicode in nicknames

ttepasse commented 8 years ago

RFC 1459 only allows ASCII letters, numerals and some special characters in nicknames, leaving people from non-anglophone countries at a disadvantage. Using the wealth of human writing is possible in the body of messages; it should be possible in nicknames too.

SadieCat commented 8 years ago

There are existing implementations of this (e.g. InspIRCd's m_nationalchars) but nothing standard. I believe that @DanielOaks was looking into trialling RFC 3454 in @mammon-ircd with a desire for standardising it though.

It isn't as simple as just allowing it though. Compatibility is a concern (there are clients which break when they get a CASEMAPPING which is not ascii or rfc1459) as well as masquerading with characters that look similar (e.g. character 97 "a" looks very similar to character 1072 "а").
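The look-alike problem is easy to reproduce. A minimal Python sketch using only the standard library's `unicodedata` module; the nickname `admin` is made up for illustration:

```python
import unicodedata

latin_a = chr(97)       # "a"
cyrillic_a = chr(1072)  # "а", renders almost identically

for ch in (latin_a, cyrillic_a):
    print(ord(ch), hex(ord(ch)), unicodedata.name(ch))

# Hypothetical nicknames that differ only in that one character.
nick_latin = "admin"
nick_spoof = cyrillic_a + "dmin"

print(nick_latin == nick_spoof)  # the two strings are not equal
# Unicode normalization does not merge the two letters either.
print(unicodedata.normalize("NFKC", nick_spoof) == nick_latin)
```

Since normalization alone leaves the two letters distinct, a server would need some extra rule (script restrictions, a confusables table, or similar) to catch this.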

clokep commented 8 years ago

There's also cases of servers improperly implementing rfc1459 vs. strict-rfc1459 (see inspircd/inspircd#1017).
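For reference, the difference between the casemappings comes down to which punctuation is folded along with the ASCII letters. A small sketch in Python; the function name `fold` and the table names are invented for this example:

```python
import string

# Each table lowercases A-Z. The RFC 1459 variants also treat some
# punctuation as upper/lower pairs of one another.
_ASCII = str.maketrans(string.ascii_uppercase, string.ascii_lowercase)
_STRICT = str.maketrans(string.ascii_uppercase + "[]\\",
                        string.ascii_lowercase + "{}|")
_RFC1459 = str.maketrans(string.ascii_uppercase + "[]\\~",
                         string.ascii_lowercase + "{}|^")

_TABLES = {"ascii": _ASCII, "strict-rfc1459": _STRICT, "rfc1459": _RFC1459}

def fold(name: str, casemapping: str) -> str:
    """Return the comparison form of a name under the given casemapping."""
    return name.translate(_TABLES[casemapping])

for mapping in ("ascii", "strict-rfc1459", "rfc1459"):
    print(mapping, fold("NICK[A]~", mapping))
```

The only difference between `rfc1459` and `strict-rfc1459` in this sketch is the `~` / `^` pair, which is exactly the kind of detail that is easy to get wrong.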

Ideally, wouldn't we want this to match how it is done for channel names?

TingPing commented 8 years ago

For what it's worth, I made a test branch for hexchat supporting rfc3454, though no network implements it afaik, so there's nowhere to try it.

clokep commented 8 years ago

For what it's worth, we just experimented a bit on moznet, and things like the zero-width space character are accepted as a valid room name, which shows up as empty in whois:

(Additionally, there's also a channel which is just the prefix, #, which is a bit funky.)

[Screenshot, 2016-05-03]

dequis commented 8 years ago

Relevant reading:

UTR 36: Unicode Security Considerations

UTS 39: Unicode Security Mechanisms

Bitlbee has a 'utf8_nicks' setting, disabled by default and with a small warning about potential breakage in the help text. It doesn't perform any cleanup, deferring that to the IM server (XMPP, for example, cleans them with the nodeprep/resourceprep profiles of stringprep), but I'd really like to change this.

I haven't heard of clients with big issues when enabling this, just minor visual issues like miscalculating the width when displaying the nicks in a terminal.

MicroDroid commented 8 years ago

How are we going to maintain the backward compatibility? I'd upvote this otherwise

grawity commented 8 years ago

In practice, it's probably already compatible because many clients don't care.

MicroDroid commented 8 years ago

Hmm, well then this should really be in IRCv3.2, it's awesome

DanielOaks commented 8 years ago

masquerading with characters that look similar (e.g. character 97 "a" looks very similar to character 1072 "а").

With rfc3454 casemapping I believe we use the nameprep profile to prevent issues like this. It would be good to read through documents like those in detail to make sure we do things right if we're standardising it though.

So long as you continue to disallow characters that break the protocol (e.g. commas, periods in client names, etc.), and reject nicks/channel names that fail to casefold (i.e. strings that fail because they contain a character prohibited by the profile), I haven't seen too many issues with it.
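A rough sketch of that two-step check in Python, using the `nameprep` function that ships in the standard library's `encodings.idna` module. The set of protocol-breaking characters and the name `fold_name` are assumptions for illustration, not what any particular server does:

```python
from encodings.idna import nameprep
from typing import Optional

# Assumption for this sketch: characters that would break message framing
# or list parsing are rejected outright. A real server has its own list.
PROTOCOL_BREAKING = set(" ,\r\n\0")

def fold_name(name: str) -> Optional[str]:
    """Return the casefolded form of name, or None if it must be rejected."""
    if any(ch in PROTOCOL_BREAKING for ch in name):
        return None
    try:
        folded = nameprep(name)
    except UnicodeError:
        # The profile prohibits at least one character in the name.
        return None
    # Characters that are "mapped to nothing" can leave an empty string.
    return folded or None

print(fold_name("Alice"))    # folded form
print(fold_name("a,b"))      # rejected: comma
print(fold_name("\u200b"))   # rejected: zero-width space maps to nothing
```

Note that Python's `nameprep` is built on the Unicode 3.2 tables that RFC 3454 references, so characters added to Unicode later are not covered by this check.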

kaniini commented 8 years ago

In charybdis, we plan to implement rfc7700 "casemapping", which is the same as rfc3454 nameprep except using IDN2008 rules, with specific requirements for "nicknames".

TingPing commented 8 years ago

How are we going to maintain the backward compatibility?

It is a joke, but it is a solution: convert it to punycode (or similar) for non-Unicode clients.
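To see what that would look like, here is a small sketch using Python's built-in `punycode` codec. The helper names are invented, and whether the result is a valid nickname on a given server is not checked:

```python
def to_ascii_nick(nick: str) -> str:
    """Encode a nick with Python's built-in Punycode codec (RFC 3492)."""
    return nick.encode("punycode").decode("ascii")

def from_ascii_nick(encoded: str) -> str:
    """Reverse of to_ascii_nick."""
    return encoded.encode("ascii").decode("punycode")

nick = "\u0105\u00e3\u00e5"  # "ąãå"
wire = to_ascii_nick(nick)
print(wire)                   # pure ASCII form
print(from_ascii_nick(wire))  # round-trips back to the original
```

The encoded form is not something a person could read or type from memory, which is probably why it only works as a joke.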

In practice, it's probably already compatible because many clients don't care.

Not sure what you mean by that, many clients respect the casemapping and rely upon its behavior.

DanielOaks commented 8 years ago

@kaniini That makes sense, once it's implemented/specced out give me a yell and I can see about switching my personal stuff over to use it as well.

kaniini commented 8 years ago

How are we going to maintain the backward compatibility?

There is no plan in charybdis for backwards compatibility. Deployments which switch from rfc1459 to rfc7700 casemapping will assume clients support UTF-8 properly. Networks will decide on their own when to make the switch, or whether to make it at all.

Mikaela commented 8 years ago

How would tab completion work if someone used a nick on an international channel that is not in the Latin alphabet?

What if the client is configured to use a non-UTF-8 charset?

What if the person using a UTF-8 nick uses something that my client cannot show due to the old glibc on my system, which is still in the wild? (ref: https://github.com/weechat/weechat/issues/79)

DanielOaks commented 8 years ago

How would tab completion work if someone used a nick on an international channel that is not in the Latin alphabet?

Presumably the same as it does with the latin alphabet.

What if the client is configured to use a non-UTF-8 charset?

I think detecting a specifically UTF-8-based casemapping from the server should make the client default to using UTF-8, if they're not already. If the user decides not to use it, they may get corrupted characters, just like what happens today when two clients using utf8 and non-utf8 try to send weird characters to each other.
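A sketch of how a client could act on that, assuming the server advertises the casemapping in its `005` (RPL_ISUPPORT) line. The casemapping names `rfc3454` and `rfc7700` are the ones floated in this thread, not standardized values, and the function names are made up:

```python
def parse_isupport(line: str) -> dict:
    """Extract KEY=VALUE tokens from an RPL_ISUPPORT (005) line."""
    # Drop the trailing human-readable part (":are supported by this server").
    head = line.split(" :", 1)[0]
    params = head.split()
    tokens = {}
    # params: [":server", "005", "<nick>", token, token, ...]
    for token in params[3:]:
        key, _, value = token.partition("=")
        tokens[key] = value
    return tokens

# Assumption: casemapping names that would signal UTF-8 nicknames.
UTF8_CASEMAPPINGS = {"rfc3454", "rfc7700"}

def nick_encoding(tokens: dict, configured: str) -> str:
    """Pick the charset for nicknames, overriding the user's setting if needed."""
    if tokens.get("CASEMAPPING", "rfc1459") in UTF8_CASEMAPPINGS:
        return "utf-8"
    return configured

line = (":irc.example.net 005 alice CASEMAPPING=rfc7700 NICKLEN=30 "
        ":are supported by this server")
print(nick_encoding(parse_isupport(line), "iso-8859-1"))
```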

What if the person using a UTF-8 nick uses something that my client cannot show due to the old glibc on my system, which is still in the wild?

Then it will not show those characters because your client (or the system you're using it on) does not work properly. I don't think this is an issue for us to worry about; it's a bug that more distros will fix over time, and I expect it will be fixed widely enough for us not to care by the time a Unicode casemapping actually gets into proper usage.

attilamolnar commented 8 years ago

I fully support moving away from legacy rfc1459 towards rfc7700.

grawity commented 8 years ago

How would tab completion work if someone used a nick on an international channel that is not in the Latin alphabet?

Not sure if possible, but ideally a<Tab> would also include nicks beginning with ą ã å あ etc., similar to how in some clients it already skips over any leading punctuation (a<Tab> → [Attila]).

What if the client is configured to use a non-UTF-8 charset?

Clients which support CASEMAPPING=rfc7700 would always decode nicknames as UTF-8, regardless of the configured message encoding.

Existing clients would work the same way they already do when someone sends a UTF-8 message (i.e. some would detect UTF-8 anyway, others would mis-decode it as ISO-8859-42 or whatever such).
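The two behaviours can be sketched in a few lines of Python. The fallback charset here is only an example, and `decode_nick` is a made-up name:

```python
def decode_nick(raw: bytes, fallback: str = "iso-8859-1") -> str:
    """Try strict UTF-8 first, then fall back to a legacy charset."""
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        return raw.decode(fallback)

raw = "\u0105\u00e3\u00e5".encode("utf-8")  # "ąãå" as sent on the wire

print(decode_nick(raw))          # a client that detects UTF-8
print(raw.decode("iso-8859-1"))  # a client that always uses its legacy charset
print(decode_nick(b"\xe5ke"))    # legacy bytes still decode via the fallback
```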

What if the person using a UTF-8 nick uses something that my client cannot show due to the old glibc on my system, which is still in the wild?

🤷

I guess it'd be less likely to happen if only "word" characters were accepted, similar to how Python etc. filter characters allowed in variable names.
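One way to read "word characters" is by Unicode general category. A minimal sketch, assuming letters, combining marks and numbers are the allowed set (that choice, and the name `is_word_nick`, are assumptions for illustration):

```python
import unicodedata

def is_word_nick(nick: str) -> bool:
    """True if every character is a letter, combining mark, or number."""
    return bool(nick) and all(
        unicodedata.category(ch)[0] in "LMN" for ch in nick
    )

print(is_word_nick("\u0105\u00e3\u00e5"))  # "ąãå": letters
print(is_word_nick("\u3042"))              # "あ": letter
print(is_word_nick("\u200b"))              # zero-width space: format character
print(is_word_nick("\U0001F4A9"))          # emoji: symbol
```

A filter this strict would also reject punctuation that RFC 1459 nicknames allow today, such as `[` or `_`, so a real rule would need to add those back.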

DanielOaks commented 8 years ago

ideally a<Tab> would also include nicks beginning with ą ã å あ etc., similar to how in some clients it already skips over any leading punctuation (a<Tab> → [Attila]).

So long as the client takes the casefolding into account when evaluating tab-complete matches, should work without an issue I'd imagine.
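As a sketch of what that could look like client-side, here is prefix matching on casefolded strings. Python's `str.casefold` stands in for whatever casemapping the server advertises, which is an assumption, and the nick list is invented:

```python
def complete(prefix: str, nicks: list) -> list:
    """Return the nicks whose casefolded form starts with the folded prefix."""
    folded = prefix.casefold()
    return [n for n in nicks if n.casefold().startswith(folded)]

nicks = ["Attila", "\u00c5sa", "\u00e5ke", "bob"]  # Attila, Åsa, åke, bob

print(complete("\u00e5", nicks))  # typing "å" matches both Åsa and åke
print(complete("a", nicks))       # typing "a" matches only Attila
```

Casefolding alone handles upper and lower case, but it does not make a plain `a` match `å`; that needs an extra step beyond the casemapping.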

MicroDroid commented 8 years ago

Maybe we can do some math in the IRC server to create an alias and send it to the client, so the client uses the alias to complete the actual nick?

Like if the nick is ąãå, then the IRC server does some math and creates aaa out of it, and sends it to the client during NAMES or something.

This way the user can just put like aa<Tab> → ąãå

Or, the math part might be left up to the clients, as the whole thing is really client side anyways.

clokep commented 8 years ago

I'd suggest it's just up to the clients to implement tab completion in a sane manner. User interfaces shouldn't be specced in a protocol.

MicroDroid commented 8 years ago

Right. So either way this problem is avoidable.

grawity commented 8 years ago

Hmm, how do people use tab-completion in the existing ISO-2022-JP networks?

Mikaela commented 8 years ago

Like if the nick is ąãå, then the IRC server does some math and creates aaa out of it, and sends it to the client during NAMES or something.

This way the user can just put like aa<Tab> → ąãå

Wouldn't this just mean that aaa and ąãå were the same nick, along with all the variations of ąãå which the IRCd would interpret as aaa, and get very confusing? This is why I gave :-1: to your comment.

RyanSquared commented 8 years ago

Like if the nick is ąãå, then the IRC server does some math and creates aaa out of it, and sends it to the client during NAMES or something.

This way the user can just put like aa<Tab> → ąãå

Wouldn't this just mean that aaa and ąãå were the same nick, along with all the variations of ąãå which the IRCd would interpret as aaa, and get very confusing? This is why I gave :-1: to your comment.

Another idea could be to treat it like capitalizations? ąãå == aaa == AAA but it's not translated to the same characters.
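To make the trade-off concrete, here is a sketch of a server-side registry that compares nicks by a diacritic-insensitive key while storing them as typed. The `skeleton` function and `NickRegistry` class are hypothetical and not taken from any IRCd:

```python
import unicodedata

def skeleton(nick: str) -> str:
    """Casefold and drop combining marks: a hypothetical comparison key."""
    decomposed = unicodedata.normalize("NFKD", nick.casefold())
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

class NickRegistry:
    """Stores nicks as typed, but treats equal keys as the same nick."""

    def __init__(self):
        self._by_key = {}

    def register(self, nick: str) -> bool:
        key = skeleton(nick)
        if key in self._by_key:
            return False  # already taken by an equivalent spelling
        self._by_key[key] = nick
        return True

registry = NickRegistry()
print(registry.register("aaa"))                 # accepted
print(registry.register("\u0105\u00e3\u00e5"))  # "ąãå": refused, same key
print(registry.register("AAA"))                 # refused, same key
```

This keeps the displayed characters intact, but it also shows the cost raised above: whoever registers `aaa` first locks out every accented variation.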

kaniini commented 8 years ago

Proposed client behaviour would be in a non-normative part of the spec at best, so it's not even worth bothering with. I suspect with the way this discussion is going, this will be an area where the IRCv3 process fails us and we just form a coalition of IRCd vendors to make it happen, and then IRCv3 maybe documents it after the point.

grawity commented 8 years ago

So business as usual, then?

DanielOaks commented 8 years ago

Pretty much what @kaniini says. It's not a huge issue to worry about.

sdaugherty commented 8 years ago

I'd still be concerned about breakage - even clients which support UTF-8 messages likely have made assumptions about nicknames, particularly any clients which support tab completion or which maintain a cached member list for channels for some purpose. I'd be afraid that this is likely to expose a lot of undefined behaviors around input sanitization of nicknames received from the server (or the lack thereof).

Some possible manifestations of incompatibility with UTF-8 nicknames:

Commands applied to the wrong user
Broken tab completion
"null" users in internal caches
Garbled characters

Some of these issues already exist today with channel names and chat messages, but nicknames are more fundamental, as they are identifiers that the client absolutely has to deal with correctly - if a channel name breaks a client, the user can avoid that channel, but a user can't necessarily choose to avoid all users with UTF-8 nicknames.

There's also a severe usability concern that needs to be addressed - a channel operator MUST be able to quickly and unambiguously specify nicknames for use in commands with only keyboard input, regardless of what language's characters might happen to be in those nicknames. Even if that client properly supports UTF-8 nicknames, if the use of such nicknames complicates the effective management of channels in the slightest, then user acceptance of internationalized nicknames will either be dead in the water as a feature users rebel against, or there will be demands for restrictive channel modes to prohibit all internationalized nicknames on a channel.

(Yes, I realize that in most cases, a user has access to a GUI, tab completion, or copy/paste, but there is no guarantee of this - there are environments where none of these will be a viable option. Tab completion, for example, often requires the user to specify at least a partial match, or to iterate through every nickname on the channel; copy/paste may not be available if the user is at an actual console session rather than running a terminal inside a GUI; GUI userlists aren't available in a terminal; and so on.)

kaniini commented 8 years ago

rfc7700, when properly implemented, handles all of those issues and more. have you read it?

sdaugherty commented 8 years ago

I have, and it is so extremely light on practical details about exactly how it would be implemented within the IRC protocol that it leaves more questions than answers. While IRC is mentioned as a possible application, aside from that mention, the rest of the RFC consists of a set of guidelines that can be generically applied to problems inherent with nickname internationalization across a wide variety of existing and future protocols.

While the specifications set out in the RFC address a number of potential issues, the lack of any formal guidance on how to integrate them into the IRC protocol, combined with a lack of IRC-specific recommendations, effectively makes it nothing more than a building block, and my concerns above, from a user standpoint, about IRC-specific implementation details remain at most partially addressed by RFC7700.

Of more concern, there are some security considerations that should be readily apparent to any long-time user of IRC, which are not mentioned - specifically, the potential for disruption if the effective use of channel management and ignore functionality is obstructed or defeated by internationalized nicknames. This is especially important here because users might first have to learn how to deal with inputting i18n nicknames while under the pressure of an ongoing disruption or attack.

Any demonstration or reference implementations will have to be especially aware of these and other considerations, to avoid an implementation that is perceived as creating more problems than it solves.

realJoshByrnes commented 8 years ago

If you look at the IRCX Draft v04 (Microsoft, 1998) it provides a way to allow Unicode nicknames in IRC.

Clients that don't support Unicode (non-IRCX in the draft) see: 1) non-Unicode nicknames as usual; 2) Unicode nicknames as '^' followed by the hex representation of the nickname.
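A rough illustration of that fallback in Python. This is not taken from the IRCX draft text: treating "hex representation" as the hex digits of the nick's UTF-8 bytes is an assumption, and both function names are invented:

```python
def to_legacy_nick(nick: str) -> str:
    """Form shown to a client that does not support Unicode nicknames."""
    if all(ord(ch) < 128 for ch in nick):
        return nick
    # Assumption: "hex representation" = hex digits of the UTF-8 bytes.
    return "^" + nick.encode("utf-8").hex().upper()

def from_legacy_nick(shown: str) -> str:
    """Reverse of to_legacy_nick for strings that carry the '^' prefix."""
    if not shown.startswith("^"):
        return shown
    return bytes.fromhex(shown[1:]).decode("utf-8")

print(to_legacy_nick("alice"))
print(to_legacy_nick("\u00e5"))  # "å"
print(from_legacy_nick(to_legacy_nick("\u0105\u00e3\u00e5")))
```

One wrinkle: `^` is itself a legal nickname character under RFC 1459, so a scheme like this needs a rule for ordinary nicks that happen to start with it.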

This has been supported in many clients / servers since the 1990s, why not use it?

TingPing commented 8 years ago

This has been supported in many clients / servers since the 1990s, why not use it?

Curious which ones?

Marqin commented 8 years ago

Just remember to ban all Unicode confusable symbols for security reasons (allow only one version of those chars).

dpyro commented 8 years ago

What about handling emojis? 👸🏻 may appear as either multiple characters or a single character, while being visually different from or identical to 👸, depending on the system or application support. Additionally, many clients will use a shortcode such as :joy: to ease input of emoji. If a user wants to join #😂, they may use #:joy:, which would be a different room entirely.

Both #😂 and #:joy: appear to be valid and distinct channel names on QuakeNet. 💩.la is a valid domain name and website link on my system (macOS/Safari).
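The code-point view of those examples, as a short Python sketch using only the standard library:

```python
import unicodedata

princess = "\U0001F478"                  # 👸
princess_light = "\U0001F478\U0001F3FB"  # 👸🏻 = base emoji + skin tone modifier

print(len(princess), len(princess_light))
print([f"U+{ord(ch):04X}" for ch in princess_light])

# Normalization does not strip the modifier, so the two stay distinct.
print(unicodedata.normalize("NFKC", princess_light) == princess)

# A shortcode is just ASCII text as far as the server is concerned.
print("#\U0001F602" == "#:joy:")
```

So what a user sees as one glyph can be two code points on the wire, and any comparison done on code points will treat the two princesses as different names.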

DanielOaks commented 8 years ago

Shortcodes are handled explicitly by the client (i.e. if the client wants to convert them then cool), the protocol doesn't treat shortcodes any differently or give them any special conversion. At least in #272 right now, it allows emoji as a part of names so far as rfc7700 does, but servers are free to block whatever characters they want.

fantasai commented 7 years ago

@grawity NFKD + case folding is likely to help with the tab completion for accented characters. Decomposing characters will let you handle diacritics by either matching or skipping them, and compatibility decompositions will handle a lot of other stuff. (It's designed for search operations.) See http://unicode.org/reports/tr15/

But Unicode mapping tables won't handle things like a -> あ, since they don't have tables for romanization of non-Latin scripts. It's not really an easily standardizable thing... in many languages, there's multiple possibilities; e.g. Chinese has several formalized romanization schemes in common use, and Persian is romanized rather haphazardly by Persian-speakers.
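Both points can be shown with Python's `unicodedata` module. This is only a sketch of the idea: `search_key` and `complete` are made-up names, and the nick list is invented:

```python
import unicodedata

def search_key(text: str) -> str:
    """NFKD + case folding, with combining marks skipped."""
    decomposed = unicodedata.normalize("NFKD", text).casefold()
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

def complete(prefix: str, nicks: list) -> list:
    """Return the nicks whose search key starts with the prefix's key."""
    key = search_key(prefix)
    return [n for n in nicks if search_key(n).startswith(key)]

nicks = [
    "\u0104nna",           # Ąnna: accented capital
    "\u00e5ke",            # åke: accented lowercase
    "\uff41lice",          # ａlice: fullwidth "a", a compatibility character
    "\u3042\u304d\u3089",  # あきら: no decomposition to Latin exists
]

print(complete("a", nicks))
```

Typing `a` picks up the first three nicks through decomposition, while the hiragana nick is left out because no mapping table romanizes it.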

jwheare commented 6 years ago

Worth considering whether just using a metadata key might resolve this sufficiently. e.g. display-name described here: #336

ircv3 / ircv3-specifications

Allow Unicode in nicknames #259