Recommend using UTF-8 - Githubissues

xPaw commented 5 years ago

Here are some thoughts in no particular order from today's IRC discussion:

Currently charset/encoding support is a mess on IRC because it was never defined, and clients do not communicate which charsets they use (which led to hacks like casemappings)
Some IRCv3 caps will want to use Unicode (reactions, display names, etc.)
Mixing UTF-8 for tags only, and other encoding for the message itself is not a particularly good idea
Most legacy encodings (but not all of them) include the base ASCII set (byte values below 128), which makes UTF-8 backwards compatible for that range
It is impossible to reliably detect which encoding was used without already knowing it
If UTF-8 is enforced, it is impossible to make it fully backwards compatible with all legacy encodings (thus leading to question marks when decoding)
If servers start storing messages for chathistory backlog and message searching, they will want to use a proper encoding there as well
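The detection problem in the list above can be illustrated with a minimal sketch. It assumes only Python's standard codecs; the byte value and the list of candidate encodings are chosen for illustration and are not taken from the discussion. The same single byte is valid in each legacy encoding and means something different in every one of them, so the bytes alone cannot say which was intended.

```python
# One byte, four legacy single-byte encodings (illustrative choice).
raw = b"\xe9"
candidates = ["latin-1", "cp1251", "iso8859_7", "koi8_r"]

# Every decode succeeds; none of them raises an error.
decoded = {enc: raw.decode(enc) for enc in candidates}

for enc, text in decoded.items():
    print(f"{enc:10} -> {text!r}")
```

A receiver that guesses wrong still gets a perfectly valid string, just not the one the sender typed.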

I am most likely missing more points, but that's what I currently remember off the top of my head.

The current situation is pretty sad, in that a lot of clients try to decode UTF-8 and, if that fails, fall back to one of the common encodings (like latin1), which still leads to messy results. Enforcing (or at least strongly suggesting) use of Unicode will increase interoperability and compatibility between all kinds of IRC software.
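A minimal sketch of that fallback behaviour, assuming a client that tries UTF-8 first and then a single fixed legacy encoding. The function name and the sample text are made up for illustration and do not come from any particular client.

```python
def decode_with_fallback(raw: bytes, fallback: str = "latin-1") -> str:
    """Try strict UTF-8 first, then fall back to one fixed legacy encoding."""
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        # latin-1 maps every byte to a character, so this never fails,
        # even when the sender used a completely different encoding.
        return raw.decode(fallback)


# A sender using cp1251 (Cyrillic) and a receiver falling back to latin-1.
sent = "привет".encode("cp1251")
shown = decode_with_fallback(sent)
print(shown)
```

The fallback never raises, which is exactly why the result is messy: the receiver has no signal that the guess was wrong.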

Legacy charsets are not compatible with one another, and are super limited. A glaring example of this is emojis. If a client wants to correctly send and display them, it needs to deal with Unicode correctly.

DarthGandalf commented 5 years ago

Most (if not all encodings) include the base ASCII set (<128 bytes), which makes UTF-8 backwards compatible here

ISO-2022-JP is an example encoding which does not work that way, even though it uses only values < 128.
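A short sketch of this point, assuming Python's built-in `iso2022_jp` codec and an arbitrary sample string: every encoded byte is below 128, yet reading the bytes as ASCII does not give back the text.

```python
text = "日本語"
data = text.encode("iso2022_jp")

# All bytes are 7-bit, so an ASCII decode "succeeds"...
all_seven_bit = all(b < 128 for b in data)
as_ascii = data.decode("ascii")

# ...but yields escape sequences and unrelated characters, not the text.
print(all_seven_bit, repr(as_ascii))
print(data.decode("iso2022_jp") == text)
```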

TingPing commented 5 years ago

The servers and clients that use not-utf8 are the ones that clearly don't care what ircv3 says IMO.

(To be clear, duh recommend it, but it won't change anything)

xPaw commented 5 years ago

For reference, ZNC had an issue where a decoded line using cp1140 could contain new lines where they were not present in the raw buffer.

https://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2018-14055

Looking at https://en.wikipedia.org/wiki/EBCDIC_037 the issue probably stemmed from "half of the control character codes can be translated into their exact ASCII equivalents".
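A minimal sketch of the general effect, assuming Python's built-in `cp1140` codec. It does not reproduce ZNC's code or the exact CVE input; the sample line is invented. In this EBCDIC code page the byte 0x25 (the ASCII percent sign) decodes to a line feed, so a raw buffer without any newline byte can gain one after decoding.

```python
# Raw IRC-style line: no 0x0A byte anywhere in the buffer (sample input).
raw = b"PRIVMSG #chan :100% sure"
has_raw_newline = b"\n" in raw

# Decoding as cp1140 turns the 0x25 byte ('%' in ASCII) into a line feed.
decoded = raw.decode("cp1140")
has_decoded_newline = "\n" in decoded

print(has_raw_newline, has_decoded_newline)
```

Any code that splits on newlines before decoding and trusts the result afterwards is exposed to this kind of mismatch.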

xPaw commented 4 years ago

Here's another point: the WebSocket spec specifies UTF-8 for text frames.

https://github.com/ircv3/ircv3-specifications/pull/342#issuecomment-647514606

GIJack commented 4 years ago

We should use UTF-8, as it's almost universally used by almost everything, including POSIX locales.

I can in fact use UTF-8 on the modern *NIX command line. Anything copied and pasted from IRC should work with modern text editors, which, again, use UTF-8.

This should really be in IRCv3 as a spec. It should also note that using codepages, for any language, is obsolete, and that clients are discouraged or forbidden by the protocol from using them.

SadieCat commented 2 years ago

We now have the UTF8ONLY token which works great for new servers which only want to support UTF-8 but existing servers with non-UTF-8 clients need a way of migrating to this.

I propose we add a similar token which allows legacy servers to request UTF-8 from clients in a non-binding way to allow supporting clients to switch their connection configs to only send UTF-8 but still accept non-UTF-8 from other clients. This will allow existing servers to migrate to UTF-8 over time.

(And no, "fuck non-UTF-8 users" isn't an option for networks where people speak non-Western languages).

slingamn commented 2 years ago

This proposal reminds me of STS. STS is a well-designed specification with careful attention to backwards compatibility considerations, but I don't see much evidence that it has actually led to greater adoption of TLS. It's not widely implemented by client developers, but more to the point, legacy clients that can't/won't upgrade to an STS-capable version (old versions of mainstream clients, bots, and other ad-hoc setups) are likely disproportionately represented among plaintext connections. So STS is not really a feasible transition mechanism to a TLS-only world. If you want to go TLS-only, at some point you have to tear off the band-aid and stop supporting plaintext, leaving those users out in the cold.

I think any proposal for an incremental migration to UTF8ONLY would face more or less the same concerns. Server operators could likely get equal or better results through a campaign of "nudging" users (via the MOTD and/or global notices) to reconfigure their clients.

isn't an option for networks where people speak non-Western languages

As I understand it, the main alternative to UTF8 in current use is Latin-1 (probably followed distantly by Shift-JIS), so it's not really "non-Western" languages that are affected.

GIJack commented 2 years ago

UTF-8 support should be mandated for advertising IRCv3 compatibility. There should be fallback modes for old implementations.

There should be support for a UTF8ONLY flag/setting, and it should be entirely optional. All previous encodings should be marked as "legacy".

SadieCat commented 2 years ago

I think any proposal for an incremental migration to UTF8ONLY would face more or less the same concerns. Server operators could likely get equal or better results through a campaign of "nudging" users (via the MOTD and/or global notices) to reconfigure their clients.

I'm going to be direct: this might work for small networks with a dozen or so users like Ergo has but large networks do not want to do this and we need a migration path for large networks or they will never adopt UTF-8 which impedes support of any features which rely on it (see also: websockets).

I've actually seen data for nudging clients to migrate, as we did this on Snoonet when upgrading it to InspIRCd v3, but even after months of nudging there were still ~5% of total users (around 1/4 of broken users) who had not upgraded and whose clients broke when we finally pushed the upgrade. Switching encoding is an even bigger breakage than what we did then, so it would be very disruptive to hundreds of users (or potentially thousands on networks like Libera).

slingamn commented 2 years ago

I think I understand the scope of the problem. I'm saying that there is no solution: a spec that claims to solve it will just add clutter.

SadieCat commented 2 years ago

I'm saying that there is no solution: a spec that claims to solve it will just add clutter.

You made a bunch of unsubstantiated incorrect claims about STS (several clients including AdiIRC, mIRC, and IRCCloud all support it and in HexChat it is currently awaiting review) and incorrectly applied these to a proposal which isn't even related.

jesopo commented 2 years ago

This proposal reminds me of STS. STS is a well-designed specification with careful attention to backwards compatibility considerations, but I don't see much evidence that it has actually led to greater adoption of TLS. It's not widely implemented by client developers, but more to the point, legacy clients that can't/won't upgrade to an STS-capable version (old versions of mainstream clients, bots, and other ad-hoc setups) are likely disproportionately represented among plaintext connections. So STS is not really a feasible transition mechanism to a TLS-only world. If you want to go TLS-only, at some point you have to tear off the band-aid and stop supporting plaintext, leaving those users out in the cold.

this isn't really the goal of sts, it's about helping incorrectly configured modern clients. what will happen over time, as software naturally gets updated or the machines with outdated clients die, is clients will shift further towards being utf8 only until all that remains is people that pushed the wrong button.

As I understand it, the main alternative to UTF8 in current use is Latin-1 (probably followed distantly by Shift-JIS), so it's not really "non-Western" languages that are affected.

Shift JIS is a character encoding for the Japanese language not sure where in the west japan is. anyway I don't really care about it mostly affecting "non-Western" languages, I care about it mostly affecting people that don't speak English.

slingamn commented 2 years ago

What would you estimate as a realistic timeframe for either of these proposals "working", i.e. either

A large network deploys STS, and eventually STS succeeds in transitioning enough plaintext traffic to TLS that you'd be comfortable disabling plaintext
A large network deploys a "UTF8 recommended" token, and eventually it succeeds in transitioning enough non-UTF8 traffic to UTF8 that you'd be comfortable deploying UTF8ONLY?

SadieCat commented 2 years ago

STS is not the same because it only creates a time-limited redirect for clients that do not connect securely. There is no transition period with STS because it does not migrate clients permanently. Assuming your users' clients support it you can just deploy it and disable connection registration on plaintext ports like testnet.inspircd.org does on 6667.

My proposal is that clients that see a UTF-8-preferred token should change their configuration permanently to send UTF-8 whilst still being able to receive non-UTF-8 from the server (e.g. via PRIVMSG). This migrates users' client configuration over time to use UTF-8.
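A rough sketch of how a client might act on such tokens, under loud assumptions: `UTF8ONLY` is the existing ISUPPORT token, while `UTF8PREFERRED` is a placeholder name invented here for the proposed non-binding token (no such token is defined anywhere in this thread). The function names and the policy strings are also hypothetical.

```python
# Placeholder name for the proposed non-binding token; NOT from any spec.
HYPOTHETICAL_PREFERRED_TOKEN = "UTF8PREFERRED"


def isupport_names(line: str) -> set:
    """Extract token names from a single 005 (RPL_ISUPPORT) line."""
    head = line.split(" :", 1)[0]   # drop the trailing human-readable text
    words = head.split(" ")
    tokens = words[3:]              # skip prefix, "005", and the nick
    return {tok.split("=", 1)[0] for tok in tokens}


def outgoing_policy(names: set) -> str:
    """Decide what the client should do with its outgoing encoding."""
    if "UTF8ONLY" in names:
        return "utf8-required"      # server only accepts UTF-8
    if HYPOTHETICAL_PREFERRED_TOKEN in names:
        return "utf8-preferred"     # switch sending to UTF-8, still accept other input
    return "unchanged"


line = ":irc.example.org 005 nick UTF8ONLY CHANTYPES=# :are supported by this server"
print(outgoing_policy(isupport_names(line)))
```

In the "utf8-preferred" case the client would persist the new sending encoding in its configuration while keeping a tolerant decoder for incoming messages.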

slingamn commented 2 years ago

Assuming your users' clients support it you can just deploy it and disable connection registration on plaintext ports like testnet.inspircd.org does on 6667.

That's the hard part, though --- the assumption that all clients will support it. If you do this then you are breaking connectivity for all clients that are configured to use plaintext and don't support STS. With the current state of STS adoption, I imagine this is a dealbreaker for some large networks (Libera?) that currently offer plaintext. So the question remains: what level of STS adoption would render this proposal feasible?

My proposal is that clients that see a UTF-8-preferred token should change their configuration permanently to send UTF-8 whilst still being able to receive non-UTF-8 from the server (e.g. via PRIVMSG). This migrates users' client configuration over time to use UTF-8.

If the end goal is to enable UTF8ONLY, then the question remains: what level of client adoption would be necessary to reach the end goal? If that's not the end goal, then what is? Is anything achieved in the meantime by a partial transition to UTF8?

slingamn commented 2 years ago

I think the realistic migration path to UTF8ONLY involves (after an initial period of nudging) server-side transcoding.

webircproxy implements three configurable transcoding algorithms. The one most relevant to this issue is:

If the message is valid UTF-8, assume it is in fact UTF-8 and pass it through unmodified
If it's not, parse it as IRC. Leave the tags and the command unmodified (since they are already required to be UTF-8). If the prefix is valid UTF-8, leave it unmodified, otherwise transcode it to UTF-8 using the Unicode replacement character. (Servers may be able to skip this step if they can independently guarantee that prefixes are UTF-8.)
Apply the following algorithm to each parameter independently. If the parameter is valid UTF-8, assume it is in fact UTF-8 and leave it unmodified. Otherwise, apply the Mozilla chardet algorithm and transcode from the detected encoding to UTF-8. If chardet fails, transcode to UTF-8 using the Unicode replacement character.
Reassemble the message and send it, truncating to meet the relevant line length limit (since transcoding may have increased the message length) and taking care not to truncate in the middle of a UTF-8-encoded codepoint.
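The steps above can be sketched roughly as follows. This is not webircproxy's code: the function names are made up, the charset detector is passed in as a plain callable (standing in for the chardet step, which is not implemented here), the input is assumed to have its CRLF already stripped, and the 510-byte limit is an assumption about the relevant line length.

```python
from typing import Callable, Optional

Detector = Callable[[bytes], Optional[str]]  # returns an encoding name or None


def is_utf8(data: bytes) -> bool:
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


def to_utf8(data: bytes, detect: Optional[Detector] = None) -> bytes:
    """Pass valid UTF-8 through; otherwise transcode or use U+FFFD."""
    if is_utf8(data):
        return data
    encoding = detect(data) if detect else None
    if encoding:
        try:
            return data.decode(encoding).encode("utf-8")
        except (UnicodeDecodeError, LookupError):
            pass  # detection was wrong or unknown: fall through
    return data.decode("utf-8", errors="replace").encode("utf-8")


def truncate_utf8(data: bytes, limit: int) -> bytes:
    """Cut to at most `limit` bytes without splitting a codepoint."""
    if len(data) <= limit:
        return data
    cut = limit
    # If the first dropped byte is a continuation byte, back up to its lead byte.
    while cut > 0 and (data[cut] & 0xC0) == 0x80:
        cut -= 1
    return data[:cut]


def transcode_line(line: bytes, detect: Optional[Detector] = None,
                   limit: int = 510) -> bytes:
    if is_utf8(line):
        return line                          # fast path: unmodified
    words = line.split(b" ")
    out, i = [], 0
    if words[i].startswith(b"@"):            # tags: left unmodified
        out.append(words[i]); i += 1
    if i < len(words) and words[i].startswith(b":"):
        out.append(to_utf8(words[i])); i += 1    # prefix: replacement char only
    if i < len(words):
        out.append(words[i]); i += 1         # command: left unmodified
    while i < len(words):
        if words[i].startswith(b":"):        # trailing parameter, may contain spaces
            trailing = b" ".join(words[i:])[1:]
            out.append(b":" + to_utf8(trailing, detect))
            break
        out.append(to_utf8(words[i], detect))    # each parameter independently
        i += 1
    return truncate_utf8(b" ".join(out), limit)


# Usage with a stand-in detector that always answers latin-1.
fixed = transcode_line(b"PRIVMSG #caf\xe9 :na\xefve", detect=lambda _: "latin-1")
print(fixed.decode("utf-8"))
```

Passing the detector as a parameter keeps the sketch self-contained; a real deployment would plug in an actual charset detection library at that point.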

Initial benchmarking (on a decade-old CPU) is encouraging. If a message is already UTF-8, this can be validated in less than a microsecond. The slow path (for messages that actually require transcoding) costs approximately 150 microseconds per message, which seems acceptable.

The other two modes of operation are significantly faster, around 5 microseconds per message: either transcode using a fixed encoding, or transcode using the Unicode replacement character.

TingPing commented 2 years ago

If I may be salty for a second.

Clients should just default to strict-utf8 (with per-network ways to be strict-otherencoding). A few people will be upset for a year or two until updates make it to users, and then the majority will just have a sane client. HexChat did this in 2016, the world didn't end, people didn't mass-exodus from the software, and as of today people sending weird hybrid encodings to hexchat are treated as the weird broken text they are.

All of this planning about specs and trying to make everybody happy will go nowhere.

jesopo commented 2 years ago

I wish you hadn't been salty about the entire process IRCv3 is undertaking for incremental and backwards-compatible improvements, but here we are.

A lot of the problem here is the amount of non-UTF8 data in the wild is mostly uncountable but I would assume a lot of it is from people using outdated clients on weird networks and probably using stuff like mIRC rather than hexchat. unquantifiable problem, but much as I can't assert it's a massive problem, you can't assert that it isn't.

TingPing commented 2 years ago

but I would assume a lot of it is from people using outdated clients on weird networks and probably using stuff like mIRC rather than hexchat.

Then what can we do about it? They will never get new client features, they connect to servers that probably never get new features. They are in a bubble and will continue to be.

jesopo commented 2 years ago

sure, but it leaves them in a stalemate. the server can't update because the clients won't

jesopo commented 2 years ago

an additional note is we've got enclaves of clients like this on big networks like libera, and it means that libera can't meaningfully switch to being utf8-only without being hostile to users, which means any spec wanting support from libera is going to have to consider data being non-utf8 bytes or will need to wait past a transitional period of years and years and hope the nonutf8 scene quantifies and is much smaller than we thought or people just move on

slingamn commented 2 years ago

an additional note is we've got enclaves of clients like this on big networks like libera

Is there a way to get data on the prevalence of this, and on what encodings they're using? Without that information it seems difficult to evaluate any proposal (including server-side transcoding).

ircv3 / ircv3-ideas

Recommend using UTF-8 #38