Open xPaw opened 5 years ago
Most (if not all) encodings include the base ASCII set (byte values below 128), which makes UTF-8 backwards compatible here.
ISO-2022-JP is an example encoding which does not work that way, even though it uses only values < 128.
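A quick way to see this, as a minimal sketch using Python's built-in `iso2022_jp` codec (the sample string is my own, not from this thread): every encoded byte is below 128, yet reading those bytes as ASCII does not give back the text.

```python
# Encode a short Japanese string with Python's built-in ISO-2022-JP codec.
# The sample text is an arbitrary example chosen for this sketch.
text = "こんにちは"
raw = text.encode("iso2022_jp")

# Every byte is 7-bit, so the data "looks like" ASCII on the wire.
all_seven_bit = all(b < 128 for b in raw)

# Decoding as ASCII succeeds, but it yields escape sequences and punctuation
# rather than the original text.
as_ascii = raw.decode("ascii")

print(all_seven_bit)     # True
print(as_ascii == text)  # False
print(repr(as_ascii))
```

So "only uses values < 128" and "is ASCII-compatible" are two different properties.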
The servers and clients that use not-utf8 are the ones that clearly don't care what ircv3 says IMO.
(To be clear, duh recommend it, but it won't change anything)
For reference, ZNC had an issue where a decoded line using cp1140 could contain new lines where they were not present in the raw buffer.
https://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2018-14055
Looking at https://en.wikipedia.org/wiki/EBCDIC_037 the issue probably stemmed from "half of the control character codes can be translated into their exact ASCII equivalents".
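To illustrate the general class of bug (a sketch only, not ZNC's actual code path; the payload is made up), Python's built-in `cp1140` codec shows how a line feed can be absent from the raw bytes but present after decoding:

```python
# Build a byte string in cp1140 (EBCDIC) that contains an encoded line feed.
# The payload text is a made-up example, not taken from the ZNC report.
raw = "hello\nQUIT".encode("cp1140")

# A parser that splits the raw buffer on the ASCII newline byte (0x0A) sees
# one line, because EBCDIC does not store LF as 0x0A.
raw_has_newline = b"\n" in raw

# After decoding, the newline appears, so the "single" line is now two lines.
decoded = raw.decode("cp1140")
decoded_lines = decoded.split("\n")

print(raw_has_newline)  # False
print(decoded_lines)    # ['hello', 'QUIT']
```

Any code that splits on newlines before decoding, and then trusts each decoded line to be a single line, is exposed to this.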
Here's another point: websockets spec specifies UTF-8.
https://github.com/ircv3/ircv3-specifications/pull/342#issuecomment-647514606
We should use UTF-8, as it's almost universally used, including by POSIX locales.
I can in fact use UTF-8 on the modern *NIX command line. Anything copied and pasted from IRC should work with modern text editors, which, again, use UTF-8.
This should really be in IRCv3 as a spec. It should also note that using codepages, for any language, is obsolete, and that clients are discouraged or forbidden by the protocol from using them.
We now have the UTF8ONLY token which works great for new servers which only want to support UTF-8 but existing servers with non-UTF-8 clients need a way of migrating to this.
I propose we add a similar token which allows legacy servers to request UTF-8 from clients in a non-binding way to allow supporting clients to switch their connection configs to only send UTF-8 but still accept non-UTF-8 from other clients. This will allow existing servers to migrate to UTF-8 over time.
(And no, "fuck non-UTF-8 users" isn't an option for networks where people speak non-Western languages).
This proposal reminds me of STS. STS is a well-designed specification with careful attention to backwards compatibility considerations, but I don't see much evidence that it has actually led to greater adoption of TLS. It's not widely implemented by client developers, but more to the point, legacy clients that can't/won't upgrade to an STS-capable version (old versions of mainstream clients, bots, and other ad-hoc setups) are likely disproportionately represented among plaintext connections. So STS is not really a feasible transition mechanism to a TLS-only world. If you want to go TLS-only, at some point you have to tear off the band-aid and stop supporting plaintext, leaving those users out in the cold.
I think any proposal for an incremental migration to UTF8ONLY would face more or less the same concerns. Server operators could likely get equal or better results through a campaign of "nudging" users (via the MOTD and/or global notices) to reconfigure their clients.
isn't an option for networks where people speak non-Western languages
As I understand it, the main alternative to UTF8 in current use is Latin-1 (probably followed distantly by Shift-JIS), so it's not really "non-Western" languages that are affected.
UTF-8 support should be mandated for advertising IRCv3 compatibility. There should be fallback modes for old implementations.
There should be support for a UTF8ONLY flag/setting, and it should be entirely optional. All previous encodings should be marked as "legacy".
I think any proposal for an incremental migration to UTF8ONLY would face more or less the same concerns. Server operators could likely get equal or better results through a campaign of "nudging" users (via the MOTD and/or global notices) to reconfigure their clients.
I'm going to be direct: this might work for small networks with a dozen or so users like Ergo has but large networks do not want to do this and we need a migration path for large networks or they will never adopt UTF-8 which impedes support of any features which rely on it (see also: websockets).
I've actually seen data for nudging clients to migrate, as we did this on Snoonet when upgrading it to InspIRCd v3, but even after months of nudging there was still ~5% of total users (around 1/4 of broken users) who had not upgraded and whose clients broke when we finally pushed the upgrade. Switching encoding is an even bigger breakage than what we did then, so it would be very disruptive to hundreds of users (or potentially thousands on networks like Libera).
I think I understand the scope of the problem. I'm saying that there is no solution: a spec that claims to solve it will just add clutter.
I'm saying that there is no solution: a spec that claims to solve it will just add clutter.
You made a bunch of unsubstantiated incorrect claims about STS (several clients including AdiIRC, mIRC, and IRCCloud all support it and in HexChat it is currently awaiting review) and incorrectly applied these to a proposal which isn't even related.
This proposal reminds me of STS. STS is a well-designed specification with careful attention to backwards compatibility considerations, but I don't see much evidence that it has actually led to greater adoption of TLS. It's not widely implemented by client developers, but more to the point, legacy clients that can't/won't upgrade to an STS-capable version (old versions of mainstream clients, bots, and other ad-hoc setups) are likely disproportionately represented among plaintext connections. So STS is not really a feasible transition mechanism to a TLS-only world. If you want to go TLS-only, at some point you have to tear off the band-aid and stop supporting plaintext, leaving those users out in the cold.
this isn't really the goal of sts, it's about helping incorrectly configured modern clients. what will happen over time, as software naturally gets updated or the machines with outdated clients die, is clients will shift further towards being utf8 only until all that remains is people that pushed the wrong button.
As I understand it, the main alternative to UTF8 in current use is Latin-1 (probably followed distantly by Shift-JIS), so it's not really "non-Western" languages that are affected.
Shift JIS is a character encoding for the Japanese language
not sure where in the west japan is. anyway I don't really care about it mostly affecting "non-Western" languages, I care about it mostly affecting people that don't speak English.
What would you estimate as a realistic timeframe for either of these proposals "working", i.e. either
UTF8ONLY?
STS is not the same because it only creates a time-limited redirect for clients that do not connect securely. There is no transition period with STS because it does not migrate clients permanently. Assuming your users' clients support it you can just deploy it and disable connection registration on plaintext ports like testnet.inspircd.org does on 6667.
My proposal is that clients that see a UTF-8-preferred token should change their configuration permanently to send UTF-8 whilst still being able to receive non-UTF-8 from the server (e.g. via PRIVMSG). This migrates users' client configuration over time to use UTF-8.
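As a rough sketch of what the client side of this could look like: the token name `UTF8PREFERRED` below is hypothetical (the proposal does not name one), `parse_isupport` and `pick_encodings` are illustrative helpers rather than any client's real API, and only `UTF8ONLY` is an existing token.

```python
def parse_isupport(params):
    """Turn ISUPPORT parameters (e.g. ["UTF8ONLY", "NICKLEN=30"]) into a dict."""
    tokens = {}
    for param in params:
        name, _, value = param.partition("=")
        tokens[name] = value
    return tokens


# HYPOTHETICAL token name, invented for this sketch; the proposal names none.
PREFERRED_TOKEN = "UTF8PREFERRED"


def pick_encodings(tokens, configured="latin-1"):
    """Return (send_encoding, receive_fallback) for a connection."""
    if "UTF8ONLY" in tokens:
        # Existing token: everything on the connection is UTF-8.
        return "utf-8", None
    if PREFERRED_TOKEN in tokens:
        # Proposed behaviour: send UTF-8, but keep accepting legacy bytes.
        return "utf-8", configured
    # No hint from the server: keep the user's configured encoding.
    return configured, configured


print(pick_encodings(parse_isupport(["NICKLEN=30", "UTF8PREFERRED"])))
```

A real client would also persist the new send encoding in its per-network config, which is the "permanent" part of the proposal and is not shown here.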
Assuming your users' clients support it you can just deploy it and disable connection registration on plaintext ports like testnet.inspircd.org does on 6667.
That's the hard part, though --- the assumption that all clients will support it. If you do this then you are breaking connectivity for all clients that are configured to use plaintext and don't support STS. With the current state of STS adoption, I imagine this is a dealbreaker for some large networks (Libera?) that currently offer plaintext. So the question remains: what level of STS adoption would render this proposal feasible?
My proposal is that clients that see a UTF-8-preferred token should change their configuration permanently to send UTF-8 whilst still being able to receive non-UTF-8 from the server (e.g. via PRIVMSG). This migrates users' client configuration over time to use UTF-8.
If the end goal is to enable UTF8ONLY, then the question remains: what level of client adoption would be necessary to reach the end goal? If that's not the end goal, then what is? Is anything achieved in the meantime by a partial transition to UTF8?
I think the realistic migration path to UTF8ONLY involves (after an initial period of nudging) server-side transcoding.
webircproxy implements three configurable transcoding algorithms. The one most relevant to this issue is:
Run the chardet algorithm and transcode from the detected encoding to UTF-8. If chardet fails, transcode to UTF-8 using the Unicode replacement character.
Initial benchmarking (on a decade-old CPU) is encouraging. If a message is already UTF-8, this can be validated in less than a microsecond. The slow path (for messages that actually require transcoding) costs approximately 150 microseconds per message, which seems acceptable.
The other two modes of operation are significantly faster, around 5 microseconds per message: either transcode using a fixed encoding, or transcode using the Unicode replacement character.
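For readers who want something concrete, here is a minimal sketch of the three modes as I've described them. This is not webircproxy's actual code: `to_utf8` and its parameters are names made up for illustration, and `detect` is a hypothetical callable standing in for a chardet-style detector.

```python
def to_utf8(raw, fixed_encoding=None, detect=None):
    """Return `raw` as valid UTF-8 bytes.

    `detect` is a hypothetical callable taking bytes and returning an
    encoding name or None; it stands in for a chardet-style detector.
    """
    # Fast path: already valid UTF-8, pass through unchanged.
    try:
        raw.decode("utf-8")
        return raw
    except UnicodeDecodeError:
        pass

    # Pick a source encoding: detected if a detector is supplied, else fixed.
    encoding = detect(raw) if detect is not None else fixed_encoding
    if encoding is not None:
        try:
            return raw.decode(encoding).encode("utf-8")
        except (UnicodeDecodeError, LookupError):
            pass  # detection was wrong or the codec is unknown

    # Last resort: replace undecodable bytes with U+FFFD.
    return raw.decode("utf-8", errors="replace").encode("utf-8")


legacy = "café".encode("latin-1")
print(to_utf8(legacy, fixed_encoding="latin-1"))  # fixed-encoding mode
print(to_utf8(legacy))                            # replacement-character mode
```

The important property is that valid UTF-8 never reaches the slow path, which is why the common case stays cheap.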
If I may be salty for a second.
Clients should just default to strict-utf8 (with per-network ways to be strict-otherencoding). A few people will be upset for a year or two until updates make it to users, and then the majority will just have a sane client. HexChat did this in 2016, the world didn't end, people didn't mass-exodus from the software, and as of today people sending weird hybrid encodings to hexchat are treated as the weird broken text they are.
All of this planning about specs and trying to make everybody happy will go nowhere.
I wish you hadn't been salty about the entire process IRCv3 is undertaking for incremental and backwards-compatible improvements, but here we are.
A lot of the problem here is the amount of non-UTF8 data in the wild is mostly uncountable but I would assume a lot of it is from people using outdated clients on weird networks and probably using stuff like mIRC rather than hexchat. unquantifiable problem, but much as I can't assert it's a massive problem, you can't assert that it isn't.
but I would assume a lot of it is from people using outdated clients on weird networks and probably using stuff like mIRC rather than hexchat.
Then what can we do about it? They will never get new client features, they connect to servers that probably never get new features. They are in a bubble and will continue to be.
sure, but it leaves them in a stalemate. the server can't update because the clients won't
an additional note is we've got enclaves of clients like this on big networks like libera, and it means that libera can't meaningfully switch to being utf8-only without being hostile to users, which means any spec wanting support from libera is going to have to consider data being non-utf8 bytes or will need to wait past a transitional period of years and years and hope the nonutf8 scene quantifies and is much smaller than we thought or people just move on
an additional note is we've got enclaves of clients like this on big networks like libera
Is there a way to get data on the prevalence of this, and on what encodings they're using? Without that information it seems difficult to evaluate any proposal (including server-side transcoding).
Here are some thoughts in no particular order from today's IRC discussion:
I am most likely missing more points, but that's what I remembered off the top of my head.
The current situation is pretty sad, in that a lot of clients try to decode UTF-8 and, if that fails, fall back to one of the common encodings (like latin1), which still leads to messy results. Enforcing (or at least strongly suggesting) the use of Unicode will increase interoperability and compatibility between all kinds of IRC software.
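A small sketch of that fallback heuristic and why it goes wrong (the function name and the sample text are mine, not taken from any particular client):

```python
def decode_like_many_clients(raw, fallback="latin-1"):
    """Try UTF-8 first, then fall back to a single legacy encoding."""
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        # latin-1 maps every byte to a character, so this never fails,
        # even when the bytes were really in some other encoding.
        return raw.decode(fallback)


sent = "こんにちは"
wire = sent.encode("shift_jis")  # the sender's client uses Shift JIS
shown = decode_like_many_clients(wire)

print(shown == sent)  # False: the receiver sees mojibake, with no error raised
```

The fallback never fails, so the client has no way to tell that it guessed wrong.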
Legacy charsets are not compatible with one another, and are super limited. A glaring example of this is emoji. If a client wants to correctly send and display them, it needs to handle Unicode correctly.