Closed egmontkob closed 1 year ago
No, it is a hint of a "possible use"; it is not part of the proposal!
If one were to make a proposal for this, the use would be very close to the use of the "charset" declaration in HTML. For instance, it would have to be at or very near the beginning of the document, and any "later" occurrence would be ignored.
In addition, "charset" is the term used by IANA (and HTML). That it is technically wrong is nothing that can be fixed (ever?), especially not in this proposal (which in addition does not propose this, but only hints at a "possible use").
However, it is part of the proposal to disallow (or ignore) escape sequences and control codes that (used to) do code page switching (to the extent they were at all implemented).
Also: UTF-16, for instance, would be fine to use in conjunction with the proposed update. But not for terminal emulators.
You propose SOS charset=charset-name ST.
(Let's put aside the subtle difference that you're talking about encoding and not charset, even HTML gets it wrong.)
Please remove this -- at the very very least remove the possibility of specifying the charset anywhere within the document, and if you insist on keeping this feature then come up with a syntax that mandates that the field is at the very beginning of the document.
The world has known for 30+ years how to handle all the characters correctly in the same document, and it's been widely adopted for at least the latter half of this period.
Adding support for arbitrary encodings means that every implementation MUST support this tag, and MUST decode the following text accordingly. It's a tremendous amount of additional work for every implementation. Support for every other feature is optional: an implementation is free to say "I don't support bold, or underline, or colors", but replacing the human-visible letters with other human-visible letters is always unacceptable.
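To illustrate the "other human-visible letters" failure mode, here is a minimal Python sketch. The sample string is made up for illustration; the point is only that a consumer which ignores or mishandles the declared encoding does not fail loudly, it silently shows different letters.

```python
# A producer writes UTF-8, a consumer wrongly assumes Latin-1.
text = "naïve café"
wire = text.encode("utf-8")

# Every byte is a valid Latin-1 character, so decoding "succeeds"
# and the reader sees different letters instead of an error.
misread = wire.decode("latin-1")

print(misread)
```

Ignoring bold or colour degrades the presentation; ignoring the encoding changes the content itself.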
Nowadays many terminal emulators only support UTF-8. There was never an escape sequence (as far as I know) to switch to an arbitrary encoding; there was one to switch from the default to UTF-8 and back. But given that for 15-ish years all major distributions have been shipping UTF-8 as their default, this escape sequence became a no-op. VTE, for one, has kept (for now) support for various encodings, but has dropped support for this particular pair of escape sequences and no one complained.
You're creating a richtext-like protocol in order to improve the plain text experience, but on-the-fly encoding change was never part of plain text, so why introduce it now, when the industry outgrew the need for it decades ago?
I can absolutely not imagine any scenario whatsoever where it's desirable to switch encoding in the middle of the document.
It's especially unreasonable to require implementing it if a change of the encoding would change the escape sequences themselves. On-the-fly change from UTF-8 to UTF-16? Hell no thanks! On-the-fly change from UTF-8 to Latin-1 where C1 control chars are used? Hell no! (In the unlikely case that you don't understand me, please write an efficient, fast, clean, maintainable parser and share it with us here.)
Note that even the possibility of optionally specifying the encoding in a header (let's say within the very first escape sequence) plays really badly with the possibility of C1 controls (but that's for another time: this is another reason not to allow C1): Depending on what the escape sequence will contain, you need to recognize differently where that very escape sequence begins. No, no, no!
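A small Python sketch of why C1 controls and a switchable encoding interact so badly. It uses CSI (U+009B) as the example C1 control; the two scanner functions are hypothetical helpers written for this illustration, not part of any terminal emulator. In Latin-1 the control is the single byte 0x9B, in UTF-8 it is the two bytes 0xC2 0x9B, and 0x9B also occurs as a continuation byte inside ordinary UTF-8 letters.

```python
CSI = "\u009b"  # C1 Control Sequence Introducer

def find_c1_csi_latin1(data: bytes) -> list[int]:
    """Offsets a Latin-1 scanner would treat as CSI (bare 0x9B bytes)."""
    return [i for i, b in enumerate(data) if b == 0x9B]

def find_c1_csi_utf8(data: bytes) -> list[int]:
    """Offsets a UTF-8 scanner would treat as CSI (the pair 0xC2 0x9B)."""
    return [i for i in range(len(data) - 1) if data[i:i + 2] == b"\xc2\x9b"]

# The same logical content, "<letter> CSI 1 m", in two encodings.
latin1_stream = ("caf\u00e9 " + CSI + "1m").encode("latin-1")
utf8_stream = ("\u011b " + CSI + "1m").encode("utf-8")  # U+011B is c4 9b in UTF-8

# Each scanner is only right about its own encoding.
print(find_c1_csi_latin1(latin1_stream), find_c1_csi_utf8(latin1_stream))
print(find_c1_csi_latin1(utf8_stream), find_c1_csi_utf8(utf8_stream))
```

The Latin-1 scanner reports a spurious CSI inside the letter U+011B of the UTF-8 stream, and the UTF-8 scanner finds no CSI at all in the Latin-1 stream. A parser therefore has to know the encoding before it can even locate the escape sequence that is supposed to announce the encoding.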
My preference would be: Not have any such escape sequence. Mandate UTF-8 anywhere where this format is used in the interface. As long as it's internal to some application, leave it to that app what it prefers to use (could even be Latin-1 or UTF-16 or UTF-32). Or just entirely leave it unspecified. You're beefing up the plain text experience, and the encoding info has always been carried externally to the plain text. Maybe it can remain external.
Alternatively, maybe: Mandate that the encoding, if specified, is at the very beginning. But I'm warning you: Implementations will get it wrong (if there are any implementations to speak of), and some will not bother handling the field as required. It's a giant unnecessary burden for any consumer of such a file; a much, much bigger burden than it is for any producer to have to produce the file in UTF-8.
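For what the "only at the very beginning" variant might look like on the consumer side, here is a rough Python sketch. Everything in it is an assumption made for illustration: the function name `split_leading_charset`, the restriction to the 7-bit forms of SOS (ESC X) and ST (ESC \), the ASCII-only payload, and the fallback to UTF-8. None of this comes from the proposal beyond the `SOS charset=charset-name ST` shape.

```python
import codecs

SOS = b"\x1bX"   # 7-bit form of SOS (ESC X); C1 forms deliberately not accepted
ST = b"\x1b\\"   # 7-bit form of ST (ESC \)
PREFIX = SOS + b"charset="

def split_leading_charset(data: bytes, default: str = "utf-8"):
    """Return (encoding, body); a declaration counts only at offset 0."""
    if data.startswith(PREFIX):
        end = data.find(ST, len(PREFIX))
        if end != -1:
            name = data[len(PREFIX):end].decode("ascii", errors="replace")
            try:
                codecs.lookup(name)          # reject unknown encoding names
            except (LookupError, ValueError):
                name = default
            return name, data[end + len(ST):]
    # No declaration at the start: later ones are ordinary (ignored) content.
    return default, data

doc = SOS + b"charset=latin-1" + ST + b"caf\xe9"
encoding, body = split_leading_charset(doc)
print(encoding, body.decode(encoding))
```

Even this small sketch has to pick answers for several open questions (which introducer forms are legal, what to do with an unknown name, what happens to a declaration that appears later), and each of them is a place where independent implementations could diverge.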