kent-karlsson / control

ECMA-48 update proposals + math expression representation proposals

CSI 0m vs. BiDi problem; unreasonable switch to SGR for some features #18

Closed egmontkob closed 1 year ago

egmontkob commented 1 year ago

This bug report assumes that I understood CSI 0m (a.k.a. SGR 0) in that comment correctly; that is, that you intend to say that

CSI 0m [...] closes [...] bidi controls (ECMA-48 or Unicode)

This cannot reasonably work.


Formatting a document according to the Unicode BiDi Algorithm (UBA) works along these lines:

First, using means outside of UBA, the to-be-displayed document is divided into paragraphs. A paragraph is what people usually use this word for: a fragment of text surrounded by at least line breaks, if not some more prominent visual separation. (Other clearly separated pieces of text, such as titles, image captions, footnotes, list items, table cells etc. are also considered "paragraphs" for this purpose.)

Then for each paragraph the actual core of UBA is run, which (relying on some external help in deciding how much text fits into a visual line and when to start the next line) tells in what order the letters need to be laid out.
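The two steps above can be sketched roughly as follows. This is only an illustration, not a UBA implementation: the function names are made up, the "a newline ends a paragraph" rule is an assumption, and the base-direction check is a simplified version of UBA rules P2/P3 that ignores isolates. The point is that the paragraph is the unit UBA works on, and nothing inside it restarts the computation.

```python
import unicodedata

def split_paragraphs(text):
    """Step 1, outside UBA: divide the document into paragraphs.
    Assumption for this sketch: every newline ends a paragraph."""
    return text.split("\n")

def base_direction(paragraph):
    """Simplified UBA rules P2/P3: the first strong character
    (bidi class L, R or AL) decides the paragraph direction."""
    for ch in paragraph:
        bidi_class = unicodedata.bidirectional(ch)
        if bidi_class == "L":
            return "ltr"
        if bidi_class in ("R", "AL"):
            return "rtl"
    return "ltr"  # no strong character found: fall back to LTR

# Step 2 is run once per paragraph, over the whole paragraph.
doc = "hello \u05e9\u05dc\u05d5\u05dd\n\u05e9\u05dc\u05d5\u05dd hello"
directions = [base_direction(p) for p in split_paragraphs(doc)]
```

Each paragraph gets exactly one base direction, and the reordering that follows is computed over the whole paragraph at once.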

There is no Unicode character, nor some other control mechanism, that could tell UBA in the middle of a paragraph to "close BiDi controls" (i.e. all of them, without knowing which ones and how many are open).

If you wanted people to implement something like this, you'd have to open up UBA, make modifications to it, publish those modifications exactly, and expect people to invoke this modified algorithm (which would presumably have no out-of-box implementation to use). This is clearly not going to happen.

So this approach is not feasible at all.


Thus the only way to reasonably implement your idea of resetting the BiDi controls is to declare that CSI 0m is a paragraph separator.

Which is not how it works currently in terminals. In current practice, it only modifies rendering properties of the characters that can apply to each character individually, it does not (and should not) modify the overall layout. Geez, it does not even start a new line. It does not even add a space.

You can't finish a BiDi paragraph in the middle of a line and start a next one there in the same line. The horizontal order of those two partial lines (appearing in the same line) would be undetermined. It's exactly UBA that's supposed to tell the order, but you'd explicitly stop UBA from doing this. UBA wasn't designed to handle this situation. (If UBA wanted to and could handle this situation, it would operate on sentences (or something similar) rather than paragraphs.)

Modifying CSI 0m to start a new paragraph (i.e. a new line at the very least) would not only break tons of existing software, but would also result in a terribly ugly grammar for your new language. A CSI ... m escape sequence would start a new paragraph, i.e. carry a much stronger semantic and visual meaning for the overall structure, if and only if there's a number 0 (or an empty string) somewhere among the parameters? Control codes with such a strong effect need their own dedicated escape sequences. The value of a numerical parameter shouldn't be able to fundamentally change the extent of the escape sequence's effect.
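To make the grammar objection concrete, here is a minimal sketch of the check a consumer would have to run on every SGR sequence under that reading. The helper names are hypothetical, and the parser only handles the 7-bit `ESC [ ... m` form with semicolon-separated parameters.

```python
import re

# Matches a complete 7-bit SGR sequence: ESC [ params m
SGR_RE = re.compile(r"\x1b\[([0-9;]*)m")

def sgr_params(seq):
    """Return the numeric parameters of an SGR sequence.
    An empty parameter counts as 0, so CSI m is the same as CSI 0m."""
    match = SGR_RE.fullmatch(seq)
    if match is None:
        raise ValueError("not an SGR sequence")
    return [int(p) if p else 0 for p in match.group(1).split(";")]

def would_end_paragraph(seq):
    """Hypothetical rule: the sequence ends the paragraph
    if and only if a 0 appears among its parameters.
    Naive on purpose: it would also fire on a 0 that is merely an
    argument of another parameter, e.g. the colour index in 38;5;0."""
    return 0 in sgr_params(seq)
```

With this rule `CSI 1;31m` is a plain attribute change, while `CSI 1;0;4m` and `CSI m` would both restructure the document, which is the asymmetry objected to above.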

CSI 0m is very often used in terminals to close an opened attribute, such as color, boldness, etc. Due to a legacy design bug in terminfo, tools operating via the terminfo layer can't even close the "bold" attribute on its own: they have no choice but to close everything (CSI 0m) and then reopen the properties they wish to keep. The frequent use of CSI 0m within a paragraph of text is not going to disappear.
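A minimal sketch of that reset-and-reopen pattern, with made-up helper names and attributes tracked as a plain list of SGR parameters (1 is bold, 31 is red foreground):

```python
ESC = "\x1b"

def sgr(*params):
    """Build an SGR sequence, e.g. sgr(1, 31) -> ESC [ 1 ; 31 m."""
    return f"{ESC}[{';'.join(str(p) for p in params)}m"

def drop_bold_keep_rest(active):
    """Turn bold off using only a full reset: emit CSI 0m, then
    re-emit every attribute in `active` except bold (parameter 1)."""
    kept = [p for p in active if p != 1]
    return sgr(0) + (sgr(*kept) if kept else "")

# Bold red "warning:", then non-bold red for the rest of the line.
line = (sgr(1, 31) + "warning:" + drop_bold_keep_rest([1, 31])
        + " rest of the line" + sgr(0))
```

Here CSI 0m ends up in the middle of a run of text, with no line break anywhere near it.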

So this approach is not reasonable either.


The only reasonable choice is: CSI 0m should not affect BiDi.

And while at it... if you look at the SGR (a.k.a. CSI ... m) codes of ECMA, you'll notice that all of them only tamper with local decorative properties of the characters, nothing of larger scope.

It's presumably not just BiDi that is problematic to integrate here. You squeeze other properties in here too, like line spacing, line indents and justification, advancement modification, margins etc. Given the current widespread use of CSI 0m for resetting only the per-character attributes, it should not reset properties of a different scope.

I believe there's a good reason these weren't made part of the SGR set decades ago, keeping SGR solely for per-character decorative attributes. Everything has its place: for per-character attributes it's SGR, for others it's something else in ECMA. Clear and clean story. Why change it, why blur together some fundamentally different concepts?


I know you say:

CSI 0m should not be used in any other context [than at the beginning of prompts]

This has no connection with reality whatsoever, and it's absolutely hopeless to change the world of terminal emulation to make this a requirement.

A "should not" is not a "must not", and terminals would need to know what to do if a CSI 0m is not at the beginning of a line.

There's no guarantee that a prompt is preceded by a newline. What if it isn't? (In fact, notice that ECMA doesn't mention "command line" or "prompt" or the like at all; they are not concepts a terminal should know anything about.)

(And you also talk about some ISCII story where CSI 0m might appear, presumably not necessarily after a newline, but I cannot comment on this.)

kent-karlsson commented 1 year ago

See reply to issue 16.