gwsw / less

Less - text pager
http://greenwoodsoftware.com/less

8-bit CSI (Windows) #172

Open adoxa opened 3 years ago

adoxa commented 3 years ago

I noticed a "\x9B\xC4" sequence was removed, which was strange, since I've defined all the "extended ASCII" as text. Then I remembered that '\x9B' is the 8-bit CSI. Since Windows doesn't support that, atm I've simply done:

#if MSDOS_COMPILER
#define IS_CSI_START(c) (((LWCHAR)(c)) == ESC)
#else
#define IS_CSI_START(c) (((LWCHAR)(c)) == ESC || (((LWCHAR)(c)) == CSI))
#endif

but I'm not sure if that's really the best approach. Maybe it'd be better to test if the system/charset supports CSI?

gwsw commented 3 years ago

How are you thinking that the test for CSI would work? I don't know how it would be possible to test for CSI support.

adoxa commented 3 years ago

I thought there'd be something in terminfo for it, but apparently not; by charset I mean something like (((LWCHAR)(c)) == CSI && control_char(c)). It seems that would pretty much only be meaningful on DOS/Windows anyway, so the compiler test sounds like the way to go after all.

gwsw commented 3 years ago

Well, this issue is a little unclear to me. It seems to me that the presence of a 0x9B CSI is related to the source of the document being viewed, rather than the system that less is running on. It would be unexpected behavior for a document containing a 0x9B CSI to appear different when viewed on a Windows system. Maybe this should be controlled by the selected charset?

adoxa commented 3 years ago

Another issue is that the SGR emulation only detects 7-bit, not 8-bit. I've never come across 8-bit CSI, so I'm content with the compiler #if, which I could just add to my own build.

gwsw commented 3 months ago

Closing as (sort of?) resolved. Please reopen if there is still an issue for less.

adoxa commented 3 months ago

The issue remains: with -R, CSI (character '\233') is unconditionally treated as a control character, when it should remain a normal character.

C:\Projects\less>less --version
less 660 (Spencer V8 regular expressions)
Copyright (C) 1984-2024  Mark Nudelman

less comes with NO WARRANTY, to the extent permitted by law.
For information about the terms of redistribution,
see the file named README in the less distribution.
Home page: https://greenwoodsoftware.com/less

C:\Projects\less>set less
LESSBINFMT=*k[%X]
LESSCHARDEF=8bcccbcc12bc5b.
LESSOPEN=|-lessutf %s
LESSSEPARATOR=/

C:\Projects\less>chcp
Active code page: 437

C:\Projects\less>type not-csi8.txt
[¢£]

C:\Projects\less>less -F not-csi8.txt
[¢£]

C:\Projects\less>less -RF not-csi8.txt
[]

avih commented 3 months ago

... with -R CSI (character '\233') ...

Shouldn't that depend on the encoding? Because the byte \233 (0x9b, decimal 155) is invalid on its own in UTF-8, for instance.

Admittedly, I never understood how the 7/8-bit terminal thing works (not even after briefly reading the xterm "Control Bytes, Characters, and Sequences" section), but I'd think that "8-bit control", i.e. the C1 class (128 to 159 decimal), is not used in modern contexts.

Where should it matter? In old content, encoded in some 8-bit codepage, which contains C1 controls like 0x9b for CSI instead of the more common ESC [, and which uses SGR sequences that are supposed to pass through to the terminal with -R?

Is "less" supposed to transform it somehow to 7-bit SGR?

And yes, the SGR processor doesn't currently handle it (nor does the new one which is intended to replace the current one).

avih commented 3 months ago

I'm content with the compiler #if, which I could just add to my own build.

Can you explain which issue this is supposed to solve?

avih commented 3 months ago

OK, I understand the issue: if 0x9b appears as part of UTF-8, like U+26D2 ⛒ (\xe2\x9b\x92), then it displays incorrectly in "less" in some cases.

This is because the current CSI detection in "less", which is defined as

#define CSI             ((unsigned char)'\233')
...
#define IS_CSI_START(c) (((LWCHAR)(c)) == ESC || (((LWCHAR)(c)) == CSI))

incorrectly identifies the 0x9b as CSI instead of a UTF-8 continuation byte, and messes up the output.
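
To make the overlap concrete, here is a small standalone sketch (not less source code; the ESC/CSI values and the macro shape are just copied from the snippet above) showing that 0x9b both matches the UTF-8 continuation-byte pattern and satisfies the unconditional IS_CSI_START test:

/* Standalone illustration: the middle byte of U+26D2 (e2 9b 92) trips an
 * unconditional 8-bit CSI test even though it is a UTF-8 continuation byte. */
#include <stdio.h>

#define ESC 0x1B
#define CSI 0x9B
#define IS_CSI_START(c) ((c) == ESC || (c) == CSI)   /* the unconditional form */

int main(void)
{
    const unsigned char seq[] = { 0xE2, 0x9B, 0x92 };  /* UTF-8 for U+26D2 ⛒ */
    for (size_t i = 0; i < sizeof seq; i++) {
        unsigned char c = seq[i];
        int is_cont = (c & 0xC0) == 0x80;  /* continuation bytes look like 10xxxxxx */
        printf("%02X: continuation=%d csi_start=%d\n", c, is_cont, IS_CSI_START(c));
    }
    return 0;  /* for 9B both tests are true, which is exactly the ambiguity */
}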

However, I can't reproduce an issue on Windows normally:

It displays correctly as ⛒ in both cases for me.

However, when using LESSOPEN it's broken for me (⛒ may appear as a block if the font doesn't have this glyph; Windows Terminal displays it correctly):

R:\temp>less -XE test.txt
⛒

R:\temp>less -RXE test.txt
⛒

R:\temp>type test.txt | less -XE
⛒

R:\temp>type test.txt | less -RXE
⛒

R:\temp>set LESSOPEN=^| type %s

R:\temp>less -XE test.txt
<9B><92>

R:\temp>less -RXE test.txt
<9B><92>

R:\temp>type test.txt | busybox xxd
00000000: e29b 920a                                ....

It's unclear to me why detecting 0x9b as CSI removes the byte before it (0xe2 is removed, 0x9b 0x92 are kept).

It's not an issue with "type" as far as I can tell, because it seems to pass the content correctly (at least to busybox xxd and when piped into less manually), and also because I reproduced it with other LESSOPEN filters, like source-highlight.

So I think there is an issue, and it should be reopened.

As far as I can tell, the description is: "On windows, when using LESSOPEN, content which contains 0x9b as part of valid UTF-8 displays incorrectly"

I didn't test whether this happens on non-windows platforms.

adoxa commented 3 months ago

Is "less" supposed to transform it somehow to 7-bit SGR?

No, it's supposed to leave it alone and display it like any other normal character. Windows (and DOS) have never had 8-bit CSI (or 8-bit control characters in general; they've always been displayable characters). That's what the #if solves.

avih commented 3 months ago

Is "less" supposed to transform it somehow to 7-bit SGR?

No, it's supposed to leave it alone

Well, this is what you want it to do.

My question was about why it's there originally. I.e. in which use cases IS_CSI_START benefits from also testing 0x9b, and when detected as such, what is "less" supposed to do about it.

This was added at commit 0dc8ecf9 "Accept either ESC-[ or CSI (0x9B) as a start of color escape sequence. Patch from Andreas Wetzel."

So presumably this is supposed to detect CSI in the input, apparently in order to detect SGR sequences in the content. The question is why it's needed, and then what happens?

I'm guessing at least one thing:

And possibly another thing:

So the question is when can the content contain 8-bit CSI?

And I think the answer is "almost never", because in all the code pages I can dig up, including the old IBM PC BIOS one, the range 128-159 (C1) is used for graphic symbols (specifically, 0x9b looks similar to >) and not as C1 controls. But I'm guessing there's also some 8-bit ASCII encoding where 128-159 are C1. Maybe some teletypes or 70's/80's terminals?

From C0_and_C1_control_codes:

Except for SS2 and SS3 in EUC-JP text, and NEL in text transcoded from EBCDIC, the 8-bit forms of these codes were almost never used. CSI, DCS and OSC are used to control text terminals and terminal emulators, but almost always by using their 7-bit escape code representations. Nowadays if these codes are encountered it is far more likely they are intended to be printing characters from that position of Windows-1252 or Mac OS Roman.
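
(For reference, ECMA-48 defines each C1 control 0x80-0x9F as equivalent to ESC followed by the byte minus 0x40, so the 8-bit CSI 0x9B corresponds to ESC [.) A minimal conversion sketch, purely illustrative and not something less does or is being asked to do:

/* Illustrative sketch: rewrite an 8-bit C1 control as its 7-bit ESC form
 * per ECMA-48 (0x80 + n  ->  ESC, 0x40 + n). Not part of less. */
#include <stdio.h>

static void put_7bit(unsigned char c)
{
    if (c >= 0x80 && c <= 0x9F) {  /* C1 range */
        putchar(0x1B);             /* ESC */
        putchar(c - 0x40);         /* 0x9B -> '[' (CSI), 0x9D -> ']' (OSC) */
    } else {
        putchar(c);
    }
}

int main(void)
{
    const unsigned char sgr[] = { 0x9B, '3', '1', 'm' };  /* 8-bit "CSI 31 m" */
    for (size_t i = 0; i < sizeof sgr; i++)
        put_7bit(sgr[i]);          /* emits ESC [ 3 1 m */
    putchar('\n');
    return 0;
}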

So I guess the issue is that the 8-bit CSI is always tested, while in fact it should only be tested when the encoding/lesscharset/whatever is this specific 8-bit-ascii thing, and not one of the more common encodings.

I don't think it's related to windows. I think it's related to the supposed encoding of the content, where it should be tested only if the encoding is such that 128-159 are indeed C1 and not graphic symbols.

I don't know if "less" has an encoding name which is recognized as such, but that's the only case where I think 0x9b should be tested.

It is weird that on windows it only seems to happen when using LESSOPEN - and also regardless of -R. This might be worth some investigation, because it could be another unrelated issue where the content is parsed differently depending on LESSOPEN.

On linux it displays correctly with LESSOPEN="|cat %s" less test.txt where test.txt contains U+26D2 ⛒.

avih commented 3 weeks ago

A few comments:

However, when using LESSOPEN it's broken for me (⛒ may appear as a block if the font doesn't have this glyph; Windows Terminal displays it correctly):

Create a text file test.txt which contains ⛒ (\xe2\x9b\x92) followed by a newline.

R:\temp>type test.txt | less -RXE
⛒

R:\temp>set LESSOPEN=^| type %s

R:\temp>less -XE test.txt
<9B><92>

On linux it displays correctly with LESSOPEN="|cat %s" less test.txt where test.txt contains U+26D2 ⛒.

This did not exhibit an issue on linux because I was using the system less (v590) instead of my own build. When using my own build of less it was indeed broken on linux too, with or without -R:

$ xxd test.txt
00000000: e29b 920a
$ LESSOPEN= ./less -XE test.txt
⛒
$ LESSOPEN= ./less -RXE test.txt
⛒
$ LESSOPEN="|cat %s" ./less -XE test.txt
<9B><92>
$ LESSOPEN="|cat %s" ./less -RXE test.txt
<9B><92>

However, this was fixed just now by 03f9f1a6, but that was a different issue, where the 1st byte from the LESSOPEN filter is ignored if it's more than 127 - like any UTF-8 sequence - when char is signed (which it is on windows and linux). A generic sketch of this signed-char pitfall appears at the end of this comment.

Anyway, indeed, with commit 03f9f1a6 all my test cases now work correctly for me on both linux and windows. But this means that I can't reproduce the original issue anymore where 0x9b is interpreted as CSI rather than as part of UTF-8.

I also still can't reproduce the example given above:

C:\Projects\less>less --version
less 660 (Spencer V8 regular expressions)
...

C:\Projects\less>set less
LESSBINFMT=*k[%X]
LESSCHARDEF=8bcccbcc12bc5b.
LESSOPEN=|-lessutf %s
LESSSEPARATOR=/

C:\Projects\less>chcp
Active code page: 437

C:\Projects\less>type not-csi8.txt
[¢£]

C:\Projects\less>less -F not-csi8.txt
[¢£]

C:\Projects\less>less -RF not-csi8.txt
[]

Because:

- I don't know which program is used and what it does: LESSOPEN=|-lessutf %s
- I don't know what the binary content of the file not-csi8.txt is. If I copy-paste [¢£] then I get the bytes 5b c2 a2 c2 a3 5d, which doesn't contain 0x9b (the CSI value). Is there some failed conversion here?

@adoxa can you please try master, and if it's still broken for you (without your patch to disable CSI on windows), provide a test case, preferably minimal, to reproduce the issue?
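
As mentioned above, a generic illustration of the signed-char pitfall (this is not the code changed in 03f9f1a6, only the general failure mode of storing a byte above 127 in a plain char where char is signed):

/* Generic illustration only, not the actual less code fixed in 03f9f1a6:
 * what goes wrong when a byte above 127 is stored in a plain char on
 * platforms where char is signed (windows and linux, as noted above). */
#include <stdio.h>

int main(void)
{
    char c = (char)0xE2;   /* first byte of the UTF-8 sequence e2 9b 92 */
    if (c >= 0x80)         /* never true when char is signed: c is -30 here */
        printf("treated as a UTF-8 lead byte\n");
    else
        printf("misclassified, c = %d\n", c);  /* prints -30 with signed char */
    return 0;
}
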
avih commented 3 weeks ago

OK, I think I have at least one minimal test case which still exhibits the issue. As far as I can tell it doesn't need LESSOPEN or LESSBINFMT or LESSSEPARATOR, and it's enough to modify LESSCHARDEF to trigger the issue.

The file test.txt contains 4 bytes: 78 9b c4 79. These are the ascii x (0x78), the bytes 9b and c4, and the letter y (0x79).

In codepage 437 the byte 9b is the cent sign U+00A2 ¢, and the byte c4 is U+2500 BOX DRAWINGS LIGHT HORIZONTAL ─.
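
(If it helps with reproduction, here is a throwaway C sketch that writes exactly these 4 bytes to test.txt; any equivalent method works.)

/* Throwaway helper: writes the 4 test bytes 78 9b c4 79 to test.txt. */
#include <stdio.h>

int main(void)
{
    const unsigned char bytes[] = { 0x78, 0x9B, 0xC4, 0x79 };  /* x, 0x9b, 0xc4, y */
    FILE *f = fopen("test.txt", "wb");
    if (f == NULL)
        return 1;
    fwrite(bytes, 1, sizeof bytes, f);
    fclose(f);
    return 0;
}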

In current master:

R:\>chcp
Active code page: 437

R:\>type test.txt
x¢─y

R:\>set|findstr LESS

R:\>less -RXE test.txt
x<9B><C4>y

R:\>set LESSCHARDEF=8bcccbcc18b.

R:\>set|findstr LESS
LESSCHARDEF=8bcccbcc18b.

R:\>less -XE test.txt
x¢─y

R:\>less -RXE test.txt
xy

And I think the expected behavior when 33-255 are normal chars is that 0x9b should NOT be considered CSI, and should instead be printed verbatim.

(LESSCHARDEF=8bcccbcc18b. is the same as the internal "less" definition for "ascii", but 33-255 are normal chars)

And indeed, if I patch the source so that IS_CSI_START tests only ESC and not \233 (0x9b), then it works as expected above:

R:\>set|findstr LESS
LESSCHARDEF=8bcccbcc12bc5b.

R:\>less -RXE test.txt
x¢─y

So it seems that even when 33-255 are set to be normal chars via LESSCHARDEF, 0x9b is still identified as CSI_START, as far as I can tell (only when using -R; without -R it works as expected).

However, by default (where currently the default charset is UTF-8), 0x9b on its own is recognized as invalid UTF-8 and printed as <9B>, and if it's part of valid UTF-8, like in U+26D2 ⛒ (0xe2 0x9b 0x92), then by default it prints correctly as the full UTF-8 sequence, and 0x9b does not disappear from the output.

@gwsw Can you please clarify (to me) the logic of when 0x9b is considered IS_CSI_START? Because from observing the behavior it seems to me (when using -R) that it's not considered CSI_START with the UTF-8 charset, but is considered CSI_START with a custom LESSCHARDEF where 0x9b is defined as a regular char (which I think should be considered a bug).

I'd think this should also happen on linux and other unix, but I didn't test that.

avih commented 3 weeks ago

I think that 0x9b should be considered CSI_START only in charsets where 0x9b (155 decimal, 0233 octal) is defined as "control".

I also think that most charsets should not define 128-159 (C1 range) as control, because indeed this range is typically used for graphical characters. This probably holds for almost all single-byte encodings.

Maybe there should be an additional charset "ansi" which defines the C1 range 128-159 as control, so that 0x9b is identified as CSI_START in this charset?

Similarly, IMO, ESC (0x1b) should also be considered CSI_START (or OSC_START, if there is such a thing) only when the charset classifies it as "control".
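
Concretely, and only as a sketch of the idea rather than a tested patch (assuming less's control_char() helper in charset.c returns nonzero for bytes the active charset classifies as control), the gate could look something like:

/* Sketch only: treat ESC and 8-bit CSI as the start of a CSI sequence only
 * when the active charset classifies that byte as a control character.
 * Assumes control_char() reflects the LESSCHARDEF/charset classification. */
#define IS_CSI_START(c) \
    ((((LWCHAR)(c)) == ESC || ((LWCHAR)(c)) == CSI) && control_char((LWCHAR)(c)))

With a charset that marks 0233 as control this behaves as today, and with charsets that treat it as a normal character 0x9b would be left alone.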

gwsw commented 3 weeks ago

As far as I know, 0x9b is always considered to be CSI, regardless of the character set or the presence of -R. I think the fact that it displays as <9b> when the charset is UTF-8 is a somewhat unintended side effect of flush_mbc_buf(), which gets called to display an invalid UTF-8 sequence.

Treating 0x9b as CSI only when the charset defines it as a control char seems reasonable, but I haven't thought it through completely yet.

avih commented 3 weeks ago

I think the fact that it displays as <9b> when the charset is UTF-8 is a somewhat unintended side effect of flush_mbc_buf(), which gets called to display an invalid UTF-8 sequence.

It also works as expected (displayed correctly, not considered CSI) when it's part of valid UTF-8 with or without -R.

Treating 0x9b as CSI only when the charset defines it as a control char seems reasonable, but I haven't thought it through completely yet.

Right, and I think the same goes for ESC (and maybe the C1 OSC, if it exists - I didn't check, but that becomes complex because the terminal needs to treat it like that as well).

As for 0x9b as CSI, I didn't check how that works, but if, when using -R, it's passed through to the terminal, then less is assuming that the terminal can handle it too, and I'd guess that's not a bulletproof assumption either.

avih commented 2 weeks ago

By the way, I couldn't find an explanation in the manpage about the difference between "control" and "binary". How are they treated differently? E.g. in "ascii" 0-31 are binary, except 8/9/10/12/13 which are control (if I read it correctly). What gives?

And in "utf-8" charset, why do most of the upper half values considered binary and not normal?

What does "normal" mean to the code? Does it mean in single-byte encoding the same thing as in multi byte encoding like UTF-8?

gwsw commented 2 weeks ago

Control and binary chars are treated mostly the same. I think the only difference is if the file contains more than 5 "binary" chars in the first 256 bytes, less prints the "may be a binary file" warning message. The idea is that control chars are expected to appear in text files but binary chars are not. This should be documented better.
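
Illustratively, that check amounts to something like the following sketch (not the actual less code; binary_char() stands in for less's per-charset classification and its signature here is assumed):

#include <stddef.h>

extern int binary_char(unsigned long ch);  /* assumed: less's charset classifier */

/* Sketch of the described heuristic: count charset-"binary" bytes among the
 * first 256 bytes and warn if more than 5 are found. */
static int may_be_binary(const unsigned char *buf, size_t len)
{
    size_t n = len < 256 ? len : 256;
    int bin = 0;
    for (size_t i = 0; i < n; i++)
        if (binary_char(buf[i]))  /* charset-dependent classification */
            bin++;
    return bin > 5;  /* would trigger the "may be a binary file" warning */
}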

I'm not sure if the desc field of the utf-8 entry in the charsets table is actually used. I think in utf-8 mode it uses only the ubin and fmt tables in charset.c (built from the Unicode database) to determine whether UTF-8 characters are control, binary or printable. I suspect the charsets[] entry is left over from early versions that implemented UTF-8 parsing but didn't use the Unicode database. I'll look into this.

"Normal" just means printable. We assume that normal chars can be output directly to the terminal and won't do anything strange; just occupy one printing position (or two for double-width chars). It should mean the same in single-byte and multi-byte encodings.

avih commented 2 weeks ago

Control and binary chars are treated mostly the same. I think the only difference is if the file contains more than 5 "binary" chars in the first 256 bytes, less prints the "may be a binary file" warning message.

Though I guess this somehow doesn't hold for UTF-8 content, right?

The idea is that control chars are expected to appear in text files but binary chars are not. This should be documented better.

Same (for the current "desc" field for "utf-8")...

normal ... just occupy one printing position (or two for double-width chars). It should mean the same in single-byte and multi-byte encodings.

Same... it would indeed be awkward to try to use the desc field for UTF-8, except for values below 128.

gwsw commented 2 weeks ago

Though I guess this somehow doesn't hold for UTF-8 content, right?

In UTF-8 mode, it still checks for "binary" chars in the file. It considers malformed UTF-8 sequences to be binary data, and also chars that belong to certain categories in the Unicode database (Cc, Cs, Co, Zl and Zp).