rprichard / winpty

A Windows software package providing an interface similar to a Unix pty-master for communicating with Windows console programs.
MIT License

Recent Windows 10 ReadConsoleOutput change breaks non-ASCII text #105

Open rprichard opened 7 years ago

rprichard commented 7 years ago

A change to the console in new versions of Windows 10 (e.g. 15048, but not 15014) breaks winpty by effectively shifting cells at the start of a line to the end of the previous line when certain characters appear in certain fonts.

This winpty issue caused this downstream issue in VSCode, https://github.com/Microsoft/vscode/issues/19665.

Follow these links for more details:

zadjii-msft commented 7 years ago

@miniksa for visibility. I believe he changed some stuff in that API to make it consistent with ConhostV1.

miniksa commented 7 years ago

There's a fix that didn't make RS2 RTM. I'm looking into what we can do about it.

rprichard commented 7 years ago

It's probably easy for winpty to detect the broken ReadConsoleOutput behavior. (e.g. At startup, write some potentially-wide characters and see whether ReadConsoleOutput mishandles any of them.) winpty could then enable a workaround.

e.g. winpty could call ReadConsoleOutput as it does now, but restore the original API behavior by expanding a CHAR_INFO record into two CHAR_INFO records when the UnicodeChar value is a 2-column character. It would determine that by either:

  1. Duplicating the logic from PowerShell (https://github.com/Microsoft/vscode/issues/19665#issuecomment-287477274), or

  2. Asking the console: create a special, hidden console screen buffer, then use WriteConsole and ReadConsoleOutput. Write something like ${UnicodeChar}\nAB, then read a 2x2 rectangle and see whether the third cell is A or B.

This workaround would add a lot of complexity, so I'd prefer some other solution if possible.
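
As a rough illustration of option 2, the probe result could be interpreted as below. This is only a sketch: `Cell` is a portable stand-in for CHAR_INFO, `classifyProbe` and `ProbeResult` are hypothetical names, and it assumes the affected console emits a single record for a two-column character, so that the cells of the next row slide forward by one slot.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

// Portable stand-in for the Win32 CHAR_INFO record (hypothetical type).
struct Cell {
    char16_t ch;
    uint16_t attr;
};

enum class ProbeResult { OneColumnOrFixedApi, TwoColumnPacked, Unexpected };

// Interprets a 2x2 read, row-major: cells[0..1] are row 0, cells[2..3] are
// row 1, taken after writing "<probe char>\nAB" into a scratch buffer.
// Assumption: with the packing bug, a two-column probe character yields one
// record, so 'B' lands in the third slot instead of 'A'.
ProbeResult classifyProbe(const std::array<Cell, 4> &cells) {
    if (cells[2].ch == u'A') return ProbeResult::OneColumnOrFixedApi;
    if (cells[2].ch == u'B') return ProbeResult::TwoColumnPacked;
    return ProbeResult::Unexpected;
}
```

The actual WriteConsole and ReadConsoleOutput calls that fill the array are left out; only the decision step is shown.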

API note: ReadConsoleOutput makes it clear that the CHAR_INFO buffer is a two-dimensional array, but I don't think MSDN ever explains how a two-column character is supposed to be represented. It mentions the COMMON_LVB_{LEADING,TRAILING}_BYTE attributes, but doesn't describe them in any detail, and of course, the names aren't quite right -- we're looking for leading/trailing cells, not bytes.

Aside: winpty may also need to stop using such tiny fonts.

shirosaki commented 7 years ago

As a workaround, reading lines one by one seems to fix strange line break output.

diff --git a/src/agent/Win32ConsoleBuffer.cc b/src/agent/Win32ConsoleBuffer.cc
index ed93f40..dc4efce 100755
--- a/src/agent/Win32ConsoleBuffer.cc
+++ b/src/agent/Win32ConsoleBuffer.cc
@@ -157,8 +157,18 @@ void Win32ConsoleBuffer::setCursorPosition(const Coord &coord) {
 void Win32ConsoleBuffer::read(const SmallRect &rect, CHAR_INFO *data) {
     // TODO: error handling
     SmallRect tmp(rect);
-    if (!ReadConsoleOutputW(m_conout, data, rect.size(), Coord(), &tmp) &&
-            isTracingEnabled()) {
+    CHAR_INFO *buffer = data;
+    Coord bufferSize = Coord(rect.width(), 1);
+    BOOL success = TRUE;
+    for (SHORT y = rect.Top; y <= rect.Bottom; y++, buffer += rect.width()) {
+        tmp.Top = y;
+        tmp.Bottom = y;
+        if (!ReadConsoleOutputW(m_conout, buffer, bufferSize, Coord(), &tmp)) {
+            success = FALSE;
+            break;
+        }
+    }
+    if (!success && isTracingEnabled()) {
         StringBuilder sb(256);
         auto outStruct = [&](const SMALL_RECT &sr) {
             sb << "{L=" << sr.Left << ",T=" << sr.Top

rprichard commented 7 years ago

I think a better way to accomplish this is to set (useLargeReads, maxReadLines) to (false, 1). https://github.com/rprichard/winpty/blob/4978cf94b6ea48e38eea3146bd0d23210f87aa89/src/agent/LargeConsoleRead.cc#L50.

I'd expect that change to help mitigate the situation -- actually, if it works, it's probably a good idea.
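
The effect of capping reads at one line can be sketched as a row-splitting step. This is a minimal, hypothetical helper (`splitRows` is not a winpty function, and winpty's LargeConsoleRead code may be organized differently); it only shows how an inclusive row range would be broken into reads of at most `maxReadLines` rows.

```cpp
#include <cassert>
#include <utility>
#include <vector>

// Splits the inclusive row range [top, bottom] into chunks of at most
// maxReadLines rows. Each pair is an inclusive (firstRow, lastRow) range
// that a single read call would cover.
std::vector<std::pair<int, int>> splitRows(int top, int bottom, int maxReadLines) {
    assert(maxReadLines >= 1);
    std::vector<std::pair<int, int>> chunks;
    for (int y = top; y <= bottom; y += maxReadLines) {
        int last = y + maxReadLines - 1;
        if (last > bottom) last = bottom;
        chunks.emplace_back(y, last);
    }
    return chunks;
}
```

With `maxReadLines` set to 1, every chunk is a single row, which matches the line-by-line loop in the diff above.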

I'm not sure what is expected to appear at the end of the ReadConsoleOutput buffer -- IIRC, I've only seen CHAR_INFO records that are all zeros -- effectively black-on-black NUL. I wonder about the effect on the console->terminal conversion code here, https://github.com/rprichard/winpty/blob/4978cf94b6ea48e38eea3146bd0d23210f87aa89/src/agent/Terminal.cc#L369. I'm guessing it'd output an entire line, followed by a color change to Black-on-Black-plus-Conceal, followed by NULs that the terminal ignores? The workaround should probably include code removing zeroed CHAR_INFO values from the Terminal::sendLine width.

Edit: fix "followed by two NULs that the terminal ignores" -- it could be one NUL or any number of NULs.
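
A sketch of the trimming step mentioned above, assuming the unused tail of the read buffer really is all-zero records (that assumption is unconfirmed at this point in the thread). `Cell` stands in for CHAR_INFO and `effectiveWidth` is a hypothetical helper, not part of Terminal::sendLine.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Portable stand-in for the Win32 CHAR_INFO record (hypothetical type).
struct Cell {
    char16_t ch;
    uint16_t attr;
};

// Returns the line width with trailing all-zero cells dropped. Only the
// tail is trimmed; a zeroed cell in the middle of the line is kept.
std::size_t effectiveWidth(const std::vector<Cell> &line) {
    std::size_t width = line.size();
    while (width > 0 && line[width - 1].ch == 0 && line[width - 1].attr == 0) {
        --width;
    }
    return width;
}
```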

rprichard commented 7 years ago

@miniksa Can you confirm that when this issue (i.e. the double-column ReadConsoleOutput bug) occurs, the fields of the trailing CHAR_INFO values will be zero?

rprichard commented 7 years ago

This workaround would reduce the likelihood of a successful Scraper::scrollingScrapeOutput tentative read, but I think that's acceptable. I think it doesn't affect correctness, because the Scraper checks at the end whether the sync marker moved while it was reading screen buffer data and properties.

@shirosaki FWIW, that change would break Scraper::findSyncMarker, which assumes it can read an entire screen buffer column (3000 lines) efficiently and atomically.

miniksa commented 7 years ago

@rprichard The problem is that ReadConsoleOutput* is going to give you different information depending on whether the original text was written with WriteConsoleOutputA, WriteConsoleOutputW, WriteFile, WriteConsoleA, or WriteConsoleW AND whether or not a Raster font or TrueType font is used at the time.

I can't definitively tell you what the appropriate pattern for double-byte characters will be nor what the trailing fields will be filled with. The definitive solution would be to make a new API that is correct all the time or, my personal preference, to build a PTY mechanism directly into Windows and deprecate all the arguably terrible Windows Console APIs. But those solutions will take time.

In the meantime, as @zadjii-msft alluded to above, I recently wrote a massive test and ensured that the v2 console and the v1 console should be exactly the same and follow the below absolutely terrible too-many-dimensional matrix created from decades of bugs that are now preserved for compatibility. You are welcome to use this as a basis to try to figure out the right thing to provide through WinPty. Hopefully this will evolve into an MSDN article and/or blog post one day, time permitting (cough @bitcrazed cough).

There also might be some differences between what is listed below (the v1 behavior and the now-fixed v2 behavior) and what you see in some builds of Windows. That's because it got broken at some point and then fixed again at another point. This test is now in place and should be applicable going forward for all v2 consoles, but I'm not certain which builds the fix is in and which it is not in. I just know that this is what we should program to and fix anything that isn't working this way.

Also fair warning: I want to deprecate/remove Raster fonts from v2 in a future edition of the console, so please don't build too deeply on top of those. They don't scale for High DPI, they don't work for multilingual text, and they're just plain bad.

I would recommend that you choose one of these patterns that gives you the information you need with a TrueType font selected (or account for both font possibilities by using GetConsoleCurrentFontEx to check whether the current font is a Raster font, typically Terminal). Patterns 4, 5, 6, 8, and 9 and the associated API call scenarios are probably closest to what you want, but you might need to detect a few others depending on what the hosted app used to write its text.
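
One possible way to tell the two font kinds apart is the TrueType bit in the FontFamily member that GetConsoleCurrentFontEx fills in. The sketch below is an assumption about how a caller might do that, not something the thread prescribes. It takes the raw FontFamily value as a parameter and repeats the TMPF_TRUETYPE constant so that it builds without windows.h; `isTrueTypeConsoleFont` is a hypothetical name.

```cpp
#include <cassert>
#include <cstdint>

// Value of TMPF_TRUETYPE from wingdi.h, repeated here so the sketch builds
// without <windows.h>.
constexpr uint32_t kTmpfTrueType = 0x04;

// fontFamily is meant to be the FontFamily member of a CONSOLE_FONT_INFOEX
// structure filled in by GetConsoleCurrentFontEx.
bool isTrueTypeConsoleFont(uint32_t fontFamily) {
    return (fontFamily & kTmpfTrueType) != 0;
}
```

A caller that gets `false` here would treat the buffer as coming from a raster font and expect the corresponding patterns.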


Scenarios

Reading with ReadConsoleOutput*

WriteConsoleOutput + ReadConsoleOutput

Other writes + ReadConsoleOutput*

Other writes = CRT write (printf, etc.) A or W -OR- WriteConsoleOutputCharacter -OR- WriteConsole

Reading with ReadConsoleOutputCharacter*

WriteConsoleOutputW + ReadConsoleOutputCharacter*

Other writes + ReadConsoleOutputCharacter*

Other writes = CRT write (printf, etc.) A or W -OR- WriteConsoleOutputCharacter -OR- WriteConsole -OR- WriteConsoleOutputA


Patterns

Each table below shows what you would get if you had written to the console with the following string and settings:

The Attr field is the color as retrieved by either ReadConsoleOutput* in the CHAR_INFO structure or the attrs returned via ReadConsoleOutputAttribute.

  1. NOTE: For ReadConsoleOutput* tables below, the attr and wchar (char) columns will be in sync depending on what is in the CHAR_INFO structure.
  2. NOTE: For ReadConsoleOutputCharacter* tables below, the attr column is taken from ReadConsoleOutputAttribute and the wchar (char) column is taken from ReadConsoleOutputCharacter*. This can result in the attrs appearing misaligned with the chars in the table form when the ReadConsoleOutputAttribute method ends up returning more data than the ReadConsoleOutputCharacter* method.

The Wchar (char) field is the string that will be returned. If you are using an API that returns a CHAR_INFO structure, both pieces will be returned in the union. If you use an API that simply returns a string, it will return the relevant half to the A/W type of API you called.

The Symbol field explains the data that was received.
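
To read the attr column in the tables, it may help to split it into its color bits and the two lead/trail flags (COMMON_LVB_LEADING_BYTE is 0x0100 and COMMON_LVB_TRAILING_BYTE is 0x0200). A minimal sketch; `DecodedAttr` and `decodeAttr` are hypothetical names.

```cpp
#include <cassert>
#include <cstdint>

constexpr uint16_t kLeading = 0x0100;   // COMMON_LVB_LEADING_BYTE
constexpr uint16_t kTrailing = 0x0200;  // COMMON_LVB_TRAILING_BYTE

struct DecodedAttr {
    uint16_t color;  // foreground/background bits (low byte)
    bool leading;    // first cell of a two-column character
    bool trailing;   // second cell of a two-column character
};

DecodedAttr decodeAttr(uint16_t attr) {
    return {static_cast<uint16_t>(attr & 0x00FF),
            (attr & kLeading) != 0,
            (attr & kTrailing) != 0};
}
```

For example, 0x129 decodes to color 0x29 with the leading flag set, and 0x229 to the same color with the trailing flag set.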

Pattern 1

attr wchar (char) symbol
0x029 0x0051 (0x51) Q
0x029 0x3044 (0x44) Hiragana I
0x029 0x304B (0x4B) Hiragana KA
0x029 0x306A (0x6A) Hiragana NA
0x029 0x005A (0x5A) Z
0x029 0x0059 (0x59) Y
0x029 0x0058 (0x58) X
0x029 0x0057 (0x57) W
0x029 0x0056 (0x56) V
0x029 0x0055 (0x55) U
0x029 0x0054 (0x54) T
0x029 0x306B (0x6B) Hiragana NI
0x007 0x0020 (0x20) <space>
0x007 0x0020 (0x20) <space>
0x007 0x0020 (0x20) <space>
0x007 0x0020 (0x20) <space>

Pattern 2

attr wchar (char) symbol
0x029 0x0051 (0x51) Q
0x029 0x3044 (0x44) Hiragana I
0x029 0x304B (0x4B) Hiragana KA
0x029 0x306A (0x6A) Hiragana NA
0x029 0x005A (0x5A) Z
0x029 0x0059 (0x59) Y
0x029 0x0058 (0x58) X
0x029 0x0057 (0x57) W
0x029 0x0056 (0x56) V
0x007 0x0020 (0x20) <space>
0x007 0x0020 (0x20) <space>
0x007 0x0020 (0x20) <space>
0x007 0x0020 (0x20) <space>
0x000 0x0000 (0x00) <null>
0x000 0x0000 (0x00) <null>
0x000 0x0000 (0x00) <null>

Pattern 3

attr wchar (char) symbol
0x029 0x0051 (0x51) Q
0x029 0x3044 (0x44) Hiragana I
0x029 0x304B (0x4B) Hiragana KA
0x029 0x306A (0x6A) Hiragana NA
0x029 0x005A (0x5A) Z
0x029 0x0059 (0x59) Y
0x029 0x0058 (0x58) X
0x029 0x0057 (0x57) W
0x029 0x0056 (0x56) V
0x029 0x0055 (0x55) U
0x029 0x0054 (0x54) T
0x029 0x306B (0x6B) Hiragana NI
0x000 0x0000 (0x00) <null>
0x000 0x0000 (0x00) <null>
0x000 0x0000 (0x00) <null>
0x000 0x0000 (0x00) <null>

Pattern 4

attr wchar (char) symbol
0x029 0x0051 (0x51) Q
0x129 0x3044 (0x44) Hiragana I
0x229 0xFFFF (0xFF) Invalid Unicode Character
0x129 0x304B (0x4B) Hiragana KA
0x229 0xFFFF (0xFF) Invalid Unicode Character
0x129 0x306A (0x6A) Hiragana NA
0x229 0xFFFF (0xFF) Invalid Unicode Character
0x029 0x005A (0x5A) Z
0x029 0x0059 (0x59) Y
0x029 0x0058 (0x58) X
0x029 0x0057 (0x57) W
0x029 0x0056 (0x56) V
0x029 0x0055 (0x55) U
0x029 0x0054 (0x54) T
0x129 0x306B (0x6B) Hiragana NI
0x229 0xFFFF (0xFF) Invalid Unicode Character

Pattern 5

attr wchar (char) symbol
0x029 0x0051 (0x51) Q
0x129 0x3044 (0x44) Hiragana I
0x229 0x3044 (0x44) Hiragana I
0x129 0x304B (0x4B) Hiragana KA
0x229 0x304B (0x4B) Hiragana KA
0x129 0x306A (0x6A) Hiragana NA
0x229 0x306A (0x6A) Hiragana NA
0x029 0x005A (0x5A) Z
0x029 0x0059 (0x59) Y
0x029 0x0058 (0x58) X
0x029 0x0057 (0x57) W
0x029 0x0056 (0x56) V
0x029 0x0055 (0x55) U
0x029 0x0054 (0x54) T
0x129 0x306B (0x6B) Hiragana NI
0x229 0x306B (0x6B) Hiragana NI
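
A reader that receives Pattern 5 can recover the original text by skipping every cell whose trailing flag is set. The following is a sketch under that assumption; `Cell` is a portable stand-in for CHAR_INFO and `cellsToText` is a hypothetical helper.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <vector>

// Portable stand-in for the Win32 CHAR_INFO record (hypothetical type).
struct Cell {
    char16_t ch;
    uint16_t attr;
};

constexpr uint16_t kTrailing = 0x0200;  // COMMON_LVB_TRAILING_BYTE

// Collapses a row in the Pattern 5 layout back into text: each two-column
// character appears twice (leading cell, then trailing cell), so the
// trailing copy is dropped.
std::u16string cellsToText(const std::vector<Cell> &row) {
    std::u16string text;
    for (const Cell &cell : row) {
        if ((cell.attr & kTrailing) != 0) continue;
        text.push_back(cell.ch);
    }
    return text;
}
```

Applied to the 16 cells in the table above, this yields the 12-character string that was written.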

Pattern 6

attr wchar (char) symbol
0x029 0x0051 (0x51) Q
0x129 0x0082 (0x82) Hiragana I Shift-JIS Codepage 932 Lead Byte
0x229 0x00A2 (0xA2) Hiragana I Shift-JIS Codepage 932 Trail Byte
0x129 0x0082 (0x82) Hiragana KA Shift-JIS Codepage 932 Lead Byte
0x229 0x00A9 (0xA9) Hiragana KA Shift-JIS Codepage 932 Trail Byte
0x129 0x0082 (0x82) Hiragana NA Shift-JIS Codepage 932 Lead Byte
0x229 0x00C8 (0xC8) Hiragana NA Shift-JIS Codepage 932 Trail Byte
0x029 0x005A (0x5A) Z
0x029 0x0059 (0x59) Y
0x029 0x0058 (0x58) X
0x029 0x0057 (0x57) W
0x029 0x0056 (0x56) V
0x029 0x0055 (0x55) U
0x029 0x0054 (0x54) T
0x129 0x0082 (0x82) Hiragana NI Shift-JIS Codepage 932 Lead Byte
0x229 0x00C9 (0xC9) Hiragana NI Shift-JIS Codepage 932 Trail Byte

Pattern 7

attr wchar (char) symbol
0x029 0x0051 (0x51) Q
0x129 0x3082 (0x82) Hiragana I Unicode 0x3044 with the lower byte covered by Shift-JIS Codepage 932 Lead Byte 0x82.
0x229 0xFFA2 (0xA2) Invalid Unicode Character 0xFFFF with the lower byte covered by Shift-JIS Codepage 932 Trail Byte 0xA2
0x129 0x3082 (0x82) Hiragana KA Unicode 0x304B with the lower byte covered by Shift-JIS Codepage 932 Lead Byte 0x82.
0x229 0xFFA9 (0xA9) Invalid Unicode Character 0xFFFF with the lower byte covered by Shift-JIS Codepage 932 Trail Byte 0xA9
0x129 0x3082 (0x82) Hiragana NA 0x306A with the lower byte covered by Shift-JIS Codepage 932 Lead Byte 0x82.
0x229 0xFFC8 (0xC8) Invalid Unicode Character 0xFFFF with the lower byte covered by Shift-JIS Codepage 932 Trail Byte 0xC8
0x029 0x005A (0x5A) Z
0x029 0x0059 (0x59) Y
0x029 0x0058 (0x58) X
0x029 0x0057 (0x57) W
0x029 0x0056 (0x56) V
0x007 0x0020 (0x20) <space>
0x007 0x0020 (0x20) <space>
0x007 0x0020 (0x20) <space>
0x007 0x0020 (0x20) <space>

Pattern 8

attr wchar (char) symbol
0x029 0x0051 (0x51) Q
0x129 0x3082 (0x82) Hiragana I Unicode 0x3044 with the lower byte covered by Shift-JIS Codepage 932 Lead Byte 0x82.
0x229 0xFFA2 (0xA2) Invalid Unicode Character 0xFFFF with the lower byte covered by Shift-JIS Codepage 932 Trail Byte 0xA2
0x129 0x3082 (0x82) Hiragana KA Unicode 0x304B with the lower byte covered by Shift-JIS Codepage 932 Lead Byte 0x82.
0x229 0xFFA9 (0xA9) Invalid Unicode Character 0xFFFF with the lower byte covered by Shift-JIS Codepage 932 Trail Byte 0xA9
0x129 0x3082 (0x82) Hiragana NA 0x306A with the lower byte covered by Shift-JIS Codepage 932 Lead Byte 0x82.
0x229 0xFFC8 (0xC8) Invalid Unicode Character 0xFFFF with the lower byte covered by Shift-JIS Codepage 932 Trail Byte 0xC8
0x029 0x005A (0x5A) Z
0x029 0x0059 (0x59) Y
0x029 0x0058 (0x58) X
0x029 0x0057 (0x57) W
0x029 0x0056 (0x56) V
0x029 0x0055 (0x55) U
0x029 0x0054 (0x54) T
0x129 0x3082 (0x82) Hiragana NI 0x306B with the lower byte covered by Shift-JIS Codepage 932 Lead Byte 0x82.
0x229 0xFFC9 (0xC9) Invalid Unicode Character 0xFFFF with the lower byte covered by Shift-JIS Codepage 932 Trail Byte 0xC9

Pattern 9

attr wchar (char) symbol
0x029 0x0051 (0x51) Q
0x129 0x3082 (0x82) Hiragana I Unicode 0x3044 with the lower byte covered by Shift-JIS Codepage 932 Lead Byte 0x82.
0x229 0x30A2 (0xA2) Hiragana I Unicode 0x3044 with the lower byte covered by Shift-JIS Codepage 932 Trail Byte 0xA2
0x129 0x3082 (0x82) Hiragana KA Unicode 0x304B with the lower byte covered by Shift-JIS Codepage 932 Lead Byte 0x82.
0x229 0x30A9 (0xA9) Hiragana KA Unicode 0x304B with the lower byte covered by Shift-JIS Codepage 932 Trail Byte 0xA9
0x129 0x3082 (0x82) Hiragana NA 0x306A with the lower byte covered by Shift-JIS Codepage 932 Lead Byte 0x82.
0x229 0x30C8 (0xC8) Hiragana NA 0x306A with the lower byte covered by Shift-JIS Codepage 932 Trail Byte 0xC8
0x029 0x005A (0x5A) Z
0x029 0x0059 (0x59) Y
0x029 0x0058 (0x58) X
0x029 0x0057 (0x57) W
0x029 0x0056 (0x56) V
0x029 0x0055 (0x55) U
0x029 0x0054 (0x54) T
0x129 0x3082 (0x82) Hiragana NI 0x306B with the lower byte covered by Shift-JIS Codepage 932 Lead Byte 0x82.
0x229 0x30C9 (0xC9) Hiragana NI 0x306B with the lower byte covered by Shift-JIS Codepage 932 Trail Byte 0xC9

Pattern 10

attr wchar (char) symbol
0x029 0x0051 (0x51) Q
0x129 0x3044 (0x44) Hiragana I
0x229 0x304B (0x4B) Hiragana KA
0x129 0x306A (0x6A) Hiragana NA
0x229 0x005A (0x5A) Z
0x129 0x0059 (0x59) Y
0x229 0x0058 (0x58) X
0x029 0x0057 (0x57) W
0x029 0x0056 (0x56) V
0x029 0x0055 (0x55) U
0x029 0x0054 (0x54) T
0x029 0x306B (0x6B) Hiragana NI
0x029 0x0000 (0x00) <null>
0x029 0x0000 (0x00) <null>
0x129 0x0000 (0x00) <null>
0x229 0x0000 (0x00) <null>

Pattern 11

attr wchar (char) symbol
0x029 0x0051 (0x51) Q
0x029 0x3044 (0x44) Hiragana I
0x029 0x304B (0x4B) Hiragana KA
0x029 0x306A (0x6A) Hiragana NA
0x029 0x005A (0x5A) Z
0x029 0x0059 (0x59) Y
0x029 0x0058 (0x58) X
0x029 0x0057 (0x57) W
0x029 0x0056 (0x56) V
0x029 0x0055 (0x55) U
0x029 0x0054 (0x54) T
0x029 0x306B (0x6B) Hiragana NI
0x007 0x0020 (0x20) <space>
0x007 0x0020 (0x20) <space>
0x007 0x0020 (0x20) <space>
0x007 0x0020 (0x20) <space>

Pattern 12

attr wchar (char) symbol
0x029 0x0051 (0x51) Q
0x129 0x3044 (0x44) Hiragana I
0x229 0x304B (0x4B) Hiragana KA
0x129 0x306A (0x6A) Hiragana NA
0x229 0x005A (0x5A) Z
0x129 0x0059 (0x59) Y
0x229 0x0058 (0x58) X
0x029 0x0057 (0x57) W
0x029 0x0056 (0x56) V
0x029 0x0020 (0x20) <space>
0x029 0x0020 (0x20) <space>
0x029 0x0020 (0x20) <space>
0x007 0x0020 (0x20) <space>
0x007 0x0000 (0x00) <null>
0x007 0x0000 (0x00) <null>
0x007 0x0000 (0x00) <null>

Pattern 13

attr wchar (char) symbol
0x029 0x0051 (0x51) Q
0x129 0x0082 (0x82) Hiragana I Shift-JIS Codepage 932 Lead Byte
0x229 0x00A2 (0xA2) Hiragana I Shift-JIS Codepage 932 Trail Byte
0x129 0x0082 (0x82) Hiragana KA Shift-JIS Codepage 932 Lead Byte
0x229 0x00A9 (0xA9) Hiragana KA Shift-JIS Codepage 932 Trail Byte
0x129 0x0082 (0x82) Hiragana NA Shift-JIS Codepage 932 Lead Byte
0x229 0x00C8 (0xC8) Hiragana NA Shift-JIS Codepage 932 Trail Byte
0x029 0x005A (0x5A) Z
0x029 0x0059 (0x59) Y
0x029 0x0058 (0x58) X
0x029 0x0057 (0x57) W
0x029 0x0056 (0x56) V
0x007 0x0020 (0x20) <space>
0x007 0x0020 (0x20) <space>
0x007 0x0020 (0x20) <space>
0x007 0x0020 (0x20) <space>

Pattern 14

attr wchar (char) symbol
0x029 0x0000 (0x00) <null>
0x029 0x0000 (0x00) <null>
0x029 0x0000 (0x00) <null>
0x029 0x0000 (0x00) <null>
0x029 0x0000 (0x00) <null>
0x029 0x0000 (0x00) <null>
0x029 0x0000 (0x00) <null>
0x029 0x0000 (0x00) <null>
0x029 0x0000 (0x00) <null>
0x029 0x0000 (0x00) <null>
0x029 0x0000 (0x00) <null>
0x029 0x0000 (0x00) <null>
0x007 0x0000 (0x00) <null>
0x007 0x0000 (0x00) <null>
0x007 0x0000 (0x00) <null>
0x007 0x0000 (0x00) <null>

miniksa commented 7 years ago

Just a quick note as confirmation: The original bug in the API discrepancy has been fixed in the last handful of Windows Insiders builds. This was MSFT: 10187355.

We then made a copy of it to port it back from the Windows Insider builds to the Creators Update as MSFT: 11721571. I've heard that the Creators Update KB fix has apparently just gone live in KB4020102 (OS build number 15063.332).

rprichard commented 7 years ago

@miniksa Thanks for your detailed reply, and for fixing the issue in Windows.

It wasn't completely clear to me, but I think in the above patterns, for WriteConsoleOutput*, the write region is always 1 row and has as many columns as the string length? I'm guessing the written attribute also never includes 0x100 or 0x200. So, for WriteConsoleOutputA, the write region would be of size 16x1, but for WriteConsoleOutputW, the size would be 12x1. Is that right?

winpty currently always reads using ReadConsoleOutputW, and it configures a TrueType font according to the terminal size and console code page (Lucida Console by default, MS Gothic (CP932), NSimSun (CP936), Gulim Che (CP949), or Ming Light (CP950)). So far, I've just assumed that applications won't configure the console font themselves. Given these constraints, it looks like pattern 5 is the one winpty should mostly care about.

I think winpty could accommodate a raster font by converting the ReadConsoleOutputW CHAR_INFO buffer from pattern 3 to 5. I'm imagining it could have a built-in table indicating which code points occupied two cells. IIRC, when I last looked at raster fonts, the console converted any WCHAR it didn't recognize into a question mark, so I'm not inclined to use raster fonts.
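
The pattern 3 to pattern 5 conversion could look roughly like this. It is a sketch only: `Cell` stands in for CHAR_INFO, `expandToTwoCellForm` is a hypothetical helper, and `occupiesTwoCells` is a placeholder for the built-in width table that covers nothing but the Hiragana block used in the test string.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Portable stand-in for the Win32 CHAR_INFO record (hypothetical type).
struct Cell {
    char16_t ch;
    uint16_t attr;
};

constexpr uint16_t kLeading = 0x0100;   // COMMON_LVB_LEADING_BYTE
constexpr uint16_t kTrailing = 0x0200;  // COMMON_LVB_TRAILING_BYTE

// Placeholder width table: treats only the Hiragana block as two-column.
bool occupiesTwoCells(char16_t ch) {
    return ch >= 0x3040 && ch <= 0x309F;
}

// Expands a packed row (one record per character, unused tail at the end)
// into the two-cells-per-wide-character layout. The output never grows
// past the input size, so the unused tail is what gets pushed out.
std::vector<Cell> expandToTwoCellForm(const std::vector<Cell> &packed) {
    std::vector<Cell> out;
    out.reserve(packed.size());
    for (const Cell &cell : packed) {
        if (out.size() >= packed.size()) break;
        if (occupiesTwoCells(cell.ch) && out.size() + 2 <= packed.size()) {
            out.push_back({cell.ch, static_cast<uint16_t>(cell.attr | kLeading)});
            out.push_back({cell.ch, static_cast<uint16_t>(cell.attr | kTrailing)});
        } else {
            // Narrow character, or a wide one with only one slot left.
            out.push_back(cell);
        }
    }
    return out;
}
```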

Patterns 1 and 2 seem like misuses of the console API. Pattern 1 is placing wide characters into single cells, which will tend to cause alignment problems (e.g. an 80 column line with 160 columns of text in it). If my assumption above about write regions is correct, then pattern 2 is providing extraneous CHAR_INFO data. The WriteConsoleOutputW call ought to specify a 16x1 region (not 12x1) and put four nulls at the end. i.e. The write buffer should be identical to the buffer in pattern 3.

The 0xFFFF invalid character in patterns 4 and 8 is weird; I'm guessing it's there for backwards compatibility? winpty doesn't currently handle it, but I guess it should. It currently expects the trailing and leading wide chars to be equal.
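
Handling the 0xFFFF case might amount to a small normalization pass that turns pattern 4 into pattern 5 before the rest of the code runs. A sketch under that assumption; `Cell` stands in for CHAR_INFO and `fillTrailingFromLeading` is a hypothetical name, not existing winpty code.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Portable stand-in for the Win32 CHAR_INFO record (hypothetical type).
struct Cell {
    char16_t ch;
    uint16_t attr;
};

constexpr uint16_t kLeading = 0x0100;   // COMMON_LVB_LEADING_BYTE
constexpr uint16_t kTrailing = 0x0200;  // COMMON_LVB_TRAILING_BYTE

// Where a trailing cell holds 0xFFFF and directly follows a leading cell,
// copy the leading cell's character into it, so that both halves of a
// two-column character carry the same code unit.
void fillTrailingFromLeading(std::vector<Cell> &row) {
    for (std::size_t i = 1; i < row.size(); ++i) {
        const bool prevIsLeading = (row[i - 1].attr & kLeading) != 0;
        const bool isTrailing = (row[i].attr & kTrailing) != 0;
        if (prevIsLeading && isTrailing && row[i].ch == 0xFFFF) {
            row[i].ch = row[i - 1].ch;
        }
    }
}
```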

FWIW, winpty also tries to recognize UTF-16 surrogate pairs, though I'm not sure that was a good idea in retrospect. I wonder what happens with the 65001 codepage combined with WriteConsoleA of a character outside the BMP.

If you want to document more hairiness, I'm sure there are edge cases involving I/O to a single cell of a two-cell character. I think ReadConsoleOutputW will read a space?

I just updated my 15063 VM tonight, and the KB4020102 update automatically installed, which bumped the winver up from 15063.296 to 15063.332. The output from winbug-15048.exe looks good now. I think I'm less inclined now to implement a workaround.

(Note to self: the patterns describe the fixed state, not the buggy state.)

miniksa commented 7 years ago

@rprichard Apologies. I should have specified more about how the test writes.

For all writes, we're starting with a cleared out buffer (all set to space characters 0x20 and the default background color 0x7.) We also always write to the 0,0 position in this test.

Then from there, WriteConsoleOutput* allocates a CHAR_INFO array that is the length of the string. For W-versions, this is the wcslen() length of the original string. For A-versions, this is the length of the result from calling WideCharToMultiByte on the original W-string. Each CHAR_INFO is filled with a character from the string (wide or narrow portion of the unions respectively) and the applied attribute 0x29 to change the color. The lead/trailing flags 0x100 and 0x200 are not set on write. This means that for the W-string we are writing 12 CHAR_INFOs and for the A-string we are writing 16 CHAR_INFOs. 12x1 and 16x1 as you stated for the write region are correct.
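
The 12 versus 16 count can be reproduced with a small calculation. This sketch does not call the real conversion API; it assumes that ASCII characters take one byte in codepage 932 and that every other character in the string takes two, which holds for the Hiragana in the test string but not for every character (half-width katakana, for instance, are single-byte). Both function names are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Number of CHAR_INFO records for the W write: one per UTF-16 code unit.
std::size_t wideRegionWidth(const std::u16string &text) {
    return text.size();
}

// Number of CHAR_INFO records for the A write, under the simplifying
// assumption that ASCII is one byte and everything else is two bytes.
std::size_t cp932RegionWidth(const std::u16string &text) {
    std::size_t bytes = 0;
    for (char16_t ch : text) {
        bytes += (ch < 0x80) ? 1 : 2;
    }
    return bytes;
}
```

For the string in the tables (Q, three Hiragana, ZYXWVUT, one Hiragana) this gives 12 and 16 respectively.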

For the CRT write tests, we clear the buffer the same way and set the cursor to 0,0 again. For writing W versions, we call _setmode(_fileno(stdout), _O_WTEXT) to make sure the CRT doesn't try to be helpful and convert our text then use putwchar to put each character in a loop. For the A version, we call _setmode(_fileno(stdout), _O_TEXT) and use putchar. By default, the CRT is in _O_TEXT mode and will convert anything you write with putwchar or wprintf on your behalf back into A text before sending to the console, so setting the mode is important to maintain the integrity of the bytes being emitted.

For the WriteConsole and WriteConsoleOutputCharacter* writes, we set the target to 0,0 and pass the entire string, either the 12-length W-string or the 16-length A-string (post WC2MB conversion). We don't try to set the colors in these modes during the test patterns.


Then regarding raster fonts, you are correct. If the character doesn't exist in the currently selected raster font, it will generally convert it into the default character ?. Raster fonts typically only have a very small subset of characters represented, so there will be lots of ?. For TrueType fonts, the character is typically maintained in the buffer without respect to whether the font can actually draw it.


One thing to note that I think you may have misunderstood: all of these patterns are what you will see when attempting to read back 16 characters no matter what write mechanism was used. The write mechanisms vary as specified above. But reading back 16 items from the Read APIs will result in these patterns. To that end, Patterns 1 and 2 are representing a Read back, not what was specified on Write. The write buffer did indeed look like Pattern 3 (but 12 long instead of 16) for writing the W version of the text with the WriteConsoleOutput API, but the write wasn't what I was intending to describe/convey with the patterns.


The 0xFFFF invalid character in patterns 4 and 8 is actually a bug that leaked out through the API and now is maintained for compatibility. The origin of it is that the console historically used like 3 different independent mechanisms internally to recognize the column width of any character that organically developed over time. Each developer came in and added their own without context and so it went. The internal console buffer was always stored in the codepage that was going to be used for display. In one of these forms, one of the developers decided to put 0xFFFF in the trail to know that it would take 2 columns/bytes (treated interchangeably even though that's not strictly true). A different developer used the 0x100 and 0x200 flags (also as an interchangeable metric of column width and bytes...). And so on.

This sort of organic development without context over time is also how we ended up with some patterns like 8 and 9 with an A byte stomped on top of a W character in the CHAR_INFO structure on read...

In the last few years, I rearranged this so the buffer internally is always stored as Unicode text and it is translated as necessary on the way in/out through the APIs and when being given to GDI. It also always uses the 0x100 and 0x200 flags to know the column width and uses MB2WC or WC2MB whenever it needs the byte count. This makes for a lot less internal code complexity to figure out what form the buffer is currently stored in. However, the compatibility police came after me and said I had to maintain the API surface, so there's a function to re-munge the trailing byte to an 0xFFFF on the way out when certain states exist. So you'll have to expect/deal with that. :(


The Windows console doesn't support UTF-16 surrogate pairs right now. I want to do that in the future, but it's more accurate to say we support UCS-2 than to say we support UTF-16.
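
For reference, the difference between UCS-2 and UTF-16 comes down to surrogate pairs: code units in the ranges 0xD800 to 0xDBFF and 0xDC00 to 0xDFFF that combine into one code point above U+FFFF. A small sketch of detecting and combining them (the helper names are illustrative, not console or winpty APIs):

```cpp
#include <cassert>
#include <string>

bool isHighSurrogate(char16_t unit) { return unit >= 0xD800 && unit <= 0xDBFF; }
bool isLowSurrogate(char16_t unit) { return unit >= 0xDC00 && unit <= 0xDFFF; }

// Combines a high/low surrogate pair into the code point it encodes.
char32_t combineSurrogates(char16_t high, char16_t low) {
    assert(isHighSurrogate(high) && isLowSurrogate(low));
    return 0x10000 + ((static_cast<char32_t>(high) - 0xD800) << 10) +
           (static_cast<char32_t>(low) - 0xDC00);
}

// True when the text contains no surrogates, i.e. it is representable
// as plain UCS-2.
bool fitsInUcs2(const std::u16string &text) {
    for (char16_t unit : text) {
        if (isHighSurrogate(unit) || isLowSurrogate(unit)) return false;
    }
    return true;
}
```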

UTF-8 (codepage 65001 on the A APIs) is also not officially supported. It works sometimes and on some of the APIs, but it's not complete and there are gaps/holes. I would expect it to work in strange and interesting ways.

Officially, the console supports 2 byte UCS-2 through the W versions of the API. On the A API, we support code pages that are 1 byte for the "Western" world and we support 4 specific 2 byte codepages for "CJK" regions: 932, 936, 949, and 950. Anything else was never officially implemented or supported and your mileage may vary significantly.


Do you have a specific example of what you mean by your single cell I/O of a two-cell character? I can log a bug/task internally to investigate, test, and further document that. This is basically all happening on-demand as we discover scenarios/problems. So if you have a specific scenario/problem, please let me know!


OK cool. If you don't need to implement a workaround, even better. Hopefully this still provides some good insight into what's happening and why and how for future reference.

Thanks for your patience and cooperation! --Michael

rprichard commented 7 years ago

@miniksa Sorry, I forgot about this.

> Do you have a specific example of what you mean by your single cell I/O of a two-cell character? I can log a bug/task internally to investigate, test, and further document that. This is basically all happening on-demand as we discover scenarios/problems. So if you have a specific scenario/problem, please let me know!

I was thinking of things like:

  1. Write a two-column character, then use ReadConsoleOutputW to read only the first or second column (but not both).
  2. Overwrite only the first or last column of a two-column character already on the screen (using any of WriteConsole[AW], WriteConsoleOutput[AW], WriteFile, etc.).
  3. Write a two-column character on the last column of the console.

IIRC, in scenario 1, the console will pretend to read a space (U+0020) character. In scenario 2, I think it will replace the other half of the two-column character(s) with a space. I wouldn't be surprised if the answer is more complicated. :-P

For scenario 3, I'm guessing the last column is replaced with a space, and the character is wrapped around to the next line? What if the screen buffer is only 1 column wide (which also implies a gigantic font)?

For synchronization purposes, winpty issues a single ReadConsoleOutputW call for the entire first column of the buffer, and if it issues a read for N lines, it expects to find exactly N CHAR_INFO records in its buffer. It only cares about lines that have a "sync marker" in them. Otherwise, winpty always reads (or clears) whole lines at once.