microsoft / terminal

The new Windows Terminal and the original Windows console host, all in the same place!
MIT License
94.33k stars 8.17k forks source link

`more` doesn't properly support upper-range BMP / 3B UTF-8 chars #8974

Open trollcop opened 3 years ago

trollcop commented 3 years ago

Environment

Win32NT             10.0.19042.0 Microsoft Windows NT 10.0.19042.0
Windows Terminal version: latest Preview from store

Any other software?
TSDuck (console application)

Steps to reproduce

1) download latest tsduck nightly from https://tsduck.io/download/prerelease/ 2) download eit.zip from https://github.com/tsduck/tsduck/issues/720 (at the top of report) and extract eit.ts 3) extract tsduck and run bin\tstables.exe --japan eit.ts 4) The output looks like this: image

Expected behavior

After repeating 1-3 above, the output should look like the third image here https://github.com/tsduck/tsduck/issues/720#issuecomment-770239722 (2 lines of Japanese text, "Event name:" and "Description:" followed by Japanese text.

      Event name: "日本共産党自主放送"
      Description: "色は匂へど散りぬるを我が世誰ぞ常ならん有為の奥山今日越えて浅き夢見じ酔ひもせず"

Actual behavior

Entire "Description: " line is NOT displayed, like it simply doesn't exist. Note, the same problem occurs in cmd.exe and powershell command prompts (using native win32 console), so I'm not actually sure what causes the problem, but according to tsduck dev it is windows/terminal/console fault.

Additionally, if using console redirection, such as bin\tstables.exe --japan eit.ts > file.txt from cmd.exe, both Japanese lines are present in file. So they're definitely being "printed" just not displayed.

skyline75489 commented 3 years ago

This is easily reproducible with PowerShell. The tstables.exe seems to be piping the text into more, which turns out to be not working properly. I cannot navigate like using a normal more.

What's interesting is that the issue does NOT exist on Ubuntu(WSL). Everything just works.

zadjii-msft commented 3 years ago

God, I'm not in love with that maintainer's attitude. "if you want to avoid problems when running technical tools (ie. non-Office desktop applications), avoid Windows". uhg. Let's see what we can do to change their mind.

It sure does sound like there's some sort of bug here in more. That makes sense to me - more was written before I was born, and I'm pretty sure it hasn't been updated since. I'd love to see a minimal repro with more that repros this kind of an issue.

I'm worried that at the end of the day, this bug has existed in more since 1985. If that's the case, then we probably can't fix this in more itself, and it's unlikely we could fix it in conhost. But there's not really another pager on Windows, so hmmm....🤔

trollcop commented 3 years ago

I'd love to see a minimal repro with more that repros this kind of an issue.

Well, I am not sure how that would work. I piped the output to a file, here: eit.txt

But doing Get-Content -Encoding UTF8 .\eit.txt | more actually just prints "?????"'s (yes, even with chcp 65001) which means somehow during the piping the encoding information is lost?...

eryksun commented 3 years ago

It sure does sound like there's some sort of bug here in more.

"more.com" is part of a set of utilities that are built around "ulib.dll" (file utilities support DLL). Probably the C++ implementation hasn't changed much since the initial NT 3.1 implementation in the early 1990s.

The bug is in ulib's WSTRING::QueryByteCount method. In order to get the byte count, it converts to 'OEM' (actually the console input code page) via WSTRING::ConvertUnicodeToOemN. When allocating the scratch buffer for the WideCharToMultiByte call, it assumes that the encoding uses no more than 2 bytes per WCHAR element in the wide-character string. That used to be a valid assumption, but Windows 10 allows OEM to be UTF-8 (system wide, or per application), which requires up to 3 bytes per WCHAR . (A non-BMP character is encoded with 4 bytes in UTF-8, but that's 2 bytes per WCHAR in a UTF-16 surrogate pair.) Moreover, it turns out that WSTRING::ConvertUnicodeToOemN isn't actually using the OEM code page of the current process, but instead calls ulib's ConvertToOemWithConsoleCP, which uses the console's input code page (that's an unrelated bug, i.e. assuming the input and output code page are the same). If the assumed maximum buffer size is too small for the UTF-8 conversion, then QueryByteCount fails and returns (ULONG)-1.

In this case, more's PAGER::DisplayPage method just skips the line. This is why the "Description" line gets skipped. To test this, you just need a file that contains a line with an upper-range BMP character that encodes as 3 bytes in UTF-8. To show it as an obviously skipped line, insert it between two lines of ASCII text, e.g. "spam\n色\neggs". Then display the file with "more.com", after first changing the console's input and output code pages to UTF-8 via chcp.com 65001.

Other than the obvious bug with QueryByteCount, in terms of design, I have to wonder why DisplayPage is using the encoded byte count to determine the line length, presumably for wrapping long lines. I would expect it to use WSTRING::QueryChCount to get the line length in characters. That's at least valid for simple scripts in the BMP.