Open trollcop opened 3 years ago
This is easily reproducible with PowerShell. The tstables.exe
seems to be piping the text into more
, which turns out to be not working properly. I cannot navigate like using a normal more
.
What's interesting is that the issue does NOT exist on Ubuntu(WSL). Everything just works.
God, I'm not in love with that maintainer's attitude. "if you want to avoid problems when running technical tools (ie. non-Office desktop applications), avoid Windows". uhg. Let's see what we can do to change their mind.
It sure does sound like there's some sort of bug here in more
. That makes sense to me - more
was written before I was born, and I'm pretty sure it hasn't been updated since. I'd love to see a minimal repro with more
that repros this kind of an issue.
I'm worried that at the end of the day, this bug has existed in more
since 1985. If that's the case, then we probably can't fix this in more
itself, and it's unlikely we could fix it in conhost
. But there's not really another pager on Windows, so hmmm....🤔
I'd love to see a minimal repro with
more
that repros this kind of an issue.
Well, I am not sure how that would work. I piped the output to a file, here: eit.txt
But doing Get-Content -Encoding UTF8 .\eit.txt | more
actually just prints "?????"'s (yes, even with chcp 65001
) which means somehow during the piping the encoding information is lost?...
It sure does sound like there's some sort of bug here in
more
.
"more.com" is part of a set of utilities that are built around "ulib.dll" (file utilities support DLL). Probably the C++ implementation hasn't changed much since the initial NT 3.1 implementation in the early 1990s.
The bug is in ulib's WSTRING::QueryByteCount
method. In order to get the byte count, it converts to 'OEM' (actually the console input code page) via WSTRING::ConvertUnicodeToOemN
. When allocating the scratch buffer for the WideCharToMultiByte
call, it assumes that the encoding uses no more than 2 bytes per WCHAR
element in the wide-character string. That used to be a valid assumption, but Windows 10 allows OEM to be UTF-8 (system wide, or per application), which requires up to 3 bytes per WCHAR
. (A non-BMP character is encoded with 4 bytes in UTF-8, but that's 2 bytes per WCHAR
in a UTF-16 surrogate pair.) Moreover, it turns out that WSTRING::ConvertUnicodeToOemN
isn't actually using the OEM code page of the current process, but instead calls ulib's ConvertToOemWithConsoleCP
, which uses the console's input code page (that's an unrelated bug, i.e. assuming the input and output code page are the same). If the assumed maximum buffer size is too small for the UTF-8 conversion, then QueryByteCount
fails and returns (ULONG)-1
.
In this case, more's PAGER::DisplayPage
method just skips the line. This is why the "Description" line gets skipped. To test this, you just need a file that contains a line with an upper-range BMP character that encodes as 3 bytes in UTF-8. To show it as an obviously skipped line, insert it between two lines of ASCII text, e.g. "spam\n色\neggs"
. Then display the file with "more.com", after first changing the console's input and output code pages to UTF-8 via chcp.com 65001
.
Other than the obvious bug with QueryByteCount
, in terms of design, I have to wonder why DisplayPage
is using the encoded byte count to determine the line length, presumably for wrapping long lines. I would expect it to use WSTRING::QueryChCount
to get the line length in characters. That's at least valid for simple scripts in the BMP.
Environment
Steps to reproduce
1) download latest tsduck nightly from https://tsduck.io/download/prerelease/ 2) download![image](https://user-images.githubusercontent.com/4640472/106371312-55cab100-63a6-11eb-82eb-3ecaade7efd5.png)
eit.zip
from https://github.com/tsduck/tsduck/issues/720 (at the top of report) and extracteit.ts
3) extract tsduck and runbin\tstables.exe --japan eit.ts
4) The output looks like this:Expected behavior
After repeating 1-3 above, the output should look like the third image here https://github.com/tsduck/tsduck/issues/720#issuecomment-770239722 (2 lines of Japanese text, "Event name:" and "Description:" followed by Japanese text.
Actual behavior
Entire "Description: " line is NOT displayed, like it simply doesn't exist. Note, the same problem occurs in cmd.exe and powershell command prompts (using native win32 console), so I'm not actually sure what causes the problem, but according to tsduck dev it is windows/terminal/console fault.
Additionally, if using console redirection, such as
bin\tstables.exe --japan eit.ts > file.txt
from cmd.exe, both Japanese lines are present in file. So they're definitely being "printed" just not displayed.