microsoft / terminal

The new Windows Terminal and the original Windows console host, all in the same place!
MIT License

WriteConsoleOutputCharacterA doesn't merge UTF-8 partials in successive calls #1851

Open german-one opened 5 years ago

german-one commented 5 years ago

Environment

Windows build number: [run "ver" at a command prompt]
Microsoft Windows [Version 10.0.18362.207]

Steps to reproduce

If a UTF-8 stream gets buffered in a loop, characters that consume more than one byte may get split at the buffer boundaries. Passing such a buffer to WriteConsoleOutputCharacterA corrupts the text because the internal conversion to UTF-16 treats these partials as invalid UTF-8 sequences and replaces them with U+FFFD characters.

Expected behavior

WriteConsoleOutputCharacterA should cache the partials and prepend them to the characters passed at the next call of this function, similar to the behavior of WriteConsoleA.
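The requested caching behavior can be sketched with Python's incremental UTF-8 decoder, which keeps a trailing partial in internal state between calls (a minimal illustration of the idea, not the conhost code):

```python
import codecs

# U+263A WHITE SMILING FACE encoded as three UTF-8 bytes.
smiley = "\u263a".encode("utf-8")          # b'\xe2\x98\xba'
chunk1, chunk2 = smiley[:1], smiley[1:]    # split at a buffer boundary

# Naive per-call conversion (what the issue describes): each chunk is
# treated as a complete string, so the partials become U+FFFD.
naive = (chunk1.decode("utf-8", errors="replace")
         + chunk2.decode("utf-8", errors="replace"))
print(naive)   # '\ufffd\ufffd\ufffd' -- three error glyphs

# Requested behavior: cache the partial and prepend it to the next call.
dec = codecs.getincrementaldecoder("utf-8")()
cached = dec.decode(chunk1) + dec.decode(chunk2)
print(cached)  # '\u263a'
```

The naive per-call conversion is what produces the error glyphs; the incremental decoder holds back the lone lead byte and completes the character on the next call, which is the behavior WriteConsoleA already has.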

Actual behavior

UTF-8 partials result in corrupted text. A discussion about this already began in #386 but was out of scope there.

german-one commented 5 years ago

@miniksa
This is continuing #386.

I was able to include the Utf8ToWideCharParser class in ApiRoutines::WriteConsoleOutputCharacterAImpl. This resolves the problem of merging UTF-8 partials but raises another question: I'm struggling with the used parameter, and I won't start a PR unless we have the same understanding of what number has to be assigned to it.

My current understanding is that it is the number of bytes consumed from chars. But does it include the number of bytes of the partials (either from the previous or from the current cache)? And what is the benefit of this information to the caller of the function? In my tests I tried to calculate the coordinates of the end of the previously written chunk in order to continue writing at that position. This doesn't work at all if I use the number of bytes written; what I would need instead is the number of character cells written. Would it be reasonable to use the number of characters created that I get from Utf8ToWideCharParser::Parse (likewise the characters written by WriteConsoleOutputCharacterWImpl)?
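To illustrate the distinction, "bytes consumed" and "cells written" diverge as soon as the text contains multi-byte characters. A hypothetical Python sketch (assuming one BMP code point per cell and ignoring wide glyphs):

```python
text = "abc\u263a"                    # three ASCII characters plus U+263A
utf8_bytes = text.encode("utf-8")     # what an A-side caller passes in

# Bytes consumed from the input buffer vs. UTF-16 code units produced
# (the latter is what the W-side implementation would count as written).
consumed = len(utf8_bytes)                   # 6 bytes
cells = len(text.encode("utf-16-le")) // 2   # 4 UTF-16 code units / cells

print(consumed, cells)  # 6 4
```

So a caller that wants to continue writing after the previous chunk needs the second number, not the first.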

j4james commented 4 years ago

The discussion in #386 seemed to suggest that this problem had been around since Windows 7, but what I'm seeing definitely looks like a recent regression, so I'm not sure if it's the same issue. In my current Windows version (10.0.18362.535), the following sequences both output a smiling face when executed in a bash shell.

printf "\xE2\x98\xBA\n"
printf "\xE2"; printf "\x98\xBA\n"

However, when I start a conhost shell built from the latest source (commit 39d3c6542006fa3cd45d85fc00e9d546a1398625), the second sequence fails to decode the UTF-8 correctly, and three error glyphs are output instead.

Screenshots of the old and new consoles...


german-one commented 4 years ago

I'm not sure what functions are invoked under the hood when you call printf in WSL, although I'm pretty sure the behavior shouldn't have changed if it was WriteConsoleOutputCharacterAImpl. The ConvertToW function is called there, which converts the passed string to UTF-16 without caching UTF-8 partials. That's the reason why I originally filed this issue.

However, issue #4086 has been fixed with PR #4422, which indicates that WriteConsoleAImpl is involved. I'm not sure if I did something wrong when I updated this function.

j4james commented 4 years ago

I'm afraid it does look like PR #4422 is to blame, at least for my particular test case. It works in commit 0d92f71e45c97827f4219d69a1ea48811f9d70dd, but fails after #4422 is applied in commit 06b393141864ae52b41aefa560918aa2e1aadb85.

It seems that the old Utf8ToWideCharParser::Parse method does actually detect \xE2 as a partial (and then successfully combines it with the following two bytes when they later arrive), but the new u8u16state code doesn't. The utf8 conversion method just returns the string as if it were a complete set of utf8 code points, leaving _partialsLen as zero. It's then passed on to the MultiByteToWideChar API where it gets converted into an error glyph.
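For reference, detecting a trailing partial (what _partialsLen would end up counting) amounts to scanning the tail of the buffer for an incomplete sequence. A hypothetical Python sketch of that check, not the actual conhost logic:

```python
def trailing_partial_len(buf: bytes) -> int:
    """Return how many bytes at the end of buf form an incomplete
    UTF-8 sequence (0 if the buffer ends on a character boundary)."""
    # Walk back over up to three continuation bytes (10xxxxxx).
    i = len(buf)
    for _ in range(3):
        if i == 0 or (buf[i - 1] & 0xC0) != 0x80:
            break
        i -= 1
    if i == 0:
        return 0
    lead = buf[i - 1]
    if lead < 0x80:
        return 0          # ASCII byte: the buffer ends on a boundary
    if lead >= 0xF0:
        expected = 4      # lead byte of a 4-byte sequence
    elif lead >= 0xE0:
        expected = 3      # lead byte of a 3-byte sequence
    elif lead >= 0xC0:
        expected = 2      # lead byte of a 2-byte sequence
    else:
        return 0          # stray continuation byte, not a partial

    got = len(buf) - (i - 1)
    return got if got < expected else 0

print(trailing_partial_len(b"abc\xe2"))            # 1 (lead byte only)
print(trailing_partial_len(b"abc\xe2\x98"))        # 2
print(trailing_partial_len("abc\u263a".encode()))  # 0 (complete)
```

A decoder that computes this length can hold the partial back, convert the rest, and prepend the cached bytes on the next call, which is what the old parser did for the \xE2 test case above.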

german-one commented 4 years ago

I updated the unit test with these sequences and it failed, so it's definitely reproducible. I'll file an issue referring to your comment here, and I'll fix it as soon as I can. Thanks for letting me know!

german-one commented 2 years ago

@zadjii-msft I'm struggling to close this myself because it is an accepted issue and has been added to a milestone. However, in the meantime I'm not sure about it anymore.

WriteConsoleOutputCharacterA is no longer part of the ecosystem roadmap, as noted in the docs. Also, WriteConsoleOutputCharacterA does not, and never did, respect the character boundaries of DBCS encodings.

The docs also read

"Copies a number of characters to consecutive cells of a console screen buffer, ..."

  • What is the meaning of "characters" here? (Is it actually "the representation of code points"? Or is it rather "values of a C/C++ character type"?)
  • What separates WriteConsoleOutputCharacterA from WriteConsoleA? (There might be a reason why they behave differently.)
  • How inconsistent is it to fix UTF-8 handling without fixing DBCS handling?
  • Do we break code if we try to fix it?

Consider using the answers to these questions as reasons to close this one out.

Steffen

miniksa commented 2 years ago
  • What is the meaning of "characters" here? (Is it actually "the representation of code points"? Or is it rather "values of a C/C++ character type"?)

In the context of WriteConsoleOutputCharacterA, it means one char into one cell of the buffer. Does it make wonderful sense? No. But that is the way it was when we inherited it.

  • What separates WriteConsoleOutputCharacterA from WriteConsoleA? (There might be a reason why they behave differently.)

WriteConsoleOutputCharacterA is just going to modify the character without modifying the color attributes in the cell. It will insert pretty much exactly what you give it into the cell you tell it to. WriteConsoleA will run normal character handling against it as if it were otherwise streamed in, including processing backspaces, newlines, etc., and applying the active/default color attributes to whatever you're inserting.

  • How inconsistent is it to fix UTF-8 handling without fixing DBCS handling?

Honestly, it's probably fine to fix the UTF-8 handling as a separate path. We did that in a few places... there are three states: W, A, and A when 65001 (UTF-8) is set. That way we can maintain the nice and broken A state for compatibility reasons and fix up the UTF-8 path that we actually care about independently.

  • Do we break code if we try to fix it?

Oh probably.

  • Consider using the answers to these questions as reasons to close this one out.

I'll let it hang around for now, but don't feel obligated to further bump it. If it's on 22H2 we're going to keep looking at it ourselves.