Open german-one opened 5 years ago
@miniksa
This is continuing #386.
I was able to include the Utf8ToWideCharParser
class into ApiRoutines::WriteConsoleOutputCharacterAImpl
. This resolves the problem of merging UTF-8 partials but raises another issue. I'm struggling with the used
parameter. I won't start a PR unless we have the same understanding of what number has to be assigned to it. My current understanding is that it is the number of bytes consumed from chars
. But does it include the number of bytes of the partials (either from the previous or from the current cache)? And what's the benefit of getting this information as the caller of this function? In my tests I tried to calculate the coordinates of the end of the previously written chunk in order to continue writing at this position. This doesn't work at all if I try to use the bytes written. I would rather need the number of character cells written. Would it be reasonable to use the number of characters created that I get from Utf8ToWideCharParser::Parse
(likewise the characters written by WriteConsoleOutputCharacterWImpl
)?
The discussion in #386 seemed to suggest that this problem had been around since Windows 7, but what I'm seeing definitely looks like a recent regression, so I'm not sure if it's the same issue. In my current Windows version (10.0.18362.535), the following sequences both output a smiling face when executed in a bash shell.
printf "\xE2\x98\xBA\n"
printf "\xE2"; printf "\x98\xBA\n"
However, when I start a conhost shell built from the latest source (commit 39d3c6542006fa3cd45d85fc00e9d546a1398625), the second sequence fails to decode the UTF-8 correctly, and three error glyphs are output instead.
Screenshots of the old and new consoles...
Not sure what functions are invoked under the hood if you call printf
in WSL. Although I'm pretty sure that the behavior shouldn't have changed if it was WriteConsoleOutputCharacterAImpl
. The ConvertToW
function is called there which converts the passed string to UTF-16 without caching UTF-8 partials. That's the reason why I once filed this issue.
However, issue #4086 has been fixed with PR #4422 which indicates that WriteConsoleAImpl
is involved. Not sure if I did something wrong when I updated this function.
I'm afraid it does look like PR #4422 is to blame, at least for my particular test case. It works in commit 0d92f71e45c97827f4219d69a1ea48811f9d70dd, but fails after #4422 is applied in commit 06b393141864ae52b41aefa560918aa2e1aadb85.
It seems that the old Utf8ToWideCharParser::Parse
method does actually detect \xE2
as a partial (and then successfully combines it with the following two bytes when they later arrive), but the new u8u16state
code doesn't. The utf8 conversion method just returns the string as if it were a complete set of utf8 code points, leaving _partialsLen
as zero. It's then passed on to the MultiByteToWideChar
API where it gets converted into an error glyph.
I updated the unit test with this sequences and it failed. So it's definitely reproduceable. I'll file an issue referring to your comment here. And I'll fix it as soon as I can. Thanks for letting me know!
@zadjii-msft I'm struggling to close this myself because it is an accepted issues and added to a milestone. However, meanwhile I'm not sure about it anymore.
WriteConsoleOutputCharacterA
is no longer a part of the ecosystem roadmap, as remarked in the docs.
WriteConsoleOutputCharacterA
does not and did never respect character boundaries of DBCSs.
The docs also read
"Copies a number of characters to consecutive cells of a console screen buffer, ..."
WriteConsoleOutputCharacterA
from WriteConsoleA
? (There might be a reason why they behave differently.)Consider to use the answers to these questions as reasons to close this one out.
Steffen
- What is the meaning of "characters" here? (Is it actually "the representation of code points"? Or is it rather "values of a C/C++ character type"?)
In the context of WriteConsoleOutputCharacterA
, it means one char
into one cell of the buffer. Does it make wonderful sense? No. But it is the way that it is when we inherited it.
- What separates
WriteConsoleOutputCharacterA
fromWriteConsoleA
? (There might be a reason why they behave differently.)
WriteConsoleOutputCharacterA
is just going to modify the character without modifying the color attributes in the cell. It will insert pretty much exactly what you give it exactly into the cell you tell it to.
WriteConsoleA
will run normal character handling against it as if it were otherwise streamed in including often processing backspaces, newlines, etc. and applying the active/default color attributes to whatever you're inserting.
- How inconsistent is it to fix UTF-8 handling without fixing DBCS handling?
It's probably fine honestly to fix the UTF-8 handling as a separate path. We did that in a few places... there's 3 states... W, A, and A when 65001 (UTF-8) is set. That way we can maintain the nice and broken A state for compatibility reasons and fix up the UTF-8 that we actually care about independently.
- Do we break code if we try to fix it?
Oh probably.
Consider to use the answers to these questions as reasons to close this one out.
I'll let it hang around for now, but don't feel obligated to further bump it. If it's on 22H2 we're going to keep looking at it ourselves.
Environment
Steps to reproduce
If a UTF-8 stream gets buffered in a loop, characters that consume more than one byte may get split at the buffer boundaries. Passing the buffer to
WriteConsoleOutputCharacterA
will corrupt the text because a conversion to UTF-16 is in place where these partials are treated as invalid UTF-8 characters and replaced with with U+FFFD characters.Expected behavior
WriteConsoleOutputCharacterA
should cache the partials and prepend them to the characters passed at the next call of this function, similar to the behavior ofWriteConsoleA
.Actual behavior
UTF-8 partials result in corrupted text. A discussion about this already began in #386 but was rather out of scope in this issue.