Open gpakosz opened 6 years ago
@gpakosz, thank you. Do you have a proposal to consider?
Well, we store UTF-8 strings in std::string, because we do everything in UTF-8 and there's no better representation in C++.
Iterating over byte code units and asking IsPrintableAscii() for each one doesn't make sense.
But since we're not gtest architects and didn't find a clean way to improve the situation, we just hacked our copy of gtest-printers.cc to always "print as is":
template <typename UnsignedChar, typename Char>
static CharFormat PrintAsCharLiteralTo(Char c, ostream* os) {
  switch (static_cast<wchar_t>(c)) {
    case L'\0':
      *os << "\\0";
      break;
    case L'\'':
      *os << "\\'";
      break;
    case L'\\':
      *os << "\\\\";
      break;
    case L'\a':
      *os << "\\a";
      break;
    case L'\b':
      *os << "\\b";
      break;
    case L'\f':
      *os << "\\f";
      break;
    case L'\n':
      *os << "\\n";
      break;
    case L'\r':
      *os << "\\r";
      break;
    case L'\t':
      *os << "\\t";
      break;
    case L'\v':
      *os << "\\v";
      break;
    default:
      *os << static_cast<char>(c);
      return kAsIs;
  }
  return kSpecialEscape;
}
@gpakosz: It sounds like you're requesting either something like a PrintUtf8StringTo function or the ability to tell googletest to treat all strings as though they're UTF-8, and I'm not quite sure which. In an ideal world, what would you like to see googletest provide?
If you happen to have some simple example code that can illustrate what you'd like to see work, that might clear things up very quickly.
@mbxx The concrete case is EXPECT_EQ(s1, s2), where s1 and s2 are two std::string objects holding strings encoded in UTF-8.
I got directed to this issue from a tweet (https://twitter.com/gpakosz/status/1062632944576184320).
In general, there is no portable way to directly write UTF-8 to a terminal/console. On POSIX systems, writing the code units will work if the LC_CTYPE (or LC_ALL or LANG) environment variable is set to select a UTF-8-enabled locale, the terminal emulator is configured to expect UTF-8 encoded text, and an appropriate font is selected. On Windows, there is no general solution right now, though Microsoft is working on one (https://blogs.msdn.microsoft.com/commandline/2018/11/15/windows-command-line-unicode-and-utf-8-output-text-buffer/).
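To make the POSIX part concrete, here is a minimal sketch of how a program might decide whether the environment selects a UTF-8 locale. The function names are hypothetical and the codeset check is only a name-based heuristic; it says nothing about the terminal emulator or font. The precedence used (LC_ALL over LC_CTYPE over LANG) is the POSIX one.

```cpp
#include <cctype>
#include <cstddef>
#include <initializer_list>
#include <string>

// Hypothetical helper: picks the locale name that governs character encoding,
// using the POSIX precedence LC_ALL > LC_CTYPE > LANG.
// Null or empty values are treated as unset; "C" is the fallback.
std::string EffectiveCtypeLocale(const char* lc_all, const char* lc_ctype,
                                 const char* lang) {
  for (const char* value : {lc_all, lc_ctype, lang}) {
    if (value != nullptr && *value != '\0') return value;
  }
  return "C";
}

// Hypothetical helper: true if the codeset part of a locale name such as
// "en_US.UTF-8" or "zh_CN.utf8@modifier" spells UTF-8.
bool LocaleNameSelectsUtf8(const std::string& name) {
  const std::size_t dot = name.find('.');
  if (dot == std::string::npos) return false;  // no codeset given
  const std::size_t at = name.find('@', dot);
  const std::string codeset =
      name.substr(dot + 1, at == std::string::npos ? at : at - dot - 1);
  // Normalize "UTF-8", "utf8", "Utf_8" and similar spellings to "utf8".
  std::string normalized;
  for (unsigned char c : codeset) {
    if (c != '-' && c != '_') normalized += static_cast<char>(std::tolower(c));
  }
  return normalized == "utf8";
}
```

A caller would pass std::getenv("LC_ALL"), std::getenv("LC_CTYPE") and std::getenv("LANG"). Calling setlocale(LC_ALL, "") and then querying nl_langinfo(CODESET) is likely the more robust check on POSIX, since it asks the C library rather than parsing names.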
For now, the best approach is to honor the user's locale settings and transcode UTF-8 text to the execution encoding prior to streaming it to the terminal/console. Doing otherwise effectively results in mojibake since the terminal/console is not expecting UTF-8 unless specifically configured for it.
Unfortunately, C and C++ make conversion from UTF-8 painful at present. Sticking to standard-provided interfaces means using codecvt<char16_t, char, mbstate_t> to convert from UTF-8 to UTF-16 and then using c16rtomb (with an implementation that implements C11's DR488) to convert from UTF-16 to the execution encoding.
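The two-step conversion described above can be sketched roughly as follows. The function and struct names are hypothetical, error handling is reduced to a bool, and the second step depends on the current C locale, so the caller is assumed to have called setlocale beforehand. Note that this codecvt specialization is deprecated as of C++20, so compilers may warn.

```cpp
#include <climits>
#include <cstddef>
#include <cuchar>
#include <cwchar>
#include <locale>
#include <string>

// std::codecvt has a protected destructor; deriving from it makes the facet
// usable as a plain local object.
struct Utf8Utf16Facet : std::codecvt<char16_t, char, std::mbstate_t> {};

// Step 1 (hypothetical helper): UTF-8 -> UTF-16.
// Returns false if the input is malformed or truncated.
bool Utf8ToUtf16(const std::string& in, std::u16string* out) {
  Utf8Utf16Facet facet;
  std::mbstate_t state{};
  // UTF-16 never needs more code units than the UTF-8 input has bytes.
  std::u16string buf(in.size(), u'\0');
  const char* from_next = in.data();
  char16_t* to_next = &buf[0];
  const std::codecvt_base::result result =
      facet.in(state, in.data(), in.data() + in.size(), from_next,
               &buf[0], &buf[0] + buf.size(), to_next);
  if (result != std::codecvt_base::ok) return false;
  buf.resize(static_cast<std::size_t>(to_next - &buf[0]));
  *out = buf;
  return true;
}

// Step 2 (hypothetical helper): UTF-16 -> execution encoding of the current
// C locale. Surrogate pairs only work if c16rtomb implements DR488.
bool Utf16ToExecution(const std::u16string& in, std::string* out) {
  std::mbstate_t state{};
  std::string result;
  char buf[MB_LEN_MAX];
  for (char16_t c : in) {
    const std::size_t n = std::c16rtomb(buf, c, &state);
    if (n == static_cast<std::size_t>(-1)) return false;  // not representable
    result.append(buf, n);
  }
  *out = result;
  return true;
}
```

Characters that the execution encoding cannot represent make the second step fail, so a real printer would still need a fallback such as escaping.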
C++20 will add new mbrtoc8 and c8rtomb functions to enable direct conversion between UTF-8 and the execution encoding, as proposed in P0482.
SG16 intends to standardize new transcoding interfaces for C++, hopefully for C++23.
@tahonermann: That's great information, thank you!
I think there's still an open design question for the GoogleTest maintainers (which I'm not one of, I work on Abseil): can GoogleTest assume that all std::strings are UTF-8 when the user's locale supports UTF-8? I don't think there's a clear-cut answer, because the programmer may be using the std::string to hold a raw sequence of bytes, and those bytes may not be legal UTF-8.
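One way to handle the "may not be legal UTF-8" concern is to check well-formedness before deciding how to print. The following is a minimal sketch of such a check; IsValidUtf8 is a hypothetical name, not something GoogleTest is known to provide. It rejects overlong encodings, surrogate code points, and values above U+10FFFF.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: returns true if `s` is well-formed UTF-8.
bool IsValidUtf8(const std::string& s) {
  // Smallest code point that legitimately needs 1, 2, 3, or 4 bytes.
  static const unsigned long kMinForLength[] = {0, 0, 0x80, 0x800, 0x10000};
  const std::size_t n = s.size();
  std::size_t i = 0;
  while (i < n) {
    const unsigned char lead = static_cast<unsigned char>(s[i]);
    std::size_t len = 0;
    unsigned long cp = 0;
    if (lead < 0x80) {
      len = 1; cp = lead;
    } else if ((lead & 0xE0) == 0xC0) {
      len = 2; cp = lead & 0x1F;
    } else if ((lead & 0xF0) == 0xE0) {
      len = 3; cp = lead & 0x0F;
    } else if ((lead & 0xF8) == 0xF0) {
      len = 4; cp = lead & 0x07;
    } else {
      return false;  // stray continuation byte or invalid lead byte
    }
    if (n - i < len) return false;  // sequence truncated at end of string
    for (std::size_t k = 1; k < len; ++k) {
      const unsigned char b = static_cast<unsigned char>(s[i + k]);
      if ((b & 0xC0) != 0x80) return false;  // not a continuation byte
      cp = (cp << 6) | (b & 0x3F);
    }
    if (cp < kMinForLength[len]) return false;        // overlong encoding
    if (cp >= 0xD800 && cp <= 0xDFFF) return false;   // UTF-16 surrogate
    if (cp > 0x10FFFF) return false;                  // beyond Unicode range
    i += len;
  }
  return true;
}
```

A byte string that happens to pass this check may still not be text, so validation alone does not settle the design question; it only rules out printing obviously broken sequences as is.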
I'm not familiar with how GoogleTest is implemented, but naively speaking, one possible solution would be to give the programmer the ability to explicitly state that some std::string is UTF-8. It's probably possible to do this with a thin wrapper class:
EXPECT_EQ(Utf8String(reference), Utf8String(candidate));
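As a rough illustration of what such a wrapper might look like: Utf8String below is hypothetical and only sketches the idea. It relies on the documented googletest customization point of a PrintTo(const T&, std::ostream*) function found in the same namespace as the type; the block itself does not include gtest, so it compiles on its own.

```cpp
#include <ostream>
#include <sstream>
#include <string>
#include <utility>

// Hypothetical thin wrapper that marks a byte string as UTF-8 text.
class Utf8String {
 public:
  explicit Utf8String(std::string bytes) : bytes_(std::move(bytes)) {}
  const std::string& bytes() const { return bytes_; }

  // Comparison is plain byte equality; no Unicode normalization is attempted.
  friend bool operator==(const Utf8String& a, const Utf8String& b) {
    return a.bytes_ == b.bytes_;
  }
  friend bool operator!=(const Utf8String& a, const Utf8String& b) {
    return !(a == b);
  }

 private:
  std::string bytes_;
};

// googletest prefers a PrintTo overload for the value's type when one exists.
// This one writes the bytes unescaped, in quotes.
void PrintTo(const Utf8String& s, std::ostream* os) {
  *os << '"' << s.bytes() << '"';
}

// Demonstration helper (hypothetical): what a failure message would contain.
std::string PrintedForm(const Utf8String& s) {
  std::ostringstream os;
  PrintTo(s, &os);
  return os.str();
}
```

Writing the bytes unescaped is only appropriate when the output actually expects UTF-8, so a fuller version would combine this with a locale check or a transcoding step.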
SG16 advocates adopting an internal encoding model in which text is converted (based on locale settings) on input/output to/from a character encoding that is known at compile time; see P1238. C++20 will have a new char8_t character type as well as a new std::u8string type alias for a new specialization of std::basic_string for holding UTF-8 text, as adopted via P0482.
I mention this as a way of agreeing with the notion of putting the encoding information into the type system (per the Utf8String suggestion). Use of the type system will help to catch cases where character encoding conversion is absent.
@gennadiycivil: to summarize, I believe there are two separate but closely related possible feature requests here:
(1) googletest could provide a way for a user to specify how strings are encoded so that they can be printed correctly
(2) googletest could treat strings as UTF-8 by default
Either (1) or (2) would address this issue, but they're not mutually exclusive.
@gpakosz: does that sound like an accurate summary to you?
A Utf8String() wrapper doesn't really help in an established test codebase.
Ideally, googletest would have global state depending on locale and terminal capabilities, and then PrintStringTo() would inspect that at most once per string.
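A minimal sketch of that idea, with every name hypothetical and the detection reduced to a crude placeholder: the capability is computed once per process, and the formatting function consults it once per string instead of classifying every byte. EscapeNonAscii is a simplified stand-in for byte-wise escaping, not the actual gtest code.

```cpp
#include <cstdio>
#include <cstdlib>
#include <string>

// Placeholder detection, evaluated once per process. A real implementation
// would consult the locale and terminal capabilities more carefully.
bool OutputIsUtf8() {
  static const bool cached = [] {
    const char* lang = std::getenv("LANG");
    return lang != nullptr &&
           std::string(lang).find("UTF-8") != std::string::npos;
  }();
  return cached;
}

// Simplified stand-in for byte-wise escaping: printable ASCII is kept,
// every other byte becomes a \xHH escape.
std::string EscapeNonAscii(const std::string& s) {
  std::string out;
  for (unsigned char c : s) {
    if (c >= 0x20 && c < 0x7F) {
      out += static_cast<char>(c);
    } else {
      char buf[5];  // backslash, 'x', two hex digits, terminating NUL
      std::snprintf(buf, sizeof buf, "\\x%02X", static_cast<unsigned>(c));
      out += buf;
    }
  }
  return out;
}

// The encoding decision is made once for the whole string.
std::string FormatStringForDisplay(const std::string& s, bool output_is_utf8) {
  return output_is_utf8 ? s : EscapeNonAscii(s);
}
```

In use, the flag would come from the cached state, as in FormatStringForDisplay(s, OutputIsUtf8()). Passing it as a parameter keeps the formatting logic testable without touching the environment. The UTF-8 branch here passes control characters through untouched, which a real printer would probably still want to escape.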
@mbxx Can I suggest a third way: use std::wstring and wchar_t* throughout?
I did some experimental changes, and it seems feasible.
I, too, am struggling with outputting Chinese to the console and to the XML report (--gtest_output=xml:a.xml).
The way PrintStringTo calls into PrintCharsAsStringTo, which in turn loops over characters and calls PrintAsCharLiteralTo, makes it impossible to print a UTF-8 string to a UTF-8 capable terminal. Looping over each byte/char and asking IsPrintableAscii() just isn't the way to go, and it reflects well the sorry state of C++ with respect to Unicode and UTF-8 in particular.