PrintStringTo and PrintAsCharLiteralTo make it impossible to print valid UTF-8 to console

gpakosz commented 6 years ago

PrintStringTo calls PrintCharsAsStringTo, which in turn loops over the characters and calls PrintAsCharLiteralTo on each one. This makes it impossible to print a UTF-8 string to a UTF-8 capable terminal.

Looping over each byte/char and asking IsPrintableAscii() just isn't the way to go, and it reflects the sorry state of C++ with respect to Unicode, and UTF-8 in particular.

gennadiycivil commented 6 years ago

@gpakosz, thank you. Do you have a proposal to consider?

gpakosz commented 5 years ago

Well, we store UTF-8 strings in std::string (that is, std::basic_string<char>), because we do everything in UTF-8 and there's no better representation in C++.

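Iterating over the string one code unit (byte) at a time and calling IsPrintableAscii() on each doesn't make sense, because a non-ASCII character spans several bytes and none of them is printable ASCII on its own.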

But since we're not gtest architects, and didn't find a clean way to improve the situation, we just hacked our gtest-printers.cc copy to always "print as is":

template <typename UnsignedChar, typename Char>
static CharFormat PrintAsCharLiteralTo(Char c, ostream* os) {
  switch (static_cast<wchar_t>(c)) {
    case L'\0':
      *os << "\\0";
      break;
    case L'\'':
      *os << "\\'";
      break;
    case L'\\':
      *os << "\\\\";
      break;
    case L'\a':
      *os << "\\a";
      break;
    case L'\b':
      *os << "\\b";
      break;
    case L'\f':
      *os << "\\f";
      break;
    case L'\n':
      *os << "\\n";
      break;
    case L'\r':
      *os << "\\r";
      break;
    case L'\t':
      *os << "\\t";
      break;
    case L'\v':
      *os << "\\v";
      break;
    default:
      *os << static_cast<char>(c);
      return kAsIs;
  }
  return kSpecialEscape;
}
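One possible middle ground between escaping every non-ASCII byte and printing everything as is would be to pass through only well-formed UTF-8 sequences and hex-escape the rest. The sketch below is an illustration only: Utf8SequenceLength, WriteHexEscape, PrintUtf8AwareStringTo and FormatUtf8AwareString are hypothetical names, not googletest functions, and the escape format is an assumption rather than what googletest emits.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdio>
#include <ostream>
#include <sstream>
#include <string>

// Hypothetical helper: returns the length (1-4) of the well-formed UTF-8
// sequence starting at s[i], or 0 if the bytes there are not well-formed.
inline std::size_t Utf8SequenceLength(const std::string& s, std::size_t i) {
  const auto byte = [&s](std::size_t k) {
    return static_cast<unsigned char>(s[k]);
  };
  const unsigned char b0 = byte(i);
  if (b0 <= 0x7F) return 1;            // ASCII
  std::size_t len = 0;
  unsigned char lo = 0x80, hi = 0xBF;  // allowed range for the second byte
  if (b0 >= 0xC2 && b0 <= 0xDF) {
    len = 2;
  } else if (b0 >= 0xE0 && b0 <= 0xEF) {
    len = 3;
    if (b0 == 0xE0) lo = 0xA0;         // reject overlong encodings
    if (b0 == 0xED) hi = 0x9F;         // reject UTF-16 surrogates
  } else if (b0 >= 0xF0 && b0 <= 0xF4) {
    len = 4;
    if (b0 == 0xF0) lo = 0x90;         // reject overlong encodings
    if (b0 == 0xF4) hi = 0x8F;         // reject values above U+10FFFF
  } else {
    return 0;                          // stray continuation or invalid lead
  }
  if (i + len > s.size()) return 0;    // truncated sequence
  if (byte(i + 1) < lo || byte(i + 1) > hi) return 0;
  for (std::size_t k = 2; k < len; ++k) {
    if (byte(i + k) < 0x80 || byte(i + k) > 0xBF) return 0;
  }
  return len;
}

// Hypothetical helper: writes one byte as \xNN.
inline void WriteHexEscape(unsigned char b, std::ostream* os) {
  char buf[8];
  std::snprintf(buf, sizeof(buf), "\\x%02X", static_cast<unsigned>(b));
  *os << buf;
}

// Prints s in double quotes. Well-formed multi-byte UTF-8 is copied
// unchanged; control characters and invalid bytes are escaped.
inline void PrintUtf8AwareStringTo(const std::string& s, std::ostream* os) {
  assert(os != nullptr);
  *os << '"';
  std::size_t i = 0;
  while (i < s.size()) {
    const std::size_t len = Utf8SequenceLength(s, i);
    if (len > 1) {
      os->write(s.data() + i, static_cast<std::streamsize>(len));
      i += len;
      continue;
    }
    const unsigned char b = static_cast<unsigned char>(s[i]);
    switch (b) {
      case '"':  *os << "\\\""; break;
      case '\\': *os << "\\\\"; break;
      case '\n': *os << "\\n";  break;
      case '\r': *os << "\\r";  break;
      case '\t': *os << "\\t";  break;
      default:
        if (len == 1 && b >= 0x20 && b < 0x7F) {
          *os << static_cast<char>(b);   // printable ASCII
        } else {
          WriteHexEscape(b, os);         // control or invalid byte
        }
    }
    ++i;
  }
  *os << '"';
}

// Convenience wrapper that returns the formatted text.
inline std::string FormatUtf8AwareString(const std::string& s) {
  std::ostringstream os;
  PrintUtf8AwareStringTo(s, &os);
  return os.str();
}
```

Under this approach a std::string that holds raw bytes still prints unambiguously, because anything that is not well-formed UTF-8 falls back to a hex escape. Whether the terminal can actually display the bytes that are passed through is a separate question.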

mbxx commented 5 years ago

@gpakosz : It sounds like you're either requesting something like a PrintUtf8StringTo function or the ability to tell googletest to treat all strings as though they're UTF-8, and I'm not quite sure which. In an ideal world, what would you like to see googletest provide?

If you happen to have some simple example code that can illustrate what you'd like to see work, that might clear things up very quickly.

gpakosz commented 5 years ago

@mbxx The concrete case is EXPECT_EQ(s1, s2) where s1 and s2 are two std::string objects holding strings encoded in UTF-8.

vs

tahonermann commented 5 years ago

I got directed to this issue from a tweet (https://twitter.com/gpakosz/status/1062632944576184320).

In general, there is no portable way to directly write UTF-8 to a terminal/console. On POSIX systems, writing the code units will work if the LC_CTYPE (or LC_ALL or LANG) environment variable is set to select a UTF-8 enabled locale and the terminal emulator is appropriately configured to expect UTF-8 encoded text and has an appropriate font selected. On Windows, there is no general solution right now, though Microsoft is working on a solution (https://blogs.msdn.microsoft.com/commandline/2018/11/15/windows-command-line-unicode-and-utf-8-output-text-buffer/).

For now, the best approach is to honor the user's locale settings and transcode UTF-8 text to the execution encoding prior to streaming it to the terminal/console. Doing otherwise effectively results in mojibake since the terminal/console is not expecting UTF-8 unless specifically configured for it.
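As a rough illustration of the POSIX side of this, the sketch below inspects the LC_ALL, LC_CTYPE and LANG environment variables in that order of precedence and checks whether the selected locale name carries a UTF-8 codeset. The function names are hypothetical, and the check is only a heuristic: it says nothing about how the terminal emulator or its font is configured, and a locale name without an explicit codeset is treated as not UTF-8. On POSIX systems, calling setlocale(LC_CTYPE, "") and then nl_langinfo(CODESET) is likely the more reliable query.

```cpp
#include <algorithm>
#include <cctype>
#include <cstdlib>
#include <initializer_list>
#include <string>

// Hypothetical helper: picks the value that governs LC_CTYPE.
// LC_ALL overrides LC_CTYPE, which overrides LANG; empty means unset.
inline std::string EffectiveCtypeLocale(const char* lc_all,
                                        const char* lc_ctype,
                                        const char* lang) {
  for (const char* value : {lc_all, lc_ctype, lang}) {
    if (value != nullptr && *value != '\0') return value;
  }
  return "";
}

// Hypothetical helper: true if a name such as "en_US.UTF-8" or
// "ca_ES.utf8@valencia" names a UTF-8 codeset.
inline bool LocaleNameSelectsUtf8(std::string name) {
  std::transform(name.begin(), name.end(), name.begin(), [](unsigned char c) {
    return static_cast<char>(std::tolower(c));
  });
  const std::string::size_type dot = name.find('.');
  if (dot == std::string::npos) return false;  // no explicit codeset
  const std::string::size_type at = name.find('@', dot);
  const std::string codeset = name.substr(
      dot + 1, at == std::string::npos ? std::string::npos : at - dot - 1);
  return codeset == "utf-8" || codeset == "utf8";
}

// Reads the real environment of the current process.
inline bool EnvironmentSelectsUtf8() {
  return LocaleNameSelectsUtf8(EffectiveCtypeLocale(
      std::getenv("LC_ALL"), std::getenv("LC_CTYPE"), std::getenv("LANG")));
}
```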

Unfortunately, C and C++ make conversion from UTF-8 painful at present. Sticking to standard provided interfaces means using codecvt<char16_t, char, mbstate_t> to convert from UTF-8 to UTF-16 and then using c16rtomb (using an implementation that implements C11's DR488) to convert from UTF-16 to the execution encoding.
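The two-step conversion described above might look roughly like the sketch below. It is an illustration under several assumptions: the helper names are hypothetical, the standard library must provide std::c16rtomb in <cuchar>, the codecvt<char16_t, char, mbstate_t> facet is deprecated as of C++20 and may produce warnings, and the result depends on the locale that is active when the function runs (callers would typically call std::setlocale(LC_ALL, "") first).

```cpp
#include <climits>
#include <cstddef>
#include <cuchar>
#include <cwchar>
#include <locale>
#include <stdexcept>
#include <string>

// std::codecvt has a protected destructor, so a small derived type is
// needed to create one directly. Hypothetical name.
struct Utf8Utf16Codecvt : std::codecvt<char16_t, char, std::mbstate_t> {
  ~Utf8Utf16Codecvt() {}
};

// Step 1: UTF-8 to UTF-16 through the standard codecvt facet.
inline std::u16string Utf8ToUtf16(const std::string& in) {
  if (in.empty()) return std::u16string();
  Utf8Utf16Codecvt cvt;
  std::mbstate_t state{};
  // UTF-16 never needs more code units than UTF-8 needs bytes.
  std::u16string out(in.size(), u'\0');
  const char* from_next = nullptr;
  char16_t* to_next = nullptr;
  const std::codecvt_base::result result =
      cvt.in(state, in.data(), in.data() + in.size(), from_next,
             &out[0], &out[0] + out.size(), to_next);
  if (result != std::codecvt_base::ok) {
    throw std::runtime_error("input is not well-formed UTF-8");
  }
  out.resize(static_cast<std::size_t>(to_next - &out[0]));
  return out;
}

// Step 2: UTF-16 to the execution (locale) encoding through c16rtomb.
inline std::string Utf16ToExecutionEncoding(const std::u16string& in) {
  std::mbstate_t state{};
  std::string out;
  char buf[MB_LEN_MAX];
  for (const char16_t unit : in) {
    const std::size_t n = std::c16rtomb(buf, unit, &state);
    if (n == static_cast<std::size_t>(-1)) {
      throw std::runtime_error("not representable in the current locale");
    }
    out.append(buf, n);  // n may be 0 after a leading surrogate
  }
  return out;
}

inline std::string Utf8ToExecutionEncoding(const std::string& in) {
  return Utf16ToExecutionEncoding(Utf8ToUtf16(in));
}
```

In the default "C" locale only ASCII is likely to survive the second step; anything else raises the exception, which a printer could catch in order to fall back to escaping.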

C++20 will add new mbrtoc8 and c8rtomb functions to enable direct conversion between UTF-8 and the execution encoding as proposed in P0482.

SG16 intends to standardize new transcoding interfaces for C++, hopefully for C++23.

mbxx commented 5 years ago

@tahonermann: That's great information, thank you!

I think there's still an open design question for the GoogleTest maintainers (which I'm not one of, I work on Abseil): can GoogleTest assume that all std::strings are UTF-8 when the user's locale supports UTF-8? I don't think there's a clear-cut answer, because the programmer may be using the std::string to hold a raw sequence of bytes, and those bytes may not be legal UTF-8.

I'm not familiar with how GoogleTest is implemented, but naively speaking, one possible solution would be to give the programmer the ability to explicitly state that some std::string is UTF-8. It's probably possible to do this with a thin wrapper class:

EXPECT_EQ(Utf8String(reference), Utf8String(candidate));

tahonermann commented 5 years ago

SG16 advocates adopting an internal encoding model in which text is converted (based on locale settings) on input/output to/from a character encoding that is known at compile-time, see P1238. C++20 will have a new char8_t character type as well as a new std::u8string type alias for a new specialization of std::basic_string for holding UTF-8 text as adopted via P0482.

I mention this as a way of agreeing with the notion of putting the encoding information into the type system (per the Utf8String suggestion). Use of the type system will help to catch cases where character encoding conversion is absent.

mbxx commented 5 years ago

@gennadiycivil: to summarize, I believe there are two separate but closely related possible feature requests here:

(1) googletest could provide a way for a user to specify how strings are encoded so that they can be printed correctly
(2) googletest could treat strings as UTF-8 by default

Either (1) or (2) would address this issue, but they're not mutually exclusive.

@gpakosz: does that sound like an accurate summary to you?

gpakosz commented 5 years ago

A Utf8String() wrapper doesn't really help in an established test codebase, where every existing assertion would have to be edited to use it.

Ideally, googletest would have global state depending on locale and terminal capabilities and then PrintStringTo() would inspect that at most once per string.
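To make the idea concrete, here is a small sketch of process-wide state that is computed once and then consulted cheaply for each string. All names are hypothetical and none of this exists in googletest. The detection only looks at locale environment variables; a real implementation would presumably also check terminal capabilities, which this sketch does not attempt.

```cpp
#include <cstdlib>
#include <initializer_list>
#include <string>

enum class StringPrintMode { kEscapeNonAscii, kPassThroughUtf8 };

// Counts detection runs; present only to make the caching observable.
inline int& DetectionRunCount() {
  static int count = 0;
  return count;
}

// Hypothetical detection: the first non-empty variable among LC_ALL,
// LC_CTYPE and LANG decides the mode.
inline StringPrintMode DetectStringPrintMode() {
  ++DetectionRunCount();
  for (const char* name : {"LC_ALL", "LC_CTYPE", "LANG"}) {
    const char* value = std::getenv(name);
    if (value == nullptr || *value == '\0') continue;
    const std::string locale(value);
    const bool utf8 = locale.find("UTF-8") != std::string::npos ||
                      locale.find("utf8") != std::string::npos;
    return utf8 ? StringPrintMode::kPassThroughUtf8
                : StringPrintMode::kEscapeNonAscii;
  }
  return StringPrintMode::kEscapeNonAscii;  // conservative default
}

// The function-local static is initialized exactly once, and that
// initialization is thread-safe since C++11.
inline StringPrintMode GlobalStringPrintMode() {
  static const StringPrintMode mode = DetectStringPrintMode();
  return mode;
}
```

A string printer could then read GlobalStringPrintMode() once at the top of the function and branch on the result, instead of classifying every byte on its own.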

kingsimba commented 5 years ago

@mbxx Can I suggest a third way: using std::wstring and wchar_t* throughout? I made some experimental changes, and it seems feasible. I, too, am struggling to output Chinese text to the console and to the XML report (--gtest_output=xml:a.xml).

google / googletest

PrintStringTo and PrintAsCharLiteralTo make it impossible to print valid UTF-8 to console #1957