llvm / llvm-project

The LLVM Project is a collection of modular and reusable compiler and toolchain technologies.
http://llvm.org

libc++ locales that compare equal but behave differently #53797

Open hubert-reinterpretcast opened 2 years ago

hubert-reinterpretcast commented 2 years ago

libc++ does not normalize the names of constructed locales, so it is possible to create locales whose names are not self-encapsulated (that is, names such as "" whose meaning depends on the environment at the time of construction). Such locales can behave differently yet compare equal.

Presumably, passing a locale with such a name to locale::global() could also set the C locale to something that behaves differently from the locale passed to the call.
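One way to see what name the library stored for an environment-dependent locale is to print it. The following is only a minimal sketch: env_locale_name is a hypothetical helper (not part of libc++ or the standard library), and it assumes a POSIX system where setenv is available.

```cpp
#include <locale>
#include <stdlib.h>
#include <string>

// Hypothetical helper for illustration only: point LC_ALL at the given
// value, construct the environment-dependent locale "", and return the
// name the library recorded for it.
// Assumes POSIX setenv; std::locale("") throws std::runtime_error if the
// requested locale is not installed.
std::string env_locale_name(const char *lc_all) {
  setenv("LC_ALL", lc_all, 1);
  return std::locale("").name();
}
```

Calling env_locale_name("C") and then env_locale_name("en_US.UTF-8") (the latter assumes that locale is installed) shows whether the two locales end up with distinct names. If both calls return the same string, the locales cannot be told apart by name even though they were constructed from different environments.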

Online compiler link: https://godbolt.org/z/YfobMx41o

Source (<stdin>):

#define _POSIX_C_SOURCE 200112L
#include <cassert>
#include <ctype.h>
#include <locale>
#include <stdio.h>
#include <stdlib.h>
int main(void) {
  setenv("LC_ALL", "en_US.UTF-8", 1);
  std::locale America("");

  setenv("LC_ALL", "C", 1);
  std::locale Generic("");

  const char str[] = "@0Aa";
  const auto morphed =
      std::use_facet<std::collate<char>>(America).transform(str, str + 4);
  const auto genericized =
      std::use_facet<std::collate<char>>(Generic).transform(str, str + 4);
  const auto printbin = [](const auto &str) {
    for (char c : str) {
      if (!::isprint(c)) {
        fprintf(stderr, "\\x%02x", (unsigned)(unsigned char)c);
        continue;
      }
      fprintf(stderr, "%c", c);
    }
    fprintf(stderr, "\n");
  };
  printbin(morphed);
  printbin(genericized);
  assert((America != Generic) || (morphed == genericized));
}

Compiler invocation:

clang++ -stdlib=libc++ -Wall -Wextra -pedantic-errors -o a.out -x c++ -

Run invocation command:

./a.out

Actual run output:

GQQ\x01\x02\x02\x02\x01\x02\x07\x02\x01\x01\xc7\xa5\x01\xe2\x82\x9b\x01\xe2\xab\x8e\x01\xe2\x94\x92
@0Aa
a.out: <stdin>:31: int main(): Assertion `(America != Generic) || (morphed == genericized)' failed.

Expected run output:

GQQ\x01\x02\x02\x02\x01\x02\x07\x02\x01\x01\xc7\xa5\x01\xe2\x82\x9b\x01\xe2\xab\x8e\x01\xe2\x94\x92
@0Aa

Compiler version info (clang++ -v):

clang version 15.0.0 (https://github.com/llvm/llvm-project.git bf2f72fa10e3469b4f1bc6a85129c7074c65cfb2)
Target: x86_64-unknown-linux-gnu
Thread model: posix
InstalledDir: /opt/wandbox/clang-head/bin
Found candidate GCC installation: /usr/lib/gcc/x86_64-linux-gnu/9
Selected GCC installation: /usr/lib/gcc/x86_64-linux-gnu/9
Candidate multilib: .;@m64
Selected multilib: .;@m64

jsonn commented 2 years ago

I would say that if you mix setenv with the "" locale name, you get what you asked for. IMO, libc++ behaves perfectly sensibly here.