CPPToCol assumes integer value of character literals equals Unicode code point in C++

The CPPToCol transform uses the Unicode code point value of a C++ ordinary character literal to come up with a corresponding integer value. This works in many cases (Latin alphabet, default compiler flags) but is not guaranteed by the spec. Specifically:

The actual integer value depends on what the "execution character set" is, which can be set at compile time: to EBCDIC on mainframes, and so on.
Compilers are allowed to interpet the (otherwise invalid) case of multiple chars in one character literal as a "multicharacter literal" with some implementation-defined value, a feature inherited by C++ from C, by C from B, and predating Unicode.

I don't estimate that many actual programs will run into this, but I've put some contrived examples below; a possible way to resolve this would be to restrict ordinary character literals to a single ordinary character, and documenting that the execution charset is assumed to be UTF-8.

The following is ill-formed, but verifies as C++:

//@ ensures \result == 1;
int main() {
  char ch = 'œ';
  return (ch == 339) ? 1 : 2;
}

VerCors sees a single character œ, it has Unicode code point value 0x153, which is 339 decimal, so we pick the true branch and the postcondition holds. g++ on my machine instead sees...

The œ part is stored as two bytes, 0xc5 and 0x93.
This makes it a multi-character character constant, so an int (32 bits on my machine).
g++ decides to give it the implementation-defined value 0x0000c593 or 50579 decimal.
This would overflow char, which g++ has decided is 8 bit signed; g++ decides to wrap to -109.
The value -109 cannot equal to 339, so the function is equivalent to return 2; and compiled as such.

Another implementation might decide they prefer the value 42 instead.

As another example, this program is valid C++ and verifies, but whether the postcondition holds depends on the selected execution charset:

//@ ensures \result == 1;
int main() {
  return ('e' == 101) ? 1 : 2;
}

$ g++ example.cpp && ./a.out
$ echo $?
1
$ g++ -fexec-charset=IBM037 example.cpp && ./a.out
$ echo $?
2

utwente-fmt / vercors

CPPToCol assumes integer value of character literals equals Unicode code point in C++ #1243