utwente-fmt / vercors

The VerCors verification toolset for verifying parallel and concurrent software
https://utwente.nl/vercors
Mozilla Public License 2.0
58 stars 26 forks source link

CPPToCol assumes integer value of character literals equals Unicode code point in C++ #1243

Closed wandernauta closed 2 months ago

wandernauta commented 2 months ago

The CPPToCol transform uses the Unicode code point value of a C++ ordinary character literal to come up with a corresponding integer value. This works in many cases (Latin alphabet, default compiler flags) but is not guaranteed by the spec. Specifically:

I don't estimate that many actual programs will run into this, but I've put some contrived examples below; a possible way to resolve this would be to restrict ordinary character literals to a single ordinary character, and documenting that the execution charset is assumed to be UTF-8.


The following is ill-formed, but verifies as C++:

//@ ensures \result == 1;
int main() {
  char ch = 'œ';
  return (ch == 339) ? 1 : 2;
}

VerCors sees a single character œ, it has Unicode code point value 0x153, which is 339 decimal, so we pick the true branch and the postcondition holds. g++ on my machine instead sees...

  1. The œ part is stored as two bytes, 0xc5 and 0x93.
  2. This makes it a multi-character character constant, so an int (32 bits on my machine).
  3. g++ decides to give it the implementation-defined value 0x0000c593 or 50579 decimal.
  4. This would overflow char, which g++ has decided is 8 bit signed; g++ decides to wrap to -109.
  5. The value -109 cannot equal to 339, so the function is equivalent to return 2; and compiled as such.

Another implementation might decide they prefer the value 42 instead.

As another example, this program is valid C++ and verifies, but whether the postcondition holds depends on the selected execution charset:

//@ ensures \result == 1;
int main() {
  return ('e' == 101) ? 1 : 2;
}
$ g++ example.cpp && ./a.out
$ echo $?
1
$ g++ -fexec-charset=IBM037 example.cpp && ./a.out
$ echo $?
2
bobismijnnaam commented 2 months ago

a possible way to resolve this would be to restrict ordinary character literals to a single ordinary character, and documenting that the execution charset is assumed to be UTF-8.

I agree. If anyone wants a more insane contract they can write a proposal :)