usethesource / rascal

The implementation of the Rascal meta-programming language (including interpreter, type checker, parser generator, compiler and JVM based run-time system)
http://www.rascal-mpl.org

char-class type reification does something wrong for high surrogate/low surrogate pairs #2009

Open jurgenvinju opened 1 month ago

jurgenvinju commented 1 month ago
rascal>charAt("🍝", 0)
int: 127837
rascal>char(127837)
[🍝]: ([🍝]) `🍝`
rascal>#[🍝]
type[[?]]: type(
  \char-class([range(55356,55356)]),...

So the Unicode codepoint 127837 is the right codepoint for 🍝, but type reification of a character class type turns it into codepoint 55356, which does not even have a graphical representation in the current font: ?. Maybe it's not even a codepoint.

jurgenvinju commented 1 month ago

This goes wrong in the runtime system of the interpreter, which shares a lot of code with the runtime code for the compiler, so this probably breaks everywhere.

DavyLandman commented 1 month ago

@jurgenvinju as it's represented as a surrogate pair, what tends to go wrong is that you only see the high surrogate part of the pair. In UTF-16, 🍝 is encoded as 0xD83C followed by 0xDF5D. 55356 in hex is 0xD83C.
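
To make the numbers concrete, here is a small standalone Java sketch (not taken from the Rascal code base; the class and method names are made up for illustration) that splits codepoint 127837 into its UTF-16 code units using only `java.lang.Character`:

```java
public class SurrogatePairDemo {
    // Hypothetical helper: returns the UTF-16 code units for a code point.
    // For code points above 0xFFFF this is a high/low surrogate pair.
    static char[] toUtf16(int codePoint) {
        return Character.toChars(codePoint);
    }

    public static void main(String[] args) {
        int pasta = 127837; // 0x1F35D, the code point reported by charAt("🍝", 0) in Rascal
        char[] units = toUtf16(pasta);

        // Prints: high=0xD83C low=0xDF5D
        System.out.printf("high=0x%X low=0x%X%n", (int) units[0], (int) units[1]);

        // 55356 (0xD83C) is only the first half of the pair, not a character on its own.
        System.out.println(Character.isHighSurrogate((char) 55356)); // true

        // Recombining both halves gives back the original code point: 127837
        System.out.println(Character.toCodePoint(units[0], units[1]));
    }
}
```

So 55356 is a valid UTF-16 code unit, but on its own it does not denote a character, which would explain the `?` in the REPL output.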

So the bug is: somewhere the type generation takes a Java string and gets the first char (charAt(0) I assume), instead of correctly using codePointAt(0).
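
A minimal sketch of the suspected difference, assuming the code in question reads the first character from a `java.lang.String`. The class and method names below are hypothetical and do not come from the Rascal sources:

```java
public class FirstCodePointDemo {
    // Suspected buggy pattern: charAt returns a single UTF-16 code unit,
    // so a supplementary character yields only its high surrogate.
    static int firstCharUnit(String s) {
        return s.charAt(0);
    }

    // Suggested pattern: codePointAt combines a surrogate pair into one code point.
    static int firstCodePoint(String s) {
        return s.codePointAt(0);
    }

    public static void main(String[] args) {
        String pasta = "\uD83C\uDF5D"; // 🍝 written as escapes to avoid source-encoding issues

        System.out.println(firstCharUnit(pasta));  // 55356
        System.out.println(firstCodePoint(pasta)); // 127837

        // The string has two chars but only one code point.
        System.out.println(pasta.length());                          // 2
        System.out.println(pasta.codePointCount(0, pasta.length())); // 1
    }
}
```

If a range is built from each character of a string, iterating with `String.codePoints()` rather than indexing by `char` avoids the same problem for the rest of the string.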

jurgenvinju commented 1 month ago

Then this is the cause: https://github.com/usethesource/rascal/blob/a23db0e94de06ecdd82963757c10ce087ff44a23/src/org/rascalmpl/values/parsetrees/SymbolFactory.java#L363