visq / language-c

Source repository for https://hackage.haskell.org/package/language-c
http://visq.github.io/language-c/

GCC preprocessor output generated in non-ASCII locales cannot be processed #72

Open arsdragonfly opened 4 years ago

arsdragonfly commented 4 years ago

see this issue

expipiplus1 commented 4 years ago

Hopefully this should just be a simple change in the lexer. PRs welcome!

mtolly commented 3 years ago

So, I looked into this and I think I found the fix, but Alex might need to release a bug fix first.

I saved the sample from the linked issue as a UTF-8 file:

# 1 "test.c"
# 1 "<built-in>"
# 1 "<命令行>"
# 31 "<命令行>"
# 1 "/usr/include/stdc-predef.h" 1 3 4
# 32 "<命令行>" 2
# 1 "test.c"
int main()
{
 return 0;
}

And sure enough, I got Prelude.head: empty list. The error comes from the second use of head at this location, and is triggered by the first non-ASCII line, # 1 "<命令行>".

Basically, the problem is that Alex assumes the input bytestring is UTF-8 and counts code points, while the InputStream is a byte-by-byte abstraction (effectively Latin-1). In these lines:

\#$space*@digits$space*(\"($infname|@charesc)*\"$space*)?(@int$space*)*\r?$eol
  { \pos len str -> setPos (adjustLineDirective len (takeChars len str) pos) >> lexToken' False }

Alex passes 12 for len, which is the correct Unicode code point length of # 1 "<命令行>" plus the newline at the end. But takeChars then takes 12 bytes off the bytestring, while the line is 18 bytes long in UTF-8 (each of the three CJK characters encodes to 3 bytes). So adjustLineDirective receives a truncated string that does not include the closing double quote.
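To make the mismatch concrete, here is a minimal sketch that uses only base. It is not language-c's code: utf8Encode and takeBytesAsLatin1 are hypothetical helpers written for this illustration, and takeBytesAsLatin1 only assumes that takeChars behaves like "take n bytes and read each one as a Char".

```haskell
import Data.Bits (shiftR, (.&.), (.|.))
import Data.Char (chr, ord)
import Data.Word (Word8)

-- Hand-rolled UTF-8 encoder (hypothetical helper, keeps the sketch base-only).
utf8Encode :: Char -> [Word8]
utf8Encode c
  | n < 0x80    = [fromIntegral n]
  | n < 0x800   = [0xC0 .|. hi 6, cont 0]
  | n < 0x10000 = [0xE0 .|. hi 12, cont 6, cont 0]
  | otherwise   = [0xF0 .|. hi 18, cont 12, cont 6, cont 0]
  where
    n = ord c
    -- Leading-byte payload: the top bits of the code point.
    hi :: Int -> Word8
    hi s = fromIntegral (n `shiftR` s)
    -- Continuation byte: 10xxxxxx with six payload bits.
    cont :: Int -> Word8
    cont s = 0x80 .|. (fromIntegral (n `shiftR` s) .&. 0x3F)

-- The offending line directive from the sample, including its newline.
directive :: String
directive = "# 1 \"<命令行>\"\n"

-- Length in code points, i.e. the len that a UTF-8 aware lexer reports.
codePointLength :: Int
codePointLength = length directive

-- The same line as the byte stream that a byte-oriented input sees.
directiveBytes :: [Word8]
directiveBytes = concatMap utf8Encode directive

-- Assumed stand-in for takeChars: take n bytes, view each as a Latin-1 Char.
takeBytesAsLatin1 :: Int -> [Word8] -> String
takeBytesAsLatin1 n = map (chr . fromIntegral) . take n

-- What the line-directive handler would receive with the code point length.
truncated :: String
truncated = takeBytesAsLatin1 codePointLength directiveBytes

main :: IO ()
main = do
  print codePointLength                       -- 12
  print (length directiveBytes)               -- 18
  print (length (filter (== '"') truncated))  -- 1: the closing quote is cut off
```

The 12 bytes cover `# 1 "<` plus the six bytes of the first two CJK characters, so the file name is never terminated.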

The correct fix is to put Alex back into Latin-1 mode (my impression is that this was the default previously, and that it changed in Alex 3.0). This is done with the %encoding "latin1" directive (added in Alex 3.1.7). However, that alone still doesn't work, because a remaining bug in Alex's character counting causes it to keep passing the too-short length. This was fixed in https://github.com/simonmar/alex/pull/156, but even though that PR was merged a year ago, the fix does not appear to have made it into the recent Alex 3.2.6. So I'll ping that PR to ask when it can be released.