Open arsdragonfly opened 4 years ago
Hopefully this should just be a simple change in the lexer. PRs welcome!
So, I looked into this and I think I found the fix, but Alex might need to release a bug fix first.
I saved the sample from the linked issue as a UTF-8 file:
# 1 "test.c"
# 1 "<built-in>"
# 1 "<命令行>"
# 31 "<命令行>"
# 1 "/usr/include/stdc-predef.h" 1 3 4
# 32 "<命令行>" 2
# 1 "test.c"
int main()
{
return 0;
}
And sure enough, I got `Prelude.head: empty list`. The error comes from the second usage of `head` at this location, and is caused by the first non-ASCII line, `# 1 "<命令行>"`.
Basically the problem is that Alex is assuming the input bytestring is UTF-8, but the `InputStream` is a byte-by-byte abstraction (effectively Latin-1). In these lines:
\#$space*@digits$space*(\"($infname|@charesc)*\"$space*)?(@int$space*)*\r?$eol
{ \pos len str -> setPos (adjustLineDirective len (takeChars len str) pos) >> lexToken' False }
Alex is passing 12 for `len`, which is the correct Unicode codepoint length of `# 1 "<命令行>"` plus a newline at the end. But that line is 18 bytes long in UTF-8, because each of the three CJK characters takes 3 bytes. `takeChars` then takes only 12 bytes off the bytestring, so `adjustLineDirective` receives a broken string which does not include the double quote at the end.
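To make the mismatch concrete, here is a minimal standalone sketch. It does not use language-c's actual `InputStream` or `takeChars`; `takeBytes` is a hypothetical stand-in that, by assumption, treats the length as a byte count, and the `text` and `bytestring` packages are used only for illustration.

```haskell
import qualified Data.ByteString as BS
import qualified Data.Text as T
import qualified Data.Text.Encoding as TE

-- The first non-ASCII line directive from the sample, plus its newline.
directive :: T.Text
directive = T.pack "# 1 \"<命令行>\"\n"

-- The same line as the UTF-8 bytes a byte-oriented input stream would hold.
utf8Bytes :: BS.ByteString
utf8Bytes = TE.encodeUtf8 directive

-- Hypothetical stand-in for a byte-wise take (an assumption, not the
-- real takeChars): the length is interpreted as a number of bytes.
takeBytes :: Int -> BS.ByteString -> BS.ByteString
takeBytes = BS.take

main :: IO ()
main = do
  let len = T.length directive      -- length counted in codepoints
      cut = takeBytes len utf8Bytes -- but consumed as bytes
  print len                         -- 12
  print (BS.length utf8Bytes)       -- 18
  -- 0x22 is the ASCII double quote: only the opening quote survives.
  print (BS.count 0x22 cut)         -- 1
```

The truncated slice ends in the middle of the file name, which is consistent with a parser of the directive running out of input before it finds the closing quote.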
The correct fix is to put Alex back into Latin-1 mode (my impression is that this was the default previously, but was then switched in Alex 3.0). This is done with the `%encoding "latin1"` directive (added in Alex 3.1.7). However, it still doesn't work because there was a remaining bug in character counting that caused it to still pass the too-short length. This was fixed in https://github.com/simonmar/alex/pull/156 but even though that was merged a year ago it appears to not have made it into the recent Alex 3.2.6. So, I'll ping that to see when it can be released.
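As a rough illustration of why a Latin-1 view is consistent with a byte-by-byte stream, the sketch below decodes the same UTF-8 bytes one byte per character. This is not Alex's implementation; it only assumes that in Latin-1 mode every byte counts as exactly one character, which is what `Data.ByteString.Char8.unpack` does.

```haskell
import qualified Data.ByteString as BS
import qualified Data.ByteString.Char8 as BC
import qualified Data.Text as T
import qualified Data.Text.Encoding as TE

-- UTF-8 bytes of the problematic line directive, plus its newline.
directiveBytes :: BS.ByteString
directiveBytes = TE.encodeUtf8 (T.pack "# 1 \"<命令行>\"\n")

-- Latin-1 view: Char8.unpack maps each byte to exactly one Char.
latin1Chars :: String
latin1Chars = BC.unpack directiveBytes

main :: IO ()
main = do
  print (length latin1Chars)       -- 18
  print (BS.length directiveBytes) -- 18
  -- Taking "length in characters" bytes now returns the whole line.
  print (BS.take (length latin1Chars) directiveBytes == directiveBytes) -- True
```

With one character per byte, the length the lexer reports and the number of bytes taken from the stream agree, so the closing quote is no longer cut off.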
see this issue