Closed QuLogic closed 6 years ago
Any chance you can share the output of e.g.
> sourcetools::tokenize_string("\u9b3c")
value row column type
1 鬼 1 1 symbol
I don't have a system with one of these architectures to test yet, so it's hard for me to guess what's going on. Potentially relevant bits of code:
The result is (on ppc64 and ppc64le):
> sourcetools::tokenize_string("\u9b3c")
value row column type
1 \xe9 1 1 invalid
2 \xac 1 2 invalid
3 \xbc 1 3 invalid
Ah, I see isValidForStartOfRSymbol
and isValidForRSymbol
use char ch
and check if ch < 0
, but char
is unsigned on at least ppc64(le), so this check would never be true. The C standard does not specify whether char
is signed or unsigned, so that should not be relied upon here.
Doh! You are of course correct. I'll fix that up.
The released 0.1.7 works perfectly.
On systems that are neither x86_64 nor i686, the "tokenizer accepts UTF-8 symbols" test fails. See this build: aarch64, armv7hl, ppc64, ppc64le, s390x all fail. I looked through the logs for any warnings, but there's only one about inconsistent indent.
I would guess there's some assumption about endianness or bitsize or something, but I'm not sure where to look.