kevinushey / sourcetools

Tools for reading, tokenizing, and parsing R code.
MIT License
77 stars 3 forks source link

Test failure on non-x86 systems #21

Closed QuLogic closed 6 years ago

QuLogic commented 6 years ago

On systems that are neither x86_64 nor i686, the "tokenizer accepts UTF-8 symbols" test fails. See this build: aarch64, armv7hl, ppc64, ppc64le, s390x all fail. I looked through the logs for any warnings, but there's only one about inconsistent indent.

I would guess there's some assumption about endianness or bitsize or something, but I'm not sure where to look.

kevinushey commented 6 years ago

Any chance you can share the output of e.g.

> sourcetools::tokenize_string("\u9b3c")
  value row column   type
1    鬼   1      1 symbol

I don't have a system with one of these architectures to test yet, so it's hard for me to guess what's going on. Potentially relevant bits of code:

https://github.com/kevinushey/sourcetools/blob/0144f2b94a5d3a169103a3533c65705f450c1337/inst/include/sourcetools/tokenization/Tokenizer.h#L408-L409

https://github.com/kevinushey/sourcetools/blob/0144f2b94a5d3a169103a3533c65705f450c1337/inst/include/sourcetools/core/util.h#L110-L116

QuLogic commented 6 years ago

The result is (on ppc64 and ppc64le):

> sourcetools::tokenize_string("\u9b3c")
  value row column    type
1  \xe9   1      1 invalid
2  \xac   1      2 invalid
3  \xbc   1      3 invalid
QuLogic commented 6 years ago

Ah, I see isValidForStartOfRSymbol and isValidForRSymbol use char ch and check if ch < 0, but char is unsigned on at least ppc64(le), so this check would never be true. The C standard does not specify whether char is signed or unsigned, so that should not be relied upon here.

kevinushey commented 6 years ago

Doh! You are of course correct. I'll fix that up.

QuLogic commented 6 years ago

The released 0.1.7 works perfectly.