Right now, the tokenizer proceeds byte by byte and simply assumes that any byte with the upper bit set belongs to a UTF-8 character, which it then treats as valid for use in a symbol.
However, there are many Unicode whitespace characters beyond ASCII space, and the R parser accepts these as delimiters. E.g.
> parse(text = "\u{00A0}")
expression()
Thoughts on how to handle:
Make use of the standard library routines ::mbstowcs() and ::wcstombs(), or iconv functionality, as appropriate.
Places to get inspiration:
https://github.com/wch/r-source/blob/d878101b5239cb4a5fa63da3e0d11b52a0cecba1/src/main/gram.c#L271-L308
https://github.com/wch/r-source/blob/d878101b5239cb4a5fa63da3e0d11b52a0cecba1/src/main/gram.c#L4608-L4631