kevinushey / sourcetools

Tools for reading, tokenizing, and parsing R code.
MIT License

implement proper UTF8 handling #13

Closed: kevinushey closed this issue 8 years ago

kevinushey commented 8 years ago

Right now, the tokenizer proceeds byte by byte and simply treats any byte with the high bit set as part of a UTF-8 character, which it then assumes is valid for use in a symbol.

However, Unicode defines a number of whitespace characters beyond the ASCII set, and the R parser accepts these as token delimiters. E.g.

> parse(text = "\u{00A0}")
expression()

Thoughts on how to handle:

  1. Use the C standard library routines ::mbstowcs() and ::wcstombs().
  2. Use R's own iconv functionality as appropriate.

Places to get inspiration:

https://github.com/wch/r-source/blob/d878101b5239cb4a5fa63da3e0d11b52a0cecba1/src/main/gram.c#L271-L308

https://github.com/wch/r-source/blob/d878101b5239cb4a5fa63da3e0d11b52a0cecba1/src/main/gram.c#L4608-L4631

kevinushey commented 8 years ago

For now, I think I'm just going to force ASCII whitespace, until someone requests the expanded functionality.