kevinushey / sourcetools

Tools for reading, tokenizing, and parsing R code.
MIT License

implement proper UTF8 handling #13

Closed: kevinushey closed this issue 8 years ago

kevinushey commented 8 years ago

Right now, the tokenizer proceeds byte by byte and simply treats any byte with the high bit set as part of a UTF-8 character, which it then assumes is valid for use in a symbol.

However, Unicode defines a number of whitespace characters beyond the ASCII set, and the R parser accepts these as token delimiters. E.g.

> parse(text = "\u{00A0}")
expression()

Thoughts on how to handle:

  1. Use the C standard library routines ::mbstowcs() and ::wcstombs().
  2. Use R's own iconv functionality as appropriate.

Places to get inspiration:

https://github.com/wch/r-source/blob/d878101b5239cb4a5fa63da3e0d11b52a0cecba1/src/main/gram.c#L271-L308

https://github.com/wch/r-source/blob/d878101b5239cb4a5fa63da3e0d11b52a0cecba1/src/main/gram.c#L4608-L4631

kevinushey commented 8 years ago

For now, I think I'm just going to force ASCII whitespace, until someone requests the expanded functionality.