The code as written here only produces ASCII string literals. Lua strings are byte arrays, and Lua tolerates Unicode in source files: \u{XXX} denotes the UTF-8 encoding of the given hex code point, and when files are saved in UTF-8 (and when aren't they??) any non-ASCII characters found in string literal syntax are left in their UTF-8 encoding. Haskell decodes UTF-8 now (hurray!), but that means we need to put the UTF-8 back into the string literals so that the meaning of the code doesn't change.
```
*Language.Lua> let Right x = parseText chunk "local x = '\\u{1000}\"'" in pprint x
local x = '\xe1\x80\x80"'
```
The pretty-printer as updated now always produces valid Lua strings. Printable ASCII characters are left in place, and the Lua escapes "abfnrtv\"'" are used when possible. Otherwise the hex encoding \xXX is used. The hex encoding is preferred because it is fixed length and does not risk merging with the following characters (e.g. a decimal escape "\01" followed by the character "0" would be reread as "\010").
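For illustration, here's a minimal sketch of that escaping strategy; the names are hypothetical, not this PR's actual code. It assumes the input has already been reduced to bytes (every Char is below 256), matching the byte-array view of Lua strings described above.

```haskell
import Data.Char (isAscii, isPrint, ord)
import Text.Printf (printf)

-- Escape one byte (represented as a Char below 256) of a Lua string
-- for a double-quoted literal.
escapeLuaChar :: Char -> String
escapeLuaChar '\a' = "\\a"
escapeLuaChar '\b' = "\\b"
escapeLuaChar '\f' = "\\f"
escapeLuaChar '\n' = "\\n"
escapeLuaChar '\r' = "\\r"
escapeLuaChar '\t' = "\\t"
escapeLuaChar '\v' = "\\v"
escapeLuaChar '\\' = "\\\\"
escapeLuaChar '"'  = "\\\""
escapeLuaChar '\'' = "\\'"
escapeLuaChar c
  | isAscii c && isPrint c = [c]
  -- \xXX is always exactly two hex digits, so it can never merge
  -- with a digit that follows it
  | otherwise              = printf "\\x%02X" (ord c)

escapeLuaString :: String -> String
escapeLuaString s = '"' : concatMap escapeLuaChar s ++ "\""
```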
The pretty-printer has also been updated to use single quotes for string literals when doing so reduces the number of escapes needed.
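A hedged sketch of that quote-selection heuristic (hypothetical code, assuming a simple count of the quote characters that would need escaping):

```haskell
-- Prefer single quotes only when the string contains fewer single
-- quotes than double quotes, i.e. when switching saves escapes.
chooseQuote :: String -> Char
chooseQuote s
  | count '\'' < count '"' = '\''
  | otherwise              = '"'
  where count c = length (filter (== c) s)
```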
The UTF-8 encoding function was copied out of my utf8-string library, so it has a history of producing correct output. It seemed unnecessary to add an extra dependency for such a small function.
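For reference, a minimal UTF-8 encoder in the spirit of the one in utf8-string (a sketch, not necessarily the exact function that was copied):

```haskell
import Data.Bits (shiftR, (.&.), (.|.))
import Data.Word (Word8)

-- Encode a single code point as its UTF-8 byte sequence.
encodeCharUtf8 :: Char -> [Word8]
encodeCharUtf8 c
  | n < 0x80    = [byte n]                                          -- 1 byte
  | n < 0x800   = [byte (0xC0 .|. shiftR n 6), cont n]              -- 2 bytes
  | n < 0x10000 = [byte (0xE0 .|. shiftR n 12), cont (shiftR n 6), cont n]
  | otherwise   = [ byte (0xF0 .|. shiftR n 18), cont (shiftR n 12)
                  , cont (shiftR n 6), cont n ]
  where
    n      = fromEnum c
    byte   = fromIntegral
    cont x = byte (0x80 .|. (x .&. 0x3F))

-- encodeCharUtf8 '\x1000' == [0xE1, 0x80, 0x80], matching the
-- \xe1\x80\x80 output in the example above.
```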
The escapes for single- and double-quoted strings are the same. Since we no longer need to use Haskell's string-reading function, we don't need two cases for them.
Could you add some unit tests for printing parsed string literals? To be more specific, we need to make sure that pretty-printed literals are the same as the literals we read.
As a test maybe we can do something like:

- take the string literals in `tests/strings`,
- use `readString` to interpret the literal,
- use `showStringLiteral` to generate the uninterpreted version again.

We can't round-trip string literals (and we couldn't before, either). There are too many ways to write the same thing, and that distinction is lost when we interpret them. If we change the code to not interpret string literals, as discussed earlier, then there will be nothing to test, either.
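To make the lossiness concrete, here is a small self-contained illustration (the toy decoder is hypothetical, not the library's code): several distinct spellings decode to the same byte, so a printer that only sees the decoded value must pick a single canonical spelling.

```haskell
import Data.Char (chr)
import Numeric (readDec, readHex)

-- A toy decoder for just enough Lua escape syntax to demonstrate
-- the point.
decode :: String -> String
decode ('\\':'x':a:b:rest) =
  chr (fst (head (readHex [a, b]))) : decode rest
decode ('\\':d:rest)
  | d `elem` "0123456789" =
      let ds    = take 3 (takeWhile (`elem` "0123456789") (d:rest))
          rest' = drop (length ds - 1) rest
      in chr (fst (head (readDec ds))) : decode rest'
decode (c:rest) = c : decode rest
decode []       = []

main :: IO ()
main = mapM_ (putStrLn . decode) ["A", "\\65", "\\x41"]
-- prints "A" three times: the decoded value alone cannot tell us
-- which spelling the source used
```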
Once it is decided how string literals should be treated, I can add some example cases of them to the tests, though.
I fixed the Applicative import.
From alex's point of view, each Unicode character is collapsed into a single byte for the purpose of advancing the lexer's state machine. The original character is preserved in the input stream.
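A sketch of what that looks like (the mapping below is an assumption about the general technique, not this PR's exact table): before feeding the DFA, every character is projected onto a single byte, with non-ASCII characters collapsed into a small set of stand-in bytes by character class.

```haskell
import Data.Char (isAlpha, isDigit, isSpace)
import Data.Word (Word8)

-- Project a character onto the byte that drives the alex state
-- machine. ASCII maps to itself; everything else maps to a stand-in
-- byte for its character class. The original Char stays in the input
-- stream, so token text is unaffected.
byteForChar :: Char -> Word8
byteForChar c
  | c <= '\x7F' = fromIntegral (fromEnum c)
  | isAlpha c   = 0xF0   -- stand-in: non-ASCII letter
  | isDigit c   = 0xF1   -- stand-in: non-ASCII digit
  | isSpace c   = 0xF2   -- stand-in: non-ASCII space
  | otherwise   = 0xF3   -- stand-in: any other non-ASCII char
```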
Now string literals are left in the syntax tree uninterpreted. I still need to build some tests for round-tripping and for testing string literal construction. String literal decoding tests are already in the tests file, however.
Things are in a pretty good state now. We have tests for constructing literals, interpreting them, and pretty printing them!
Awesome job @glguy, thanks.
I'm super busy until next week; how urgent is this? If you need a Hackage release etc., I guess I can just merge this, but if you're not in a hurry it may take a few days until I review.
Also, you keep making this even better, but are you done or do you have any other plans? Now that we have string interpretation and printing working, I guess this is done?
I think that this line of work is done. I don't have any immediate plans, but I don't think this will be my last contribution.
I'm not in a hurry to get a Hackage release. I'm using this code locally just fine, so you don't need to rush anything on my behalf.
Accomplish Unicode lexing by using the method GHC uses, where non-ASCII characters are treated as their character class, instead of working with UTF-8 encoded strings
Inline the definition of readMay to avoid an extra library dependency
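For context, the standard definition of readMay that one typically inlines (the PR's copy may differ slightly):

```haskell
import Data.Char (isSpace)

-- Parse a value, returning Nothing instead of raising an error on
-- failure; equivalent to readMay from the safe package.
readMay :: Read a => String -> Maybe a
readMay s =
  case [x | (x, rest) <- reads s, all isSpace rest] of
    [x] -> Just x
    _   -> Nothing
```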