Fix left context with UTF-8 input in bytestring wrappers

haskell / alex

A lexical analyser generator for Haskell

https://hackage.haskell.org/package/alex

BSD 3-Clause "New" or "Revised" License

Fix left context with UTF-8 input in bytestring wrappers #165

Closed abt8601 closed 3 years ago

abt8601 commented 4 years ago

Fixes #53

In the original implementation of the bytestring wrappers, alexGetByte records the last seen byte instead of the last seen character. As a result, left context does not work correctly when the preceding character is encoded as more than one byte in UTF-8. This patch fixes the issue.
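To make the difference concrete, here is a minimal sketch of a byte-popping function that only updates the previous character once a whole UTF-8 sequence has been consumed. The `Input` record, its field names, and `getByte` are hypothetical and simplified for illustration; they are not the AlexInput type or the code in this patch, and the sketch assumes well-formed UTF-8.

```haskell
import qualified Data.ByteString as B
import Data.Bits ((.&.), (.|.), shiftL)
import Data.Char (chr)
import Data.Word (Word8)

-- Hypothetical, simplified input state (not the real AlexInput).
data Input = Input
  { prevChar  :: Char          -- last fully decoded character, used for left context
  , pending   :: Int           -- code point bits gathered for the character in progress
  , remaining :: Int           -- continuation bytes still expected
  , rest      :: B.ByteString  -- unread input
  }

-- Pop one byte. prevChar changes only at a character boundary.
-- Assumes well-formed UTF-8; malformed input is not handled here.
getByte :: Input -> Maybe (Word8, Input)
getByte inp = case B.uncons (rest inp) of
  Nothing -> Nothing
  Just (b, bs)
    | remaining inp > 0 ->
        -- Continuation byte: shift in its low six bits.
        let acc  = (pending inp `shiftL` 6) .|. (fromIntegral b .&. 0x3F)
            left = remaining inp - 1
        in Just (b, if left == 0
                      then inp { prevChar = chr acc, pending = 0, remaining = 0, rest = bs }
                      else inp { pending = acc, remaining = left, rest = bs })
    | b < 0x80  -> Just (b, inp { prevChar = chr (fromIntegral b), rest = bs })
    | b < 0xE0  -> start 1 0x1F   -- lead byte of a 2-byte sequence
    | b < 0xF0  -> start 2 0x0F   -- lead byte of a 3-byte sequence
    | otherwise -> start 3 0x07   -- lead byte of a 4-byte sequence
    where
      start n mask =
        Just (b, inp { pending = fromIntegral b .&. mask, remaining = n, rest = bs })
```

With the input bytes `C3 A9` (the encoding of `é`), a version that stores the last byte would report `©` (byte `A9` read as a character) as the previous character, whereas this sketch reports `é` once both bytes have been popped.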

Since I have to change the structure of the AlexInput type, this is a breaking change.

Ericson2314 commented 3 years ago

This does make me wonder, is alexGetByte even the right layer of abstraction? Might it be better to pop a whole character?

With this change, both the String (native pop char) and ByteString (native pop byte) wrappers are complex, and rightfully so. With alexGetChar, the String case could become trivial, and the ByteString case basically becomes no worse, since, as you demonstrate, we already need to track which byte of the character we're at anyways.
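As a rough illustration of why popping bytes from a String needs extra machinery, the sketch below encodes each character to UTF-8 and buffers the bytes that have not been handed out yet. The names `StrInput`, `encodeChar`, and `getByteS` are hypothetical and the code is a simplification, not the wrapper shipped with Alex.

```haskell
import Data.Bits ((.&.), (.|.), shiftR)
import Data.Char (ord)
import Data.Word (Word8)

-- Hypothetical state: previous character, bytes of the current character
-- not yet returned, and the remaining text.
type StrInput = (Char, [Word8], String)

-- Encode one character as UTF-8.
encodeChar :: Char -> [Word8]
encodeChar c
  | cp < 0x80    = [fromIntegral cp]
  | cp < 0x800   = [0xC0 .|. hi 6, cont 0]
  | cp < 0x10000 = [0xE0 .|. hi 12, cont 6, cont 0]
  | otherwise    = [0xF0 .|. hi 18, cont 12, cont 6, cont 0]
  where
    cp     = ord c
    hi s   = fromIntegral (cp `shiftR` s)
    cont s = 0x80 .|. (fromIntegral (cp `shiftR` s) .&. 0x3F)

-- Pop one byte: drain buffered bytes first, then encode the next character.
getByteS :: StrInput -> Maybe (Word8, StrInput)
getByteS (c, b : bs, s) = Just (b, (c, bs, s))
getByteS (_, [], c : s) = case encodeChar c of
  b : bs -> Just (b, (c, bs, s))
  []     -> Nothing  -- unreachable: encodeChar never returns an empty list
getByteS (_, [], [])    = Nothing
```

A character-popping interface on String would reduce to pattern matching on the list, which is the simplification suggested above.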

What do you think?

abt8601 commented 3 years ago

I think having alexGetChar makes things simpler even on ByteString, since we wouldn't even need to track in AlexInput which byte of the current character we're at.
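A minimal sketch of what a character-popping function over a ByteString might look like is given below. `CharInput` and `popChar` are hypothetical names, not an existing Alex API, and the sketch assumes well-formed UTF-8 apart from rejecting a truncated final sequence. Note that the state is just the previous character and the unread bytes.

```haskell
import qualified Data.ByteString as B
import Data.Bits ((.&.), (.|.), shiftL)
import Data.Char (chr)

-- Hypothetical state: previous character and unread bytes. No byte offset
-- within the current character is needed.
type CharInput = (Char, B.ByteString)

-- Pop one whole UTF-8 encoded character.
popChar :: CharInput -> Maybe (Char, CharInput)
popChar (_, bs) = do
  (b0, rest0) <- B.uncons bs
  -- Number of continuation bytes and payload mask, chosen from the lead byte.
  let (n, mask)
        | b0 < 0x80 = (0, 0x7F)
        | b0 < 0xE0 = (1, 0x1F)
        | b0 < 0xF0 = (2, 0x0F)
        | otherwise = (3, 0x07)
      (cont, rest1) = B.splitAt n rest0
  if B.length cont /= n
    then Nothing  -- input ends in the middle of a character
    else
      let step acc b = (acc `shiftL` 6) .|. (fromIntegral b .&. 0x3F)
          c = chr (B.foldl' step (fromIntegral (b0 .&. mask)) cont)
      in Just (c, (c, rest1))
```

The trade-off is that decoding now happens in the wrapper rather than in the generated automaton.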

I guess the reason for having alexGetByte is performance, since Alex internally uses a UTF-8 encoded byte sequence, as stated in the documentation.

Ericson2314 commented 3 years ago

I'll have to ponder more what "internally uses UTF-8" means. Maybe @alanz or @jyp remember something from writing 892688ff71bcc4b74313f7348ee5dace73dd8506 a decade ago? :D

alanz commented 3 years ago

It's a long time ago :)

IIRC, getByte accumulates input one Word8 at a time, and only cranks the state machine when it hits a character boundary. So basically it does the `[Word8] -> Char` conversion. And because the commit talks about the NFA blowing up in size and needing to be minimized, I think this may be pushed right into the generated DFA too.

I am not sure if there was proper unicode support in Char at the time.

Either way, GHC parses from a StringBuffer, so I think it needs the unicode processing for a list of bytes.

My brain dump.

Ericson2314 commented 3 years ago

Thanks!

> I am not sure if there was proper unicode support in Char at the time.

Ah, wonderful, this is just the thing I was hoping to hear. Yes, I am getting more sure we should just be outlining the UTF-8 state machine at this point. We could even support other encodings that way.

(It's wonderful how regular languages serially compose. I only wish someone would do the research so we can do the same with context-free ones!)