Closed abt8601 closed 3 years ago
This does make me wonder, is `alexGetByte` even the right layer of abstraction? Might it be better to pop a whole character?
With this change, both the `String` (native pop char) and `ByteString` (native pop byte) cases are complex, and rightfully so. With `alexGetChar`, the `String` case could become trivial, and the `ByteString` case basically becomes no worse, since, as you demonstrate, we already need to track which byte of the character we're at anyway.
What do you think?
I think having `alexGetChar` makes things simpler even on `ByteString`, since we wouldn't even need to track which byte of the current character we're at in `AlexInput`.
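To make the "trivial `String` case" concrete, here is a minimal sketch of what a character-level interface could look like. The `AlexInput` shape below (previous character paired with the remaining input) is an assumption for illustration, not the wrapper code shipped with Alex or the code in this PR.

```haskell
-- Hypothetical input type: (previous character, remaining input).
type AlexInput = (Char, String)

-- Pop one whole character; the popped character becomes the new
-- "previous character", which is what a left context needs.
alexGetChar :: AlexInput -> Maybe (Char, AlexInput)
alexGetChar (_, [])     = Nothing
alexGetChar (_, c : cs) = Just (c, (c, cs))

-- The previous character is simply the first component.
alexInputPrevChar :: AlexInput -> Char
alexInputPrevChar (prev, _) = prev
```

With this shape there is no per-byte bookkeeping at all; any encoding work would have to happen elsewhere.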
I guess the reason for having `alexGetByte` is performance, since Alex internally uses a UTF-8 encoded byte sequence, as stated in the documentation.
I'll have to ponder more what "internally uses UTF-8" means. Maybe @alanz or @jyp remember something from writing 892688ff71bcc4b74313f7348ee5dace73dd8506 a decade ago? :D
It's a long time ago :)
IIRC, `getByte` accumulates input one `Word8` at a time, and only cranks the state machine when it hits a character boundary. So basically it does the `[Word8] -> Char` conversion. And because the commit talks about the NFA blowing up in size and needing to be minimized, I think this may be pushed right into the generated DFA too.
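For readers unfamiliar with the idea, the byte-at-a-time accumulation described above can be sketched as a small UTF-8 decoder that only yields a `Char` at a character boundary. This is an illustration of the technique under my own assumptions (the names `Utf8State`, `step`, and `decodeAll` are hypothetical), not the code from that commit; it also does not reject overlong encodings or surrogates.

```haskell
import Data.Bits (shiftL, (.&.), (.|.))
import Data.Char (chr)
import Data.Word (Word8)

-- Decoder state: idle, or partway through a multi-byte sequence with the
-- bits gathered so far and the number of continuation bytes still expected.
data Utf8State = Idle | Pending Int Int
  deriving (Eq, Show)

-- Feed one byte. The result carries the new state and, only at a
-- character boundary, the decoded Char. Nothing signals malformed input.
step :: Utf8State -> Word8 -> Maybe (Utf8State, Maybe Char)
step Idle b
  | b < 0x80           = Just (Idle, Just (chr (fromIntegral b)))
  | b .&. 0xE0 == 0xC0 = Just (Pending (fromIntegral (b .&. 0x1F)) 1, Nothing)
  | b .&. 0xF0 == 0xE0 = Just (Pending (fromIntegral (b .&. 0x0F)) 2, Nothing)
  | b .&. 0xF8 == 0xF0 = Just (Pending (fromIntegral (b .&. 0x07)) 3, Nothing)
  | otherwise          = Nothing
step (Pending acc n) b
  | b .&. 0xC0 /= 0x80 = Nothing                       -- not a continuation byte
  | n > 1              = Just (Pending acc' (n - 1), Nothing)
  | acc' > 0x10FFFF    = Nothing                       -- outside the Char range
  | otherwise          = Just (Idle, Just (chr acc'))
  where
    acc' = (acc `shiftL` 6) .|. fromIntegral (b .&. 0x3F)

-- Run the decoder over a whole byte list; input ending mid-character fails.
decodeAll :: [Word8] -> Maybe String
decodeAll = go Idle
  where
    go Idle []      = Just []
    go _    []      = Nothing
    go st (b : bs)  = do
      (st', out) <- step st b
      rest <- go st' bs
      pure (maybe rest (: rest) out)
```

For example, `decodeAll [0xCE, 0xBB, 0x78]` yields `Just "\955x"` (the bytes of "λx"), while a truncated sequence such as `[0xCE]` yields `Nothing`.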
I am not sure if there was proper Unicode support in `Char` at the time.
Either way, GHC parses from a `StringBuffer`, so I think it needs the Unicode processing for a list of bytes.
My brain dump.
Thanks!
> I am not sure if there was proper unicode support in `Char` at the time.
Ah, wonderful, this is just the thing I was hoping to hear. Yes, I am getting more sure we should just be outlining the UTF-8 state machine at this point. We could even support other encodings that way.
(It's wonderful how regular languages serially compose, I only wish someone would do the research so we can do the same with context-free ones!)
Fixes #53
In the original implementation of the bytestring wrappers, `alexGetByte` maintains the last seen byte instead of the last seen character. As a result, left contexts stop working properly. This patch fixes the issue.

Since I have to change the structure of the `AlexInput` type, this is a breaking change.
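As a rough illustration of the idea (not the actual patch), the sketch below keeps the last fully decoded character in the input state and only updates it once a UTF-8 sequence is complete. The record and its field names are hypothetical, and the decoder assumes well-formed UTF-8 input.

```haskell
import qualified Data.ByteString as B
import Data.Bits (shiftL, (.&.), (.|.))
import Data.Char (chr)
import Data.Word (Word8)

-- Hypothetical input type; field names are illustrative only.
data AlexInput = AlexInput
  { prevChar  :: Char          -- last fully decoded character
  , partial   :: Int           -- bits of the character being decoded
  , remaining :: Int           -- continuation bytes still expected
  , rest      :: B.ByteString  -- unread input
  }

-- Pop one byte, updating prevChar only when a character is complete.
alexGetByte :: AlexInput -> Maybe (Word8, AlexInput)
alexGetByte inp = do
  (b, bs) <- B.uncons (rest inp)
  let w = fromIntegral b :: Int
      -- Out-of-range values (possible only on malformed input) map to U+FFFD.
      toChar acc = if acc <= 0x10FFFF then chr acc else '\xFFFD'
      advance acc 0 = inp { prevChar = toChar acc, partial = 0, remaining = 0, rest = bs }
      advance acc n = inp { partial = acc, remaining = n, rest = bs }
      next
        | remaining inp > 0 =
            advance ((partial inp `shiftL` 6) .|. (w .&. 0x3F)) (remaining inp - 1)
        | w < 0x80  = advance w 0               -- ASCII: complete immediately
        | w < 0xE0  = advance (w .&. 0x1F) 1    -- lead byte of a 2-byte sequence
        | w < 0xF0  = advance (w .&. 0x0F) 2    -- lead byte of a 3-byte sequence
        | otherwise = advance (w .&. 0x07) 3    -- lead byte of a 4-byte sequence
  pure (b, next)

-- Left contexts consult the previous character, not the previous byte.
alexInputPrevChar :: AlexInput -> Char
alexInputPrevChar = prevChar
```

While the lexer is in the middle of a multi-byte character, `alexInputPrevChar` keeps returning the character before it, which is the behaviour a left context relies on.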