haskell / alex

A lexical analyser generator for Haskell
https://hackage.haskell.org/package/alex
BSD 3-Clause "New" or "Revised" License

UTF-8 and text 2.0 #211

Status: Open · pnotequalnp opened this issue 2 years ago

pnotequalnp commented 2 years ago

As the documentation says, Alex works over a stream of UTF-8 encoded bytes, retrieved one at a time by alexGetByte.

Lexer specifications are written in terms of Unicode characters, but Alex works internally on a UTF-8 encoded byte sequence.

Depending on how you use Alex, the fact that Alex uses UTF-8 encoding internally may or may not affect you. If you use one of the wrappers (below) that takes input from a Haskell String, then the UTF-8 encoding is handled automatically. However, if you take input from a ByteString, then it is your responsibility to ensure that the input is properly UTF-8 encoded.
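For concreteness, a minimal sketch of the byte-at-a-time contract the docs describe, using a strict ByteString as the input state (the names here are illustrative, not the exact code Alex generates):

```haskell
import qualified Data.ByteString as BS
import Data.Word (Word8)

-- Alex pulls the input one byte at a time via alexGetByte, so a strict
-- ByteString can serve as the input state directly: taking the next
-- byte is just uncons. (Sketch only; the real ByteString wrappers also
-- track position, the previous character, and so on.)
type ByteInput = BS.ByteString

byteGetByte :: ByteInput -> Maybe (Word8, ByteInput)
byteGetByte = BS.uncons
```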

From an external viewpoint as a consumer (I am not familiar with how Alex is implemented), this seems like a strange design decision to me. If my source is already in a UTF-8 representation, like Data.Text(.Lazy).Text with the new text 2.0 release (or a String of already-decoded Unicode characters), it seems that in order to run Alex on it, I (or a wrapper) would have to decode the content into individual characters, just so that they can immediately be re-encoded back into UTF-8 bytes for Alex's internals.
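To illustrate the round trip being described: the Char-based wrappers have to turn each Char back into its UTF-8 bytes before the automaton sees it. A self-contained sketch of that encoding step (the generated wrappers contain an equivalent `utf8Encode`; this version is illustrative):

```haskell
import Data.Bits (shiftR, (.&.), (.|.))
import Data.Char (ord)
import Data.Word (Word8)

-- Standard UTF-8 encoding of a single code point: 1 to 4 bytes
-- depending on magnitude. This is the per-Char work a String/Text
-- wrapper must do before Alex's byte-level automaton runs.
utf8Encode :: Char -> [Word8]
utf8Encode c = map fromIntegral (go (ord c))
  where
    go n
      | n < 0x80    = [n]
      | n < 0x800   = [0xC0 .|. shiftR n 6, cont n]
      | n < 0x10000 = [0xE0 .|. shiftR n 12, cont (shiftR n 6), cont n]
      | otherwise   = [ 0xF0 .|. shiftR n 18, cont (shiftR n 12)
                      , cont (shiftR n 6), cont n ]
    cont n = 0x80 .|. (n .&. 0x3F)
```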

So I'm wondering if there's a reason that Alex needs to work over bytes rather than Chars, or if that was perhaps done to support lexing ByteStrings directly, without having to first unpack the data into Strings or decode it into UTF-16 Texts (with text < 2.0), which would be unnecessary overhead either way.

If Alex has to work over bytes for internal reasons, I think it would be a good idea to implement new wrappers for the UTF-8 Text types, since I'd imagine that would be a pretty common use case.
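What such a wrapper could look like, as a rough sketch: since text >= 2.0 stores Text as a UTF-8 byte buffer with an offset and length, the bytes can be handed to Alex directly with no decode/re-encode round trip. `TextInput` and `textGetByte` are hypothetical names, not part of Alex, and this relies on text >= 2.0's internal modules (where Data.Text.Array is byte-indexed):

```haskell
import qualified Data.Text as T
import qualified Data.Text.Array as TA
import qualified Data.Text.Internal as TI
import Data.Word (Word8)

-- Hypothetical wrapper state feeding Alex straight from text-2.0's
-- internal UTF-8 buffer: buffer, current byte offset, bytes remaining.
data TextInput = TextInput !TA.Array !Int !Int

textInput :: T.Text -> TextInput
textInput (TI.Text arr off len) = TextInput arr off len

textGetByte :: TextInput -> Maybe (Word8, TextInput)
textGetByte (TextInput arr off len)
  | len <= 0  = Nothing
  | otherwise = Just (TA.unsafeIndex arr off, TextInput arr (off + 1) (len - 1))
```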

Otherwise, would it be possible to expose an alexGetChar-based interface that simply skips the UTF-8 decoding portion of Alex's internal logic, which would be more ergonomic and efficient for UTF-8 based types?
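What that interface could look like from the outside, as a sketch (`charGetChar` is a hypothetical name; Alex does not currently expose a Char-based entry point):

```haskell
import qualified Data.Text as T

-- Hypothetical Char-level input step: the lexer state is just the
-- remaining Text, and each step yields the next Char directly, with
-- no intermediate UTF-8 byte stream.
type CharInput = T.Text

charGetChar :: CharInput -> Maybe (Char, CharInput)
charGetChar = T.uncons
```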

andreasabel commented 2 years ago

@pnotequalnp : I haven't looked in detail, but I think Alex generates arrays indexed by bytes (256 possible values) to make for swift automaton transitions. That wouldn't work with Unicode characters, given the sheer size such arrays would have to be.

dpwiz commented 2 years ago

Since text-2 uses UTF-8 byte arrays, it should be possible to produce a byte-level automaton and even zero-copy token slices.
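A sketch of what zero-copy slicing could look like with text >= 2.0's internal representation (`tokenSlice` is hypothetical; offsets are in bytes, and the caller must ensure they fall on code-point boundaries):

```haskell
import qualified Data.Text as T
import qualified Data.Text.Internal as TI

-- text >= 2.0 stores Text as (UTF-8 byte array, offset, length), so a
-- token covering bytes [start, start + len) of the input can be
-- returned as a Text that shares the original buffer instead of
-- copying it. Unsafe if start or start + len splits a code point.
tokenSlice :: T.Text -> Int -> Int -> T.Text
tokenSlice (TI.Text arr off _) start len = TI.Text arr (off + start) len
```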