Open · pnotequalnp opened 2 years ago
@pnotequalnp: I haven't looked in detail, but I think Alex generates 256-entry arrays indexed by bytes to make automaton transitions fast. That wouldn't work with arrays indexed by Unicode code points, for the sheer size of such arrays.

Since text-2 uses UTF-8 byte arrays, it should be possible to produce a byte-level automaton and even zero-copy token slices.
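A minimal sketch of what that could look like, assuming the text >= 2.0 internal representation (a `Text` is a UTF-8 byte `Array` plus an offset and length in bytes); the `AlexInput` shape, `fromText`, and `tokenText` are illustrative names, not an existing Alex wrapper:

```haskell
import           Data.Word          (Word8)
import qualified Data.Text.Array    as A
import           Data.Text.Internal (Text (..))

-- Input state: the shared byte array and a cursor into it.
data AlexInput = AlexInput
  { inpArr :: !A.Array  -- underlying UTF-8 bytes (never copied)
  , inpOff :: !Int      -- current byte offset
  , inpEnd :: !Int      -- one past the last valid byte
  }

fromText :: Text -> AlexInput
fromText (Text arr off len) = AlexInput arr off (off + len)

-- The interface Alex's generated automaton consumes: one raw byte
-- at a time, with no UTF-8 decoding anywhere.
alexGetByte :: AlexInput -> Maybe (Word8, AlexInput)
alexGetByte (AlexInput arr off end)
  | off >= end = Nothing
  | otherwise  = Just (A.unsafeIndex arr off, AlexInput arr (off + 1) end)

-- A zero-copy token slice between the start state and the state
-- where the automaton accepted; valid because token boundaries
-- fall on code-point boundaries.
tokenText :: AlexInput -> AlexInput -> Text
tokenText (AlexInput arr start _) (AlexInput _ cur _) =
  Text arr start (cur - start)
```

Since the automaton only ever asks for the next byte, nothing is decoded on the hot path, and token slices reuse the original buffer.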
As the documentation says, Alex works over a stream of UTF-8 encoded bytes, retrieved one at a time by `alexGetByte`. From an external viewpoint as a consumer (I am not familiar with how Alex is implemented), this seems like a strange design decision to me. If my source is already Unicode text like `String`, or even already UTF-8 like `Data.Text(.Lazy).Text` (with the new text 2.0 release), it seems that in order to run Alex on it, I (or a wrapper) would have to write logic to decode the content into `Char`s and re-encode each one into UTF-8 bytes, just so that Alex can consume it byte by byte internally.
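For concreteness, here is a minimal sketch of that roundtrip for strict `Text`, assuming the `alexGetByte :: AlexInput -> Maybe (Word8, AlexInput)` shape described in the Alex documentation; the `AlexInput` record and the `utf8Encode` helper are hypothetical names, not part of Alex's shipped wrappers:

```haskell
import           Data.Bits ((.&.), shiftR)
import           Data.Char (ord)
import           Data.Word (Word8)
import qualified Data.Text as T

data AlexInput = AlexInput
  { pending :: [Word8]  -- UTF-8 bytes of the current Char, re-encoded
  , rest    :: T.Text   -- input after the current Char
  }

-- Re-encode a single Char to its UTF-8 byte sequence.
utf8Encode :: Char -> [Word8]
utf8Encode c = case ord c of
  n | n < 0x80    -> [fI n]
    | n < 0x800   -> [fI (0xC0 + shiftR n 6), cont n]
    | n < 0x10000 -> [fI (0xE0 + shiftR n 12), cont (shiftR n 6), cont n]
    | otherwise   -> [ fI (0xF0 + shiftR n 18), cont (shiftR n 12)
                     , cont (shiftR n 6), cont n ]
  where
    fI     = fromIntegral
    cont m = fI (0x80 + (m .&. 0x3F))

-- Decode a Char out of the Text, then immediately re-encode it so
-- Alex can consume it one byte at a time.
alexGetByte :: AlexInput -> Maybe (Word8, AlexInput)
alexGetByte (AlexInput (b:bs) t) = Just (b, AlexInput bs t)
alexGetByte (AlexInput []     t) = case T.uncons t of
  Nothing      -> Nothing
  Just (c, t') -> case utf8Encode c of
    (b:bs) -> Just (b, AlexInput bs t')
    []     -> Nothing  -- unreachable: utf8Encode is never empty
```

With text >= 2.0 this is doubly wasteful: `T.uncons` decodes a `Char` out of bytes that were UTF-8 to begin with, and `utf8Encode` then produces exactly those bytes again.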
So I'm wondering if there's a reason that Alex needs to work over bytes and not `Char`s, or if that was perhaps done to support lexing `ByteString`s directly, without having to first unpack the UTF-8 data into `String`s or decode it into UTF-16 `Text`s (with text < 2.0), which would be unnecessary overhead either way.

If Alex has to work over bytes for internal reasons, I think it would be a good idea to implement new wrappers for the UTF-8 `Text` types, since I'd imagine that would be a pretty common use case. Otherwise, would it be possible to expose an `alexGetChar`-based interface that simply skips the UTF-8 decoding portion of Alex's internal logic? That would be more ergonomic and efficient for UTF-8-based types.
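For reference, the interface shape I have in mind is trivial for strict `Text` (hypothetical, since current Alex only consumes bytes as far as I can tell):

```haskell
import qualified Data.Text as T

-- A hypothetical Char-level hook: for Text this is just uncons,
-- with no byte-level roundtrip at all.
type AlexInput = T.Text

alexGetChar :: AlexInput -> Maybe (Char, AlexInput)
alexGetChar = T.uncons
```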