timtadh / lexmachine

Lex machinery for Go.

Support unicode in regexp #36

Closed hlindberg closed 4 years ago

hlindberg commented 4 years ago

I would like to be able to define an ID that includes international characters, symbols, and emoji. I did not find any mention of how to define anything like that with the regular expressions. In Go's regexp package this would mean using a Unicode character class such as \p{L}.

timtadh commented 4 years ago

Lexmachine operates on bytes, not on code points. However, that doesn't stop you from lexing the byte representation (say UTF-8) of the Unicode string you want to lex. It is very unlikely that I will add support for operating on code points or runes instead of on bytes.

hlindberg commented 4 years ago

Thanks for the quick answer. When you say I can lex the byte representation of UTF-8, I assume you mean that the lexer can lex such content but that I cannot have regular expressions with multibyte characters in them. Is that correct, or do you mean I can somehow express that as a sequence of bytes in the regexp? If the latter, do I just put the character in the regexp directly (for example 😀)?

timtadh commented 4 years ago

Sorry for the slow reply, I lost track of this. You can use a byte encoding in both the regular expressions and the text content. However, the encodings must match: e.g. if you are lexing UTF-8 you have to use UTF-8 expressions, and if you are lexing UTF-16 you have to use UTF-16 expressions. I do not support native Unicode lexing. OK, quick example:

lexer.Add([]byte(`☃`), token("UNICODE-SNOWMAN"))
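
For a little more context, here is a sketch of how that pattern could plug into a full lexer. This is not from the issue itself: the token helper and tokenIds map are assumptions modeled on the style of lexmachine's README.

package main

import (
    "fmt"

    lex "github.com/timtadh/lexmachine"
    "github.com/timtadh/lexmachine/machines"
)

// Sketch only: the token helper and tokenIds map are modeled on
// lexmachine's README, not taken from this issue.
var tokenIds = map[string]int{"UNICODE-SNOWMAN": 0}

// token returns an Action that emits a token of the named type.
func token(name string) lex.Action {
    return func(s *lex.Scanner, m *machines.Match) (interface{}, error) {
        return s.Token(tokenIds[name], string(m.Bytes), m), nil
    }
}

func main() {
    lexer := lex.NewLexer()
    // The pattern is simply the UTF-8 byte sequence for the code point.
    lexer.Add([]byte(`☃`), token("UNICODE-SNOWMAN"))
    if err := lexer.Compile(); err != nil {
        panic(err)
    }
    scanner, err := lexer.Scanner([]byte("☃☃☃"))
    if err != nil {
        panic(err)
    }
    for tok, err, eos := scanner.Next(); !eos; tok, err, eos = scanner.Next() {
        if err != nil {
            panic(err)
        }
        t := tok.(*lex.Token)
        fmt.Printf("UNICODE-SNOWMAN %q\n", t.Lexeme)
    }
}
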
hlindberg commented 4 years ago

OK, got it, thanks - so individual multibyte chars would work, but since they lex as a sequence of bytes it would not be possible to express ranges.

timtadh commented 4 years ago

That is correct. You would have to write a script to convert the code point range into an alternation (bytes-for-code-point-1|bytes-for-code-point-2|...|bytes-for-code-point-n). Once compiled to a minimized DFA it will be just as efficient, from a lexing standpoint, as if you had written it using the regex range operator. It is much less convenient to write, however, and compiling to the DFA will probably take longer.
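
As an illustration, a small helper along those lines might look like the following. It is just a sketch: rangeToAlternation is a hypothetical name, not part of lexmachine, and it assumes every code point in the range is outside the ASCII range so that none of its UTF-8 bytes collide with regex metacharacters.

import "strings"

// rangeToAlternation expands an inclusive code point range into a
// byte-level alternation of UTF-8 encoded literals, e.g. (☀|☁|☂|☃).
// Hypothetical helper, not part of lexmachine.
func rangeToAlternation(lo, hi rune) []byte {
    parts := make([]string, 0, hi-lo+1)
    for r := lo; r <= hi; r++ {
        parts = append(parts, string(r))
    }
    return []byte("(" + strings.Join(parts, "|") + ")")
}

// Usage (hypothetical token name):
//   lexer.Add(rangeToAlternation('☀', '☃'), token("WEATHER-SYMBOL"))
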

One advantage (besides implementation simplicity) of sticking to bytes is that it can be reasonably fast when doing DFA-based matching without code generation. I think if we did Unicode in lexmachine we would need code generation, as you can no longer represent the DFA as a [][256]int and would have to use a []map[rune]int, which is much slower and uses a lot more memory -- which means code generation is basically a must.
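
To make the representation point concrete, here is a rough sketch of the two transition-table shapes being compared. The type and method names are hypothetical, not lexmachine internals.

// Hypothetical types for illustration; not lexmachine internals.

// Byte-oriented DFA: one dense row of 256 next-states per state.
// A step is two array indexes and the table is contiguous in memory.
type byteDFA struct {
    trans [][256]int
}

func (d *byteDFA) step(state int, b byte) int {
    return d.trans[state][b]
}

// Rune-oriented DFA: one map per state keyed by code point.
// Every step pays for a map lookup and the maps use far more memory,
// which is why code generation would become necessary.
type runeDFA struct {
    trans []map[rune]int
}

func (d *runeDFA) step(state int, r rune) int {
    return d.trans[state][r]
}
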

hlindberg commented 4 years ago

@timtadh Thanks for taking the time to explain - I had figured out that I could generate ranges like that, but I had not looked closely at the implementation of lexmachine to see how hard it would be to add UTF-8 support. I now understand it is a fundamental change.

timtadh commented 4 years ago

Ok sounds good. Let me know if you have any other feedback. I am going to close the issue for now.