Make DFA match utf8 rather than char

osa1 / lexgen

A fully-featured lexer generator, implemented as a proc macro

MIT License

63 stars 7 forks source link

Make DFA match utf8 rather than char #59

Open seanyoung opened 1 year ago

seanyoung commented 1 year ago

This would improve performance, as no utf8 decoding is necessary.

This is what re2 does too.

osa1 commented 1 year ago

A simple way to implement this would be to replace chars in the DFA (and NFA, code generator) with u8. When a char is encoded as multiple bytes introduce new states to match the whole encoding. E.g. if I have this transition

S1 --('ö')--> S2

(state S1 switches to S2 on character 'ö')

This becomes

S1 --(0xC3)--> Sn --(0xB6)--> S2

(state S1 switches to a new intermediate step Sn on byte 0xC3, which then switches to S2 on byte 0xB6)

A problem with this is that the number of states will increase proportional to a matched string/char's UTF-8 encoding, rather than the number of chars. Maybe that's not too bad though, and I don't know if it's possible to avoid new states.

seanyoung commented 1 year ago

How would we implement this for character classes, e.g. $$XID_Start? If we convert the whole table to an utf8 DFA, it will be huge. However, will it be larger than the char table we have now?

Also, a character class can occur multiple times in a lexer, and we probably don't want to duplicate this for each occurrence (or do we).

osa1 commented 1 year ago

How would we implement this for character classes, e.g. $$XID_Start? If we convert the whole table to an utf8 DFA, it will be huge. However, will it be larger than the char table we have now?

Good point, I think regexes like $$XID_Start will probably make things unacceptably slow in compile time and cause huge code generation.

Also, a character class can occur multiple times in a lexer, and we probably don't want to duplicate this for each occurrence (or do we).

Currently we can't really reuse DFAs (or states) as each DFA/state will have a different next state or semantic action (i.e. continuation).

For example, we have "match A then do B" and "match A then do C", the "match A" machines cannot be reused as each machine will do something different after a successful match.

I guess we should keep the DFA and NFA the same and find another way. Do you know how re2 doing this?

Btw, for runtime performance, I think this change will probably be a micro-optimization compared to DFA minimization (#38). We should do that first. It will also reduce code size.