Use separate Tokenizer/Lexer?

Hi,

I ran into a few issues when trying to parse the IFC EXPRESS schema.

One is that an identifier like TrueNorth gets parsed (e.g. in an expression rule) as a boolean literal, because the literal rule has higher priority than the simple_id rule. Swapping the order of these rules obviously doesn't help, because then True is no longer parsed as a boolean literal. I think this is just a symptom of further issues that may arise from ambiguous parsing, and I see two options to solve it at its core:

Either all the basic parsing rules check that their input doesn't match another basic parsing rule (e.g. simple_id checks that it isn't a literal or anything else that could also be parsed as a simple_id), or a separate lexer/tokenizer weeds these cases out up front.

I personally prefer using a lexer: it makes it easier to restrict the problem space and to abstract the parser on top of it. I have also run into weird parsing ambiguities in the past when not using a separate lexer (in much simpler languages). I think the BNF grammars of STEP and EXPRESS should allow tokenizing/lexing the whole input without having to think about modal lexing etc., but I'm not sure yet.
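To illustrate the lexer option, here is a minimal sketch of the idea, not code from ruststep or from any existing lexer. The `Token` type and the `lex_words` function are hypothetical names made up for this example. It only handles words (everything else is skipped), and it assumes that keywords are matched case-insensitively. The point is that the lexer first consumes the longest possible word (maximal munch) and only then decides whether that whole word is a literal keyword, so TrueNorth can never be split into True and North:

```rust
/// Hypothetical token type for this sketch (not taken from ruststep).
#[derive(Debug, Clone, PartialEq)]
pub enum Token {
    /// Logical literal TRUE or FALSE (UNKNOWN is omitted for brevity).
    Logical(bool),
    /// Any other word, i.e. what the grammar calls simple_id.
    SimpleId(String),
}

/// Collects the words in `input` using maximal munch and classifies each
/// complete word. Characters that cannot start a word are skipped, so this
/// is only a sketch of the word-handling part of a lexer.
pub fn lex_words(input: &str) -> Vec<Token> {
    let mut tokens = Vec::new();
    let mut chars = input.char_indices().peekable();

    while let Some(&(start, c)) = chars.peek() {
        if !c.is_ascii_alphabetic() {
            // Not the start of a word: skip it (a real lexer would emit
            // tokens for operators, numbers, strings, ... here).
            chars.next();
            continue;
        }

        // Maximal munch: consume letters, digits and underscores for as
        // long as possible before looking at what the word actually is.
        let mut end = start;
        while let Some(&(i, c)) = chars.peek() {
            if c.is_ascii_alphanumeric() || c == '_' {
                end = i + c.len_utf8();
                chars.next();
            } else {
                break;
            }
        }
        let word = &input[start..end];

        // Classify the complete word. Assumption: keywords are matched
        // case-insensitively.
        let token = if word.eq_ignore_ascii_case("true") {
            Token::Logical(true)
        } else if word.eq_ignore_ascii_case("false") {
            Token::Logical(false)
        } else {
            Token::SimpleId(word.to_string())
        };
        tokens.push(token);
    }

    tokens
}
```

With this, `lex_words("TrueNorth")` yields a single `SimpleId`, while `lex_words("True")` yields `Logical(true)`. The parser on top then only matches on token kinds and no longer depends on the order of the literal and simple_id rules.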

I have actually started writing a parser/lexer for the EXPRESS language. I'm not sure yet whether I will take this project much further, though (I guess I underestimated the scope of supporting STEP completely). My original motivation was better error recovery/messages (by using something like chumsky as the parser combinator library).

I think the lexer is almost complete, so you may be interested in this: https://github.com/Philipp-M/express-parser/blob/6464b29e5eb14d70b0445b84567ed58fdfd144b6/src/lexer.rs

ricosjp / ruststep

Use separate Tokenizer/Lexer? #241