I ran into a few issues when trying to parse the IFC EXPRESS schema.
One is that something like `TrueNorth` gets parsed (e.g. in an `expression` rule) as a boolean literal, because the `literal` rule has higher priority than the `simple_id` rule. Swapping the order of these rules doesn't help, though, because then something like `True` is no longer parsed as a boolean literal.
So I think there are two ways to solve the core issue (I think this is just a sign of further problems that may arise from ambiguous parsing):
Either every basic parsing rule checks that it doesn't match another basic rule (e.g. `simple_id` checks that it isn't a literal or anything else that could also look like a `simple_id`), or a separate lexer/tokenizer weeds these cases out beforehand.
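To illustrate the first option, here is a minimal hand-written sketch (not the crate's actual code, and not tied to any parser combinator library). The helper names `keyword`, `bool_literal` and `is_id_continue` are made up for this example, and the reserved-word list is deliberately incomplete. The idea is that a keyword only matches when it is not followed by another identifier character, and `simple_id` rejects reserved words:

```rust
/// Characters that may continue a simple_id (letter, digit, underscore).
fn is_id_continue(c: char) -> bool {
    c.is_ascii_alphanumeric() || c == '_'
}

/// Hypothetical helper: match `word` case-insensitively at the start of
/// `input`, but only if no identifier character follows it.
/// Returns the remaining input on success.
fn keyword<'a>(input: &'a str, word: &str) -> Option<&'a str> {
    let head = input.get(..word.len())?;
    if !head.eq_ignore_ascii_case(word) {
        return None;
    }
    let rest = &input[word.len()..];
    match rest.chars().next() {
        // e.g. "TrueNorth": the keyword is only a prefix of a longer identifier
        Some(c) if is_id_continue(c) => None,
        _ => Some(rest),
    }
}

/// Boolean literal rule built on the boundary-checked keyword.
fn bool_literal(input: &str) -> Option<(bool, &str)> {
    if let Some(rest) = keyword(input, "TRUE") {
        Some((true, rest))
    } else if let Some(rest) = keyword(input, "FALSE") {
        Some((false, rest))
    } else {
        None
    }
}

/// simple_id rule that refuses to match a reserved word
/// (only TRUE/FALSE here; a real list would be much longer).
fn simple_id(input: &str) -> Option<(&str, &str)> {
    let first = input.chars().next()?;
    if !first.is_ascii_alphabetic() {
        return None;
    }
    let end = input
        .find(|c: char| !is_id_continue(c))
        .unwrap_or(input.len());
    let (id, rest) = input.split_at(end);
    if ["TRUE", "FALSE"].iter().any(|k| id.eq_ignore_ascii_case(k)) {
        return None;
    }
    Some((id, rest))
}
```

With this, `bool_literal("TrueNorth")` fails while `simple_id("TrueNorth")` succeeds, regardless of the order in which the two rules are tried. The downside is that every basic rule needs to know about every other one.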
I personally prefer using a lexer: it makes it easier to restrict the problem space and to build the parser as an abstraction on top of it. I have also run into weird parsing ambiguities in the past when not using a separate lexer (in much simpler languages). I think the BNF grammars of STEP and EXPRESS should allow tokenizing the whole input without having to think about modal lexing etc., but I'm not sure yet.
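As a rough sketch of the lexer approach (this is not the code from my repository; the `Token` enum, `classify_word` and `lex` are names invented for this example, and only a tiny subset of tokens is handled): the lexer first consumes the longest possible word and only afterwards decides whether that whole word is a keyword or an identifier. I'm assuming here that keywords are case-insensitive and that `UNKNOWN` is the third logical literal.

```rust
#[derive(Debug, Clone, PartialEq)]
enum Token {
    True,
    False,
    Unknown,
    SimpleId(String),
    /// Placeholder for everything this sketch does not tokenize properly.
    Other(char),
}

/// Decide what a complete word is, after it has been read in full.
fn classify_word(word: &str) -> Token {
    match word.to_ascii_uppercase().as_str() {
        "TRUE" => Token::True,
        "FALSE" => Token::False,
        "UNKNOWN" => Token::Unknown,
        _ => Token::SimpleId(word.to_string()),
    }
}

fn lex(input: &str) -> Vec<Token> {
    let mut tokens = Vec::new();
    let mut chars = input.chars().peekable();
    while let Some(&c) = chars.peek() {
        if c.is_whitespace() {
            chars.next();
        } else if c.is_ascii_alphabetic() {
            // Maximal munch: take letters, digits and underscores
            // until the word ends, then classify the whole word.
            let mut word = String::new();
            while let Some(&d) = chars.peek() {
                if d.is_ascii_alphanumeric() || d == '_' {
                    word.push(d);
                    chars.next();
                } else {
                    break;
                }
            }
            tokens.push(classify_word(&word));
        } else {
            tokens.push(Token::Other(c));
            chars.next();
        }
    }
    tokens
}
```

Here `lex("TrueNorth")` yields a single `SimpleId` token and `lex("True")` yields `Token::True`, so the parser on top never sees the ambiguity and rule ordering stops mattering for this case.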
I have actually started writing a parser/lexer for the EXPRESS language, though I'm not sure yet whether I will take this project much further (I guess I underestimated the scope of supporting STEP completely).
My original motivation was better error recovery and error messages (by using something like chumsky as the parser combinator library).
I think the lexer is almost complete, so you may be interested in this: https://github.com/Philipp-M/express-parser/blob/6464b29e5eb14d70b0445b84567ed58fdfd144b6/src/lexer.rs