Closed tpluscode closed 10 years ago
Hi,
Yeah, the RepeatParser is certainly greedy. Though you do have options for this scenario:
First, you can enforce the rule that a '.' must sit between two other characters, such as:
pn_chars_base & -((pn_chars & '.' & pn_chars) | pn_chars);
Or, if performance is a big concern (e.g. you are parsing millions of these things in a server scenario), you can use the RepeatParser's separator to allow an optional period between each character:
pn_chars_base & (-pn_chars).SeparatedBy(~(Parser)'.');
Hope this helps!
Hm, thinking about that for a second, this will also work and is a little cleaner:
pn_chars_base & -(('.' & pn_chars) | pn_chars)
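To make the behaviour of this last rule concrete, here is a minimal hand-written sketch in Python. It is not Eto.Parse code: `match_pn_prefix`, `is_pn_chars_base` and `is_pn_chars` are hypothetical helpers, and the character classes are simplified to ASCII as an assumption (the real W3C classes also cover many Unicode ranges).

```python
def is_pn_chars_base(ch):
    # Assumption: simplified to ASCII letters only.
    return ch.isascii() and ch.isalpha()


def is_pn_chars(ch):
    # Assumption: simplified to ASCII letters, digits, '_' and '-'.
    return ch.isascii() and (ch.isalnum() or ch in "_-")


def match_pn_prefix(text):
    """Return how many leading characters match
    pn_chars_base & -(('.' & pn_chars) | pn_chars), or None if none do."""
    if not text or not is_pn_chars_base(text[0]):
        return None
    pos = 1
    while pos < len(text):
        # A '.' is only consumed when a pn_chars character follows it.
        if text[pos] == "." and pos + 1 < len(text) and is_pn_chars(text[pos + 1]):
            pos += 2
        elif is_pn_chars(text[pos]):
            pos += 1
        else:
            break
    return pos


print(match_pn_prefix("rdf:type"))  # 3, i.e. "rdf"
print(match_pn_prefix("a.b:c"))     # 3, i.e. "a.b"
print(match_pn_prefix("a.:c"))      # 1, the trailing '.' is left unconsumed
```

Because every '.' is consumed together with the character after it, the repeat can never end on a period, so no look-ahead past the prefix is needed. One side effect in this sketch is that consecutive dots (for example a..b) stop the match after the first character, which is stricter than the W3C rule.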
Thanks, it works of course. I haven't written any grammars since my studies. However, the W3C syntax makes me wonder. I understand that the semantics of
(PN_CHARS | '.')* PN_CHARS
mean that the matching string contains a PN_CHARS preceded by zero or more PN_CHARS or dots. However, without peeking forward while parsing the repeated group, the parser cannot know whether any given matched character should actually be matched by the next production in the sequence. Hence the greedy behaviour, which is logical.
The question, though, is whether this is valid EBNF notation and whether such parsers are actually used. Or is this syntax just more human-readable and intended to be easier to comprehend in written form?
This is valid EBNF notation, and it works with LALR parsers, though it does not work with LL parsers like Eto.Parse. Being recursive descent, the repeat parser knows nothing about what should come after it. With LALR parsers, a huge 'table of possibilities' is typically created, which allows them to handle patterns like this.
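The difference can be illustrated with a small Python sketch. It does not use Eto.Parse or an LALR parser: `greedy_tail` is a hypothetical hand-written matcher that imitates a greedy repeat with no backtracking, and Python's backtracking `re` engine stands in for a matcher that can give characters back. The ASCII-only character class is a simplifying assumption.

```python
import re


def is_pn_chars(ch):
    # Assumption: simplified to ASCII letters, digits, '_' and '-'.
    return ch.isascii() and (ch.isalnum() or ch in "_-")


def greedy_tail(text, pos):
    """Match (PN_CHARS | '.')* PN_CHARS greedily, without backtracking.

    Returns the end position on success, or None on failure.
    """
    # The repeat consumes every character it can, knowing nothing
    # about what is supposed to follow it.
    while pos < len(text) and (is_pn_chars(text[pos]) or text[pos] == "."):
        pos += 1
    # The trailing PN_CHARS now finds only ':' (or the end of input).
    if pos < len(text) and is_pn_chars(text[pos]):
        return pos + 1
    return None


# The same rule as a regular expression; the re engine backtracks,
# so the repeat gives back the final character when needed.
BACKTRACKING_TAIL = re.compile(r"(?:[A-Za-z0-9_-]|\.)*[A-Za-z0-9_-]")

print(greedy_tail("df:type", 0))                     # None
print(BACKTRACKING_TAIL.match("df:type").group())    # df
```

The greedy version fails on every input, because whatever the trailing PN_CHARS could match has already been consumed by the repeat. That is why rewriting the production, as suggested earlier in the thread, is the practical fix for a parser of this kind.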
I've pondered the concept of adding look-ahead to Eto.Parse and it might be doable. However, it may degrade performance, which is not what I'd like to see.
Thanks for clarifying this for me.
I certainly won't need the enhanced functionality, given that I was able to achieve the desired result by adjusting my productions.
Hi
I'm trying to create a grammar to parse SPARQL property paths, as defined here. I only need part of that vocabulary. Unfortunately, W3C uses their own EBNF syntax, so instead of translating it to vanilla EBNF I decided to try rewriting the relevant rules using your shortcut syntax, which I find quite neat.
However, I've bumped into problems with the rule PN_PREFIX.
In short, PN_PREFIX should match the prefix of a QName URI. For example, given the QName rdf:type, it would match the rdf part. As per the rule, the first character must be a letter, and then additional characters are allowed.
I rewrote PN_PREFIX as
Here, pn_chars_base matches the 'r' and then the RepeatParser matches 'df', which is unfortunate because then pn_chars fails, because it doesn't match the colon, thus failing the entire optional pattern.
The intent is that pn_chars_base, the pn_chars inside the repeat and the last pn_chars match 'r', 'd' and 'f' respectively, so that the entire pn_prefix matches 'rdf'.
Any idea what's not right with my grammar?