Greedy repeat parser problem

tpluscode commented 10 years ago

Hi

I'm trying to create a grammar to parse SPARQL property paths, as defined here. I only need part of that vocabulary. Unfortunately W3C uses their own EBNF syntax so instead of tranlating it to vanilla EBNF I decided to try rewrite the relevant rules using your shortcut syntax, which I find quite neat.

However, I've bumped into problems with rule PN_PREFIX.

PN_PREFIX = PN_CHARS_BASE ((PN_CHARS | '.')* PN_CHARS)?
PN_CHARS_BASE = (* basically any Unicode letter character *)
PN_CHARS = PN_CHARS_U | '-' | [0-9] | #x00B7 | [#x0300-#x036F] | [#x203F-#x2040]
PN_CHARS_U = PN_CHARS_BASE | '_'

In short, PN_PREFIX should match the prefix of a QName URI. For example, given a QName rdf:type it would be matching the rdf part. As per the rule, the first character must be a letter, and then additionally characters are allowed.

I rewrote PN_PREFIX as

pn_chars_base & ~(-(pn_chars | '.') & pn_chars)

pn_chars_base matches the r and then the RepeatParser matches df, which is unfortunate because then pn_chars fails, because it doesn't match the colon, thus failing entire optional pattern.

The intent is that pn_chars_base, pn_chars inside repeat and the last pn_chars matched r, d and f respectively so that the entire pn_prefix matched rdf.

Any idea what's not right with my grammar?

cwensley commented 10 years ago

Hi,

Yeah, the RepeatParser is certainly greedy. Though you do have options for this scenario:

First, you can enforce the rule that a '.' must follow another character such as:

pn_chars_base & -((pn_chars & '.' & pn_chars) | pn_chars);

Or, if performance is a big concern (i.e. you are parsing millions of these things in a server scenario) you can use the RepeatParser's separator to allow an optional period in-between each character:

pn_chars_base & (-pn_chars).SeparatedBy(~(Parser)'.');

Hope this helps!

cwensley commented 10 years ago

Hm, thinking about that for a second, this will also work and is a little cleaner:

pn_chars_base & -(('.' & pn_chars) | pn_chars)

tpluscode commented 10 years ago

Thanks it works of course. I haven't done any grammars since studies. However the W3C syntax makes me wonder. I understand that the semantics of

(PN_CHARS | '.')* PN_CHARS

mean that matching string contains a PN_CHARS preceeded by zero or more PN_CHARS or a dot. However without peeking forward while parsing the repeated group the parser cannot know if any given matched character should actually be matched by the next production in sequence. Hence the greedy behaviour and it is logical.

The question though is, is it a valid EBNF notation and such parsers are actually used? Or is this syntax just more human-readable and indented to be easier to comprehend in written form?

cwensley commented 10 years ago

This is valid EBNF notation, and works with LALR parsers, though does not work with LL parsers like Eto.Parse. Being recursive descent, the repeat parser knows nothing about what should come after it. With LALR parsers, a huge 'table of possibilities' is typically created which allows it to handle patterns like this.

I've pondered the concept of adding look-ahead to Eto.Parse and it might be doable, however it may degrade performance which is not what I'd like to see.

tpluscode commented 10 years ago

Thanks for clarifying this for me.

I certainly won't need the enchanced functonality, given that I was able to achieve the desired result by adjusting my productions.

picoe / Eto.Parse

Greedy repeat parser problem #9