Closed · modulovalue closed this issue 1 year ago

Consider the following grammar:

If we use the playground to parse `abstract`, we get the following result. This is a little confusing to me:

```
"a" "b" "s" "t" "r" "a" "c" "t"
```

via ident. Why does owl not report any ambiguities here?
The input is tokenized before parsing, and “abstract” will always be tokenized into a single identifier rather than a sequence of letter tokens.
The start symbol is always the first rule: see https://github.com/ianh/owl/blob/master/doc/grammar-reference.md
> The input is tokenized before parsing, and “abstract” will always be tokenized into a single identifier rather than a sequence of letter tokens.
How do you decide to tokenize the input into a single identifier? Do you derive a regular grammar from the given specification? And if you do, how do you decide, in this case, to prefer kabstract over ident?
In any case, if we build an NFA for each of kabstract and ident, connect them via epsilon transitions from a new start state, and then build a DFA, the resulting automaton contains an ambiguous accepting state for the input "abstract", because both kabstract and ident match "abstract".
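To make that concrete, here is a minimal sketch in Python of exactly this construction. It is unrelated to owl's implementation; all names are invented for illustration:

```python
from itertools import count

fresh = count()  # global supply of unique state ids

def literal_nfa(s, tag):
    """NFA accepting exactly the string s; its accepting state is labelled tag."""
    states = [next(fresh) for _ in range(len(s) + 1)]
    trans = {(states[i], c): {states[i + 1]} for i, c in enumerate(s)}
    return states[0], trans, {states[-1]: tag}

def plus_nfa(chars, tag):
    """NFA accepting one or more characters from chars (like atoz_lower+)."""
    start, accept = next(fresh), next(fresh)
    trans = {}
    for c in chars:
        trans.setdefault((start, c), set()).add(accept)
        trans.setdefault((accept, c), set()).add(accept)  # loop for "one or more"
    return start, trans, {accept: tag}

def matching_tags(nfas, text):
    """Subset construction on the fly: start from the union of all start
    states (the epsilon closure of the new start state) and run the
    combined automaton over text. Returns the set of token tags that accept."""
    trans, accepting = {}, {}
    for _, t, a in nfas:
        for key, targets in t.items():
            trans.setdefault(key, set()).update(targets)
        accepting.update(a)
    state = frozenset(start for start, _, _ in nfas)
    for c in text:
        state = frozenset(n for q in state for n in trans.get((q, c), ()))
    return {accepting[q] for q in state if q in accepting}

kabstract = literal_nfa("abstract", "kabstract")
ident = plus_nfa("abcdefghijklmnopqrstuvwxyz", "ident")

print(matching_tags([kabstract, ident], "abstract"))  # {'kabstract', 'ident'}
print(matching_tags([kabstract, ident], "foobar"))    # {'ident'}
```

For "abstract" the final DFA state accepts for both token rules, which is the ambiguity in question.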
If we remove the kabstract rule, then the input is parsed into an ident, as expected:
```
start = ident
ident = atoz_lower+
atoz_lower = "a"|"b"|"c"|"d"|"e"|"f"|"g"|"h"|"i"|"j"|"k"|"l"|"m"|"n"|"o"|"p"|"q"|"r"|"s"|"t"|"u"|"v"|"w"|"x"|"y"|"z"
```

```
input: abstract

.  a          b          s          t          r          ..
   atoz_lower atoz_lower atoz_lower atoz_lower atoz_lower
ident----------------------------------------------------

.. a          c          t
   atoz_lower atoz_lower atoz_lower
   ----------------------------------
```
And even an unreachable kabstract rule leads to an error at runtime:
```
start = ident
ident = atoz_lower+
kabstract = "abstract"
atoz_lower = "a"|"b"|"c"|"d"|"e"|"f"|"g"|"h"|"i"|"j"|"k"|"l"|"m"|"n"|"o"|"p"|"q"|"r"|"s"|"t"|"u"|"v"|"w"|"x"|"y"|"z"
```

---

```
input: abstract

error: unexpected token 'abstract'

abstract
~~~~~~~~
```

---

```
input: foobar

.  f          o          o          b          a          ..
   atoz_lower atoz_lower atoz_lower atoz_lower atoz_lower
ident----------------------------------------------------

.. r
   atoz_lower
   ------------
```
Could it be that the determinization process is not quite complete in cases like these?
There's a separate tokenization step that looks at all the "atomic patterns" (as defined in https://github.com/ianh/owl/blob/master/doc/grammar-reference.md) in the grammar and does maximal-length matching. Here's the text from the manual:
> Owl decomposes the input text into tokens based on all the atomic patterns that appear in the grammar. At each point, the longest token which matches the input text is chosen. In case of ties between keywords and other tokens, the keyword is chosen.
It doesn't matter whether or not the rule is reachable, just whether or not the pattern appears somewhere in the grammar.
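As a toy illustration of that rule (this is not owl's code; the names and structure are made up), a maximal-munch tokenizer with keyword preference over the atomic patterns of the grammar above behaves like this:

```python
# Atomic patterns collected from the example grammar. kabstract contributes
# the keyword "abstract" even though the rule itself is unreachable.
KEYWORDS = ["abstract"]
LETTERS = list("abcdefghijklmnopqrstuvwxyz")

def tokenize(text):
    tokens, i = [], 0
    while i < len(text):
        if text[i].isspace():  # ASCII whitespace separates tokens
            i += 1
            continue
        candidates = []  # (length, is_keyword, kind, lexeme)
        for kw in KEYWORDS:
            if text.startswith(kw, i):
                candidates.append((len(kw), True, "keyword", kw))
        for ch in LETTERS:
            if text.startswith(ch, i):
                candidates.append((len(ch), False, "letter", ch))
        if not candidates:
            raise ValueError(f"no token matches at position {i}")
        # Longest match wins; on ties, keywords beat other tokens
        # (True sorts above False).
        length, _, kind, lexeme = max(candidates, key=lambda c: (c[0], c[1]))
        tokens.append((kind, lexeme))
        i += length
    return tokens

print(tokenize("abstract"))  # [('keyword', 'abstract')] -- one eight-character token
print(tokenize("foobar"))    # [('letter', 'f'), ('letter', 'o'), ('letter', 'o'), ...]
```

This is also why the second grammar above rejects `abstract`: the tokenizer emits a single keyword token, which `start = ident` has no way to consume, while `foobar` still decomposes into letter tokens.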
BTW, I should mention, using owl as a lexical analyzer seems like it could be possible, but it's not what it's designed for (as you're probably realizing at this point). The ideal case it's built to support is languages that use built-in token types like `number` and `identifier` together with fixed keywords separated by ASCII whitespace. A more powerful way to specify tokens (using regexes or something like that) would be nice, but for the time being, custom token parsing is left to the C code that calls into the generated parser.
Thanks again for your help! I'm going to close this issue now that #38 is filed.