Closed · modulovalue closed this issue 1 year ago

Consider the following grammar:

If we use the playground to parse `abstract`, we get the following result. This is a little confusing to me:

```
"a" "b" "s" "t" "r" "a" "c" "t"
```

via ident. Why does owl not report any ambiguities here?
The input is tokenized before parsing, and “abstract” will always be tokenized into a single identifier rather than a sequence of letter tokens.
The start symbol is always the first rule: see https://github.com/ianh/owl/blob/master/doc/grammar-reference.md
> The input is tokenized before parsing, and “abstract” will always be tokenized into a single identifier rather than a sequence of letter tokens.
How do you decide to tokenize the input into a single identifier? Do you derive a regular grammar from the given specification? And if you do, how do you decide, in this case, to prefer kabstract over ident?
In any case, if we build an NFA for each of kabstract and ident, connect them via epsilon transitions from a new start state, and then build a DFA, the resulting automaton contains an ambiguous accepting state for the input "abstract", because both kabstract and ident match "abstract".
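To make that concrete, here is a minimal sketch in Python of exactly this construction. It is unrelated to owl's implementation; all names are invented for illustration:

```python
from itertools import count

fresh = count()  # global supply of unique state ids

def literal_nfa(s, tag):
    """NFA accepting exactly the string s; its accepting state is labelled tag."""
    states = [next(fresh) for _ in range(len(s) + 1)]
    trans = {(states[i], c): {states[i + 1]} for i, c in enumerate(s)}
    return states[0], trans, {states[-1]: tag}

def plus_nfa(chars, tag):
    """NFA accepting one or more characters from chars (like atoz_lower+)."""
    start, accept = next(fresh), next(fresh)
    trans = {}
    for c in chars:
        trans.setdefault((start, c), set()).add(accept)
        trans.setdefault((accept, c), set()).add(accept)  # loop for "one or more"
    return start, trans, {accept: tag}

def matching_tags(nfas, text):
    """Subset construction on the fly: start from the union of all start
    states (the epsilon closure of the new start state) and run the
    combined automaton over text. Returns the set of token tags that accept."""
    trans, accepting = {}, {}
    for _, t, a in nfas:
        for key, targets in t.items():
            trans.setdefault(key, set()).update(targets)
        accepting.update(a)
    state = frozenset(start for start, _, _ in nfas)
    for c in text:
        state = frozenset(n for q in state for n in trans.get((q, c), ()))
    return {accepting[q] for q in state if q in accepting}

kabstract = literal_nfa("abstract", "kabstract")
ident = plus_nfa("abcdefghijklmnopqrstuvwxyz", "ident")

print(matching_tags([kabstract, ident], "abstract"))  # {'kabstract', 'ident'}
print(matching_tags([kabstract, ident], "foobar"))    # {'ident'}
```

For "abstract" the final DFA state accepts for both token rules, which is the ambiguity in question.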
If we remove the kabstract rule, then the input is parsed into an ident, as expected:
```
start = ident
ident = atoz_lower+
atoz_lower = "a"|"b"|"c"|"d"|"e"|"f"|"g"|"h"|"i"|"j"|"k"|"l"|"m"|"n"|"o"|"p"|"q"|"r"|"s"|"t"|"u"|"v"|"w"|"x"|"y"|"z"
```

```
input: abstract

.  a          b          s          t          r          ..
   atoz_lower atoz_lower atoz_lower atoz_lower atoz_lower
ident----------------------------------------------------

.. a          c          t
   atoz_lower atoz_lower atoz_lower
   ----------------------------------
```
And even an unreachable kabstract rule leads to an error at runtime:
```
start = ident
ident = atoz_lower+
kabstract = "abstract"
atoz_lower = "a"|"b"|"c"|"d"|"e"|"f"|"g"|"h"|"i"|"j"|"k"|"l"|"m"|"n"|"o"|"p"|"q"|"r"|"s"|"t"|"u"|"v"|"w"|"x"|"y"|"z"
```

---

```
input: abstract

error: unexpected token 'abstract'

abstract
~~~~~~~~
```

---

```
input: foobar

.  f          o          o          b          a          ..
   atoz_lower atoz_lower atoz_lower atoz_lower atoz_lower
ident----------------------------------------------------

.. r
   atoz_lower
   ------------
```
Could it be that the determinization process is not quite complete in cases like these?
There's a separate tokenization step that looks at all the "atomic patterns" (as defined in https://github.com/ianh/owl/blob/master/doc/grammar-reference.md) in the grammar and does maximal-length matching. Here's the text from the manual:
> Owl decomposes the input text into tokens based on all the atomic patterns that appear in the grammar. At each point, the longest token which matches the input text is chosen. In case of ties between keywords and other tokens, the keyword is chosen.
It doesn't matter whether or not the rule is reachable, just whether or not the pattern appears somewhere in the grammar.
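As a toy illustration of that rule (this is not owl's code; the names and structure are made up), a maximal-munch tokenizer with keyword preference over the atomic patterns of the grammar above behaves like this:

```python
# Atomic patterns collected from the example grammar. kabstract contributes
# the keyword "abstract" even though the rule itself is unreachable.
KEYWORDS = ["abstract"]
LETTERS = list("abcdefghijklmnopqrstuvwxyz")

def tokenize(text):
    tokens, i = [], 0
    while i < len(text):
        if text[i].isspace():  # ASCII whitespace separates tokens
            i += 1
            continue
        candidates = []  # (length, is_keyword, kind, lexeme)
        for kw in KEYWORDS:
            if text.startswith(kw, i):
                candidates.append((len(kw), True, "keyword", kw))
        for ch in LETTERS:
            if text.startswith(ch, i):
                candidates.append((len(ch), False, "letter", ch))
        if not candidates:
            raise ValueError(f"no token matches at position {i}")
        # Longest match wins; on ties, keywords beat other tokens
        # (True sorts above False).
        length, _, kind, lexeme = max(candidates, key=lambda c: (c[0], c[1]))
        tokens.append((kind, lexeme))
        i += length
    return tokens

print(tokenize("abstract"))  # [('keyword', 'abstract')] -- one eight-character token
print(tokenize("foobar"))    # [('letter', 'f'), ('letter', 'o'), ('letter', 'o'), ...]
```

This is also why the second grammar above rejects `abstract`: the tokenizer emits a single keyword token, which `start = ident` has no way to consume, while `foobar` still decomposes into letter tokens.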
BTW, I should mention, using owl as a lexical analyzer seems like it could be possible, but it's not what it's designed for (as you're probably realizing at this point). The ideal case it's built to support is languages that use built-in token types like `number` and `identifier` together with fixed keywords separated by ASCII whitespace. A more powerful way to specify tokens (using regexes or something like that) would be nice, but for the time being, custom token parsing is left to the C code that calls into the generated parser.
Thanks again for your help! I'm going to close this issue now that #38 is filed.