modulovalue opened this issue 1 year ago
I've just found this: http://smlweb.cpsc.ucalgary.ca/example-grammars/ by chasing links starting at the readme of this project. This contains a bunch of grammar examples for various grammar classes.
Yes, those would be good examples. They're already imported for test fixtures: https://github.com/mdaines/grammophone/blob/76612cb618f22fa7aaf9ec53f2ddefb6194f85b2/test/fixtures/example_grammars.js
Here are two more that I think are interesting because they have practical relevance to e.g. parameter/argument lists that allow optional trailing commas:
This grammar attempts to capture interspersed lists where the separator can optionally occur at the end of the list.
The following grammar is SLR(1): on grammophone
S -> L ','.
S -> L.
L -> L ',' a.
L -> a.
The following grammar wraps the list from above in a separate production rule, which makes it not LR(1): on grammophone
S -> P ','.
S -> P.
P -> L.
L -> L ',' a.
L -> a.
A parser generator that naively transforms an "interspersion" construct to one with a level of indirection (like in the second grammar) would generate a grammar that is not LR(1).
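Both grammars above should describe the same regular language: a non-empty comma-separated list of `a` with an optional trailing comma. Here's a small sanity-check sketch (the helper names and the brute-force enumeration are mine, not from any linked source):

```python
import re

# A non-empty comma-separated list of 'a' with an optional trailing
# comma; both grammars above should generate exactly this language.
LANG = re.compile(r"a(,a)*,?\Z")

def in_language(s: str) -> bool:
    return LANG.match(s) is not None

def sentences(max_items: int):
    """Brute-force sentences of the first grammar:
    S -> L ',' | L and L -> L ',' a | a, so L derives a, a,a, a,a,a, ..."""
    out = set()
    for n in range(1, max_items + 1):
        body = ",".join(["a"] * n)
        out.add(body)        # S -> L
        out.add(body + ",")  # S -> L ','
    return out
```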
Here's a gist I found that contains more: https://gist.github.com/lin/dc83bb38eb458ded3ff01aec4a327d54
This grammar demonstrates a source of inefficiency in the automaton of the standard LR(1) construction.
Looking at the LR(1) automaton, I think that states 13 & 18 and states 17 & 12 could be merged, and the resulting automaton would still resolve the ambiguity that LR(1) was meant to resolve. Only 6 & 9 and 14 & 19 are required to be distinct new states.
This grammar is given as an example in some slides that introduce langcc. However, I'm not sure if langcc eliminates this source of inefficiency.
It appears that the act of eliminating this inefficiency is known as "minimizing" an LR automaton.
Edit: the author of the booze-tools project wrote a little something about this grammar: https://boozetools.readthedocs.io/en/latest/minimal.html
Superseded by https://github.com/mdaines/grammophone/issues/26#issuecomment-1657506763
This grammar is LR(2), but can be easily converted to LR(1) by a procedure that heavily reminds me of recursion scheme fusion.
Source: "A grammar to recognise parameter lists"
(type_name id.','+).','+
Two separate comma-separated + lists.

(type_name id.','*).','+
Two separate comma-separated lists where the outer list is + and the inner is *.
Here are two other interesting grammars:
When implementing an LALR parser using the Deremer and Pennello relation-based algorithm, it is easy to implement something that is known as NQLALR (Not-Quite-LALR), and apparently, back when this was a hot research topic, many people made this mistake.
According to some papers, NQLALR doesn't fit into the SLR -> LALR hierarchy; that is, there are grammars that are SLR(1) but not NQLALR(1). This observation can be used to debug an LALR implementation.
The following papers present example grammars that can be used to do that:
Simple Computation of LALR(1) Lookahead Sets, Manuel Bermudez
An LALR(1) grammar that is non-SLR(1) and non-NQLALR(1).
An SLR(1) grammar that is not NQLALR(1).
The following paper discusses an incorrect oversimplification that one can make while implementing their algorithm.
Efficient Computation of LALR(1) Look-Ahead Sets, Frank DeRemer & Thomas Pennello
The paper doesn't discuss this, but I think it is possible to view the whole algorithm as a single dataflow problem. Their wording implies that two separate SCC runs are necessary; I conjecture that this is not the case. In any case, that grammar might also be valuable for verifying that an implementation is not incorrect in some obvious way.
According to the paper, the following grammar should be LALR(1) and an incorrect implementation would report it as not being LALR(1).
The following paper discusses a grammar that is "almost" LALR(1). It has 9 inadequate states under LR(0). 8 of them can be resolved using LALR(1) and there's one left that needs LALR(2).
Practical Arbitrary Lookahead LR Parsing, Manuel E. Bermudez and Karl M. Schimpf
DeRemer & Pennello conjectured that if the "includes" relation of their LALR(1) construction algorithm contains a nontrivial SCC (... this misses the read condition, see the paper), then the grammar can't be LR(k) for any k. This was proven in "A Short Proof of a Conjecture of DeRemer and Pennello" by Thomas J. Sager. These are example grammars that Sager gives where that is the case:
In general, this property is undecidable. This is one of the few heuristics that are known for detecting whether a CFG is not LR(k) for any k.
The PhD Thesis by Philippe Charles on LALR(k) parsing and automatic error recovery discusses the following three grammars extensively (Figure 3.1, Page 31). They all model a simple BNF grammar. He uses them to highlight the importance of LALR(k) where k is > 1. ...
... (a) is LALR(2), (b) is a version of (a) that was converted to LALR(1) and (c) is a version of (a) that requires an explicit separator to be LALR(1) to maintain the shape of (a).
Note: this comment is very similar to this comment https://github.com/mdaines/grammophone/issues/26#issuecomment-1598023392, the transformation from (a) to (b) appears to be the same.
In Efficient Computation of LALR(1) Look-Ahead Sets, Frank DeRemer & Thomas Pennello, the authors prove that if there's a nontrivial SCC in the reads relation of their LALR(1) algorithm, then the grammar is not LR(k) for any k.
They give 2 grammars as examples. One is ambiguous, the other is not. This is the unambiguous grammar that is not LR(k) for any k. (The other ambiguous grammar is that one without the f at the end of the A production on grammophone.)
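A sketch of how an implementation might detect such nontrivial SCCs in the reads (or includes) relation, using Tarjan's algorithm. The representation of the relation as a collection of (u, v) pairs is my own assumption, not anything from the papers:

```python
def nontrivial_sccs(edges):
    """Tarjan's algorithm over a relation given as a collection of
    (u, v) pairs. Returns the SCCs that are nontrivial: more than one
    node, or a single node with a self-loop. Recursive, which is fine
    for the small relations that come out of typical grammars."""
    graph = {}
    for u, v in edges:
        graph.setdefault(u, []).append(v)
        graph.setdefault(v, [])
    index, low = {}, {}
    stack, on_stack = [], set()
    sccs = []
    counter = [0]

    def strongconnect(v):
        index[v] = low[v] = counter[0]
        counter[0] += 1
        stack.append(v)
        on_stack.add(v)
        for w in graph[v]:
            if w not in index:
                strongconnect(w)
                low[v] = min(low[v], low[w])
            elif w in on_stack:
                low[v] = min(low[v], index[w])
        if low[v] == index[v]:
            scc = []
            while True:
                w = stack.pop()
                on_stack.discard(w)
                scc.append(w)
                if w == v:
                    break
            if len(scc) > 1 or (v, v) in edges:
                sccs.append(frozenset(scc))

    for v in list(graph):
        if v not in index:
            strongconnect(v)
    return sccs
```

Per the conjecture, a nonempty result (modulo the read condition caveat above) would indicate the grammar is not LR(k) for any k.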
The following grammar is presented in the following paper that was published alongside langcc (cf. page 5 bullet 4):
Field -> S E.
Field -> E E.
S -> s.
S -> .
It is not LR(1), but can be made SLR(1) by the CPS transformation in that paper, which would transform it to the following grammar:
Field -> S.
Field -> E E.
S -> s E.
S -> E.
It can also be made SLR(1) by simply inlining S on grammophone.
In his dissertation on LR parsing (page 94, first bullet, third paragraph), Xin Chen claims that it is widely known that only reduce/reduce conflicts can be resolved by state splitting and not shift/reduce conflicts.
I think this is an interesting fact that doesn't seem obvious to me and I wanted to capture it here.
It can be observed in the first grammar in this issue and in the grammar mentioned in this comment. The LALR(1) parsing tables contain only reduce/reduce conflicts so, I guess, in an efficient implementation, it would make sense to start with state splitting to resolve those reduce/reduce conflicts and not to consider shift/reduce conflicts at all.
Considering that claim, it is clear that the (a) grammar in https://github.com/mdaines/grammophone/issues/26#issuecomment-1657506763 can't be determinized with state splitting alone. As the paper claims, either it needs to be transformed to (b), or more lookahead is needed.
This LR(2) grammar contains a reduce/reduce conflict in its LR(1) automaton and is used as an example for doing LR(k) at runtime in the dissertation by Xin Chen (Page 164)
It is also an example of a source of inefficiency (similar to https://github.com/mdaines/grammophone/issues/26#issuecomment-1597730431) that the canonical LR(1) construction has. LR(1) doesn't help with resolving any inadequate states (LR(0) has 3, LALR(1) resolves two, LR(2) resolves one), but the canonical LR(1) construction still splits one state. An efficient implementation of LR(k) should consider that and not split states that don't help with resolving inadequate states. (TODO or maybe it needs to split this state in LR(1) so that LR(2) can be effective?)
This mentions a yacc grammar snippet for a language that is LR(2), though the grammar itself is not. This is also discussed in Xin Chen's dissertation (here, on page 167). The first link in this comment contains some prose that discusses different ways of dealing with it, for example, rewriting it to an LR(1) grammar such as b or c.
After a little bit of cleaning up, we get the following grammar:
# rule+
yacc -> rule | yacc rule.
# rulebody ";"?
rule -> rulebody | rulebody ";".
# (RULENAME ':' alt."|"+)?
rulebody -> RULENAME ':' alt | rulebody "|" alt.
# (TOKEN | RULENAME)*
alt -> alt TOKEN | alt RULENAME | .
TODO: http://www.cs.man.ac.uk/~pjj/complang/howto.html TODO: http://www.cs.man.ac.uk/~pjj/complang/howto2.html TODO: http://www.cs.man.ac.uk/~pjj/complang/grammar.html TODO: http://www.cs.man.ac.uk/~pjj/complang/g2lr.html TODO: http://www.cs.man.ac.uk/~pjj/cs2121/ho/node7.html TODO: http://www.cs.man.ac.uk/~pjj/complang/g2ll.html TODO: How can we go from that to (b) or (c), see derivation from above.
Consider (b):
yacc -> RULENAME ':' rulebody T.
T -> ";" | yacc | ";" yacc | .
rulebody ->
| rulebody TOKEN
| rulebody RULENAME
| rulebody "|".
yacc -> RULENAME ':' rulebody.
T -> ";" | yacc | ";" yacc | .
rulebody -> T
| TOKEN rulebody
| RULENAME rulebody
| "|" rulebody.
Chris Clark suggested a grammar class that he calls LR(closed) here. An example grammar can be found on grammophone.
Quote:
Here is a grammar that isn't LR(k) for any k, for a language that
could have an LR(1) grammar for it--in fact the language is regular.
The language is (a|b)a+(a|b).
S : a A a ;
S : b A b ;
S : a B b ;
S : b B a ;
A : A a ;
B : B a ;
A : a ;
B : a ;
Note, this grammar is particularly interesting to me as it is part of
what we tried to solve in Yacc++, when we did LR(infinite) (or what I
sometimes call LR(closed)) parsing. If you look carefully, you will
see that you can parse it deterministically. However, you have to push
an indefinite amount of rules on the "stack". Still, when you get
done, and you see the EOF marker, you can then deterministically
resolve the rules you had pushed. I believe that every unambiguous
CFG has such a deterministic parse, and I believe that computing the
"closure" is the method to find it (if it exists), which is why I
sometimes refer to the method as LR(closed). Unfortunately, due to
the halting problem, there are grammars for which attempting to
compute the closure will never terminate.
It is also mentioned in Xin Chen's dissertation here on page 170.
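As a quick sanity check that the quoted grammar's language really is the regular `(a|b)a+(a|b)`, one can enumerate short sentences of the grammar and match them against that regex (the helper names are mine):

```python
import re

# The quoted claim: the grammar's language is the regular (a|b)a+(a|b).
LANG = re.compile(r"(a|b)a+(a|b)\Z")

def derive(max_mid: int):
    """Sentences of the quoted grammar: S -> a A a | b A b | a B b |
    b B a, where A and B both derive a+ (A : A a | a, B : B a | a)."""
    for n in range(1, max_mid + 1):
        mid = "a" * n
        for left, right in [("a", "a"), ("b", "b"), ("a", "b"), ("b", "a")]:
            yield left + mid + right
```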
This is interesting in a different way. Here, an LR split removes a local ambiguity on a 'b', but since it doesn't remove the local ambiguity on 'a', that ambiguity is duplicated by the split. The number of local ambiguities can therefore increase when going from LALR(1) to LR(1) if one is not careful about how they are counted.
We can clone A and B to get the following grammar: on grammophone Not LR(1)
We can convert the left recursion to right recursion: on grammophone Not LR(1)
We can then apply CPS to move the right context to the left into the base case of the recursion: on grammophone SLR(1)
@modulovalue I decided to enable discussions as an experiment, and created an example grammars section: https://github.com/mdaines/grammophone/discussions/categories/example-grammars
I appreciate you posting these grammars, but I was wondering if discussions might be a better format than a single issue. Let me know what you think; I'm not sure if I like GitHub discussions or not.
I still plan to add a few example grammars to Grammophone when I get a chance.
@mdaines I hope I didn't bother you too much with these examples. I'm wondering, is it possible for you to unsubscribe from notifications for this issue, or, since you've been mentioned and have participated in this issue, are you being notified about every new comment, even if you unsubscribe?
So far, my impression of the 'discussions' tab has been that most people don't use it.
I plan to eventually merge everything that I've shared here into a small set of examples that succinctly demonstrates important properties and transformations.
It's not bothersome. Feel free to keep adding them here if you like.
Also, if you end up publishing the examples you've collected somewhere else I'm happy to link to that.
Here's a simple formula for constructing a sample LR(k) grammar for k > 0 (source):
// Here c^k means the concatenation of k letters 'c'.
// This is a LR(k+1) grammar, depending on the value of k.
S -> a A D a.
S -> b A D b.
S -> a B D b.
S -> b B D a.
A -> a.
B -> a.
D -> c^k.
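A tiny sketch that instantiates the formula for a given k. The function name and the textual grammar format are my own; the claim that the result is LR(k+1) comes from the source above:

```python
def lrk_sample_grammar(k: int) -> str:
    """Instantiate the formula above for a given k >= 1; per the source,
    the resulting grammar should be LR(k+1)."""
    assert k >= 1
    return "\n".join([
        "S -> a A D a.",
        "S -> b A D b.",
        "S -> a B D b.",
        "S -> b B D a.",
        "A -> a.",
        "B -> a.",
        "D -> " + " ".join(["c"] * k) + ".",  # D -> c^k
    ])
```

The intuition: after shifting the leading `a` or `b`, the parser must decide between reducing A -> a and B -> a, but the distinguishing trailing `a`/`b` only appears after the k tokens of D, so k+1 lookahead is required.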
Here is an example of an "ill-founded" grammar as defined by the booze-tools project.
It seems to be similar to the "unrealizable nonterminals" metric that grammophone exposes.
It would be interesting to know whether the "ill-founded" concept adds something new or whether unrealizable nonterminals and ill-foundedness are one and the same thing.
Here, "well-founded" means "can possibly produce a finite sequence of terminals."
The opposite is called "ill-founded".
Here are two examples of ill-founded grammars:
S -> x S -- This is ill-founded because there's always one more S.
A -> B y; B -> A x -- Similar, but with mutual recursion.
A terminal symbol is well-founded. So is a nullable symbol, since zero is finite.
A rule with only well-founded symbols in the right-hand side is well-founded.
A non-terminal symbol with at least one well-founded rule is well-founded.
Induction applies. That induction happens to be called ``bipartite_closure``.
A grammar with only well-founded symbols is well-founded.
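The induction described above is a standard least-fixpoint computation. A minimal sketch, assuming a grammar represented as a dict from nonterminal to lists of right-hand sides (my own representation, not booze-tools'):

```python
def well_founded_nonterminals(grammar):
    """Least-fixpoint sketch of the induction described above.
    grammar: dict mapping nonterminal -> list of right-hand sides,
    each a list of symbols; symbols absent from the dict are terminals.
    A nonterminal is well-founded once at least one of its rules has a
    right-hand side made entirely of well-founded symbols (terminals
    count, and an empty right-hand side counts, since zero is finite)."""
    wf = set()
    changed = True
    while changed:
        changed = False
        for nt, rhss in grammar.items():
            if nt in wf:
                continue
            if any(all(s in wf or s not in grammar for s in rhs)
                   for rhs in rhss):
                wf.add(nt)
                changed = True
    return wf
```

A grammar would then be well-founded iff every nonterminal ends up in the returned set; whether this coincides exactly with grammophone's "unrealizable nonterminals" (the complement of the returned set) is the open question above.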
A naive implementation of GLR would have trouble with certain grammars containing epsilon productions. This is one: on grammophone
This is known as hidden left recursion.
## Tomita's GSS Algorithm:
When you boil it all down, Tomita's chief contribution is the idea to use
a directed acyclic graph rather than a tree of states: each node could have
more than one predecessor, and so after any given machine cycle, you have
at most one active graph-node per LR table state. (In practice, you'll have
many fewer such active nodes).
Neglecting epsilon rules, this works pretty well: you still have a search
problem to carry out reductions, but the performance bounds are polynomial
in the size of the input string.
The problem with epsilon-rules is not immediately obvious, but it's easy
to trigger with a hidden-left-recursive grammar such as this:
S -> E S a (rule 1)
S -> b (rule 2)
E -> :epsilon (rule 3)
It should be clear upon inspection that this context-free grammar describes
in fact a regular language (`ba*`) but the obvious variations on Tomita all
fail to recognize the string `baa`. Why? Well, with lookahead `b` we can
certainly shift the first token of rule 2 (where we'll soon recognize `S`)
but we can also possibly reduce rule 3 before shifting the `b`; we wind
up in the same state after shifting `b` both ways. Now if you carefully
play this scenario forward: the first branch dies; the second allows `ba`
but then can't go back in time to recognize however many `E` symbols must
have preceded the first `b` so as to consume the right number of `a`
tokens which follow.
There are a number of hacks which solve the problem for subsets of the
epsilon-grammars, but no simple solution for all of them.
Tomita recognized these difficulties but did not offer a complete solution.
The LALRPOP project provides an excellent explanation of how to implement LR(1) through the lane tracing method:
First example: on grammophone & the explanation
Second example: on grammophone & the explanation
This blogpost appears to be an extended version of that readme.
The IELR paper claims:
In this paper, we demonstrate that a well known algorithm described by David Pager and implemented in Menhir, the most robust minimal LR(1) implementation we have discovered, does not always achieve the full power of canonical LR(1) when the given grammar is non-LR(1) coupled with a specification for resolving conflicts.
This is the grammar that they use on grammophone.
Notice how grammophone identifies that there are 4 example sentences.
[...] If he has declared a as left-associative, the parser chooses to reduce in the cases of both aa·a and aa·aa. In this way, the grammar author has specified a new language consisting of only 3 sentences: aaa, bab, and baab.
So, apparently, a strategy that uses associativity declarations for removing ambiguities could lead to a different language being recognized.
A very simple LR(2) grammar on grammophone with some prose (source) that shows how to rewrite that grammar to LR(0) by removing duplicates.
This grammar shows that it might make sense to have some form of duplicate detection that identifies identical production rules.
This repo contains A TON of random LR(k) grammars and a bunch of LR(1) grammars in bison syntax.
This is an example grammar that is meant to be LRR (LR-regular): on grammophone.
It is a proper superclass of LR(k). That grammar comes from the following paper that introduces LRR:
LR-regular grammars—an extension of LR(k) grammars, Karel Čulik II and Rina Cohen
There's a somewhat common source of ambiguity in LR parsers that many people point out, where infinite lookahead would be needed to remove all inadequate states, so those grammars can't be LR(k) for any k. I'm wondering whether LRR could help with that. I'm also wondering about the relationship between Chris Clark's LR(closed) grammars and LRR.
It would be nice to find some minimal LRR grammars so that their automata can be inspected more easily.
I'm also wondering if LRR could be generalized to something like "LR-CF" where the lookahead sets themselves are not just regular, but descriptions of a context-free language. Or maybe at least visibly pushdown languages (https://github.com/ianh/owl), which support a limited form of recursion, which regular languages don't, and have desirable properties that context free languages don't have, but regular languages have.
Leo optimizes Earley-style parsers to take linear time for LRR grammars.
One parser that claims to be able to parse LRR grammars is Marpa; see "Marpa, a practical general parser: the recognizer", Jeffrey Kegler.
This paper by Kristensen and Madsen proposes to extend the LR-theory from CFGs to ECFGs. It argues that by doing that, many conflicts can be avoided.
(See page 8 for reference):
The following grammar:
S -> E.
E -> a b* b d.
E -> a b* b c.
can be desugared into a left-recursive version (on grammophone) or a right-recursive version (on grammophone), both of which are not LR(1).
The LR(1) automaton of the left-recursive version contains an r/r conflict where both productions are epsilon productions in one inadequate state.
The LR(1) automaton of the right-recursive version contains two of the same r/r conflicts, but with an additional shift conflict for each, in two inadequate states.
If we canonicalize b*, then the left-recursive version becomes SLR(1) (on grammophone); the same trick doesn't work with the right-recursive version (on grammophone). It still has an s/r conflict. More lookahead might be able to resolve this conflict.
Left-recursive definitions in bottom-up parsers are more efficient because they don't need O(n) stack space. However, something that is not clear to me is which kinds of inadequate states can benefit from a right-recursive definition. Right-recursive definitions don't appear to be strictly worse (excluding performance).
From a performance perspective, left-recursive definitions are superior. But left-recursive definitions can introduce inadequate states that right-recursive definitions can't introduce (https://github.com/mdaines/grammophone/issues/26#issuecomment-1817510191).
On page 9, Kristensen and Madsen show an alternative desugaring that is SLR(1). (There's a typo there: the last production in both grammars should be B -> b, otherwise they recognize a different language. Or that section is just wrong):
left-recursive: on grammophone right-recursive: on grammophone
But even there, the left-recursive version is LR(0), and the right-recursive version isn't. So it might make sense to prefer the left-recursive definition there (ignoring any performance benefits) because the increased expressivity might propagate to fewer inadequate states in more complex grammars?
If we apply the CPS idea from langcc, we can make the right recursive version LR(0): on grammophone
Consider page 9 and page 33:
A naive desugaring produces a non-LR(1) grammar: on grammophone. If we canonicalize, we get an LR(0) grammar on grammophone, but by doing that we would lose meaning (e.g., the semantic actions would have to be the same so an automatic transformation is not practical here.)
If we apply the automatic transformation, we can keep meaning and have an LR(0) grammar: on grammophone
Kristensen and Madsen don't seem to provide a method for applying their transformation automatically to ECFGs, but 50 years later, langcc did that, although the author doesn't mention that in the paper.
I think the key question here is whether we could benefit from extending the definitions of nullsets/firstsets/(slr) followsets/(lalr) lookaheadsets to ECFGs and introduce new actions in LR automata to support + (and interspersed +), * (and interspersed *), ? and | explicitly as first-class operations without having to desugar them to a mutilated CFG. (Regular expression to NFA to DFA conversions already need to implicitly consider such extended definitions of nullsets/firstsets/followsets, so this shouldn't be anything new.)
Applying any such transformations makes it much harder to keep a sane parse tree or do incremental parsing and, in general, it just complicates things by a lot.
I haven't seen a single project that implements what the paper says, i.e., extends the LR-theory to support ECFGs as a first-class specification. Maybe by doing that we can have all the performance benefits of left-recursive definitions, the hypothetical additional expressiveness of right-recursive definitions, support incremental parsing, and have sane parse trees without having to do anything really complicated.
(Note: it is not straightforward to restore ASTs from regular expressions, so I guess it won't be straightforward to restore parse trees from first class ECFGs. https://github.com/ianh/owl shows that the parse tree problem for regular expressions can be solved.)
What follows is a minimal example of what that post says:
If we want to parse a list of comma separated values with an optional trailing comma, we can:
Furthermore, we can:
Note: A right recursive definition is not enough if the optional trailing separator is not in the base case of the right recursive definition: (on grammophone => not LR(1)). Joe's CPS transformation achieves precisely that, that is, it moves the optional trailing separator into the base case.
However, we can have an LR(1) left-recursive definition of that language: on grammophone, so right recursion is not strictly necessary.
To conclude, a quote from the linked blogpost:
The idea that “left recursion is good, right recursion is bad” is a myth, at least in terms of conflicts. There are ways of working with left recursion and ways of working with right recursion; they are just not the same.
And to be complete, here's the example where a left recursive definition is superior:
Consider right-recursive on grammophone & left-recursive on grammophone
Notice how the rec nonterminal in the left-recursive version has a larger followset: label can follow rec in the left-recursive version, but not in the right-recursive version. Since LALR lookaheads are partitions of followsets, a similar observation could be made for LALR lookaheads.
Note:
~Right recursive (x.',')*','?: on grammophone~
The CPS technique (https://github.com/mdaines/grammophone/issues/26#issuecomment-1716641044) introduces a transformation that is able to move the right remainder of a rule into a nonterminal to its left. However, it would also make sense to do the same with left remainders, and move them into nonterminals to the right.
This is a comma-separated list that accepts an optional trailing comma: on grammophone. Applying such a "Left CPS" transformation would result in the following grammar: on grammophone. Here, the unnecessary intermediate O could be removed and the grammar became more compact. Therefore, such a Left-CPS transformation can make grammars more amenable to an optimization that removes unit productions. Also, applying "Left CPS" and inlining the intermediate O can lead to fewer conflicts. (TODO find a small repro grammar for this observation)
Left-CPS can only be applied to lists if they are left recursive, and Right-CPS only to right-recursive lists. (Why? Because the recursion of a list is on the left if it is left recursive and on the right if it is right recursive.)
See: http://www.cs.man.ac.uk/~pjj/complang/heuristics.html
That page contains a bunch of equalities that can be used to rewrite grammars. I'm wondering if it would be possible to find more by, e.g., formalizing CFGs (ECFGs?) and their algebraic rules as an algebra and running that through, e.g., the Knuth-Bendix completion procedure (and maybe we can even find a confluent term rewriting system)?
Some transformations are not obvious to me, especially the self referential ones. Can an axiomatic system be defined to make their correctness more obvious by deriving them from basic algebraic manipulations?
(A* # B)*
= (B* # A)*
= (A+|B+)*
= (A|B)*
=> made initial grammar SLR(1)
Here's another example of a grammar (a* a) that can benefit from the CPS transformation.
Before: on grammophone Not LR(1)
After: on grammophone SLR(1)
Maybe this observation could be translated into a CPS-eligibility criterion: let the grammar be X* Y; if any symbol of first(Y) is in first(X), then the CPS transformation should be applied?
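A sketch of that criterion, built on a standard FIRST-set computation. The grammar representation and the exact reading of "first(Y) overlaps first(X)" are my own assumptions:

```python
def first_sets(grammar):
    """Standard FIRST computation. grammar maps nonterminal -> list of
    right-hand sides (lists of symbols); anything not in the dict is a
    terminal. Epsilon is represented by the empty string ""."""
    first = {nt: set() for nt in grammar}
    changed = True
    while changed:
        changed = False
        for nt, rhss in grammar.items():
            for rhs in rhss:
                add = set()
                all_nullable = True
                for sym in rhs:
                    if sym in grammar:
                        add |= first[sym] - {""}
                        if "" not in first[sym]:
                            all_nullable = False
                            break
                    else:
                        add.add(sym)
                        all_nullable = False
                        break
                if all_nullable:
                    add.add("")
                if not add <= first[nt]:
                    first[nt] |= add
                    changed = True
    return first

def cps_candidate(grammar, x, y):
    """My reading of the criterion above, for a shape like X* Y:
    suggest the CPS transformation when FIRST(Y) overlaps FIRST(X).
    x and y may each be a nonterminal or a terminal."""
    first = first_sets(grammar)
    fx = first[x] if x in grammar else {x}
    fy = first[y] if y in grammar else {y}
    return bool((fx - {""}) & (fy - {""}))
```

For the a* a example: with X deriving a and Y = a, FIRST(Y) overlaps FIRST(X), so the criterion would flag the grammar as CPS-eligible.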
GFGs (https://github.com/mdaines/grammophone/issues/27) introduce the idea of applying a traditional program optimization technique known as procedure cloning to CFGs.
Consider the following grammar (for this regular language: (m|n)*mf(m|n)*n): on grammophone. It is not LR(1). However, by combining procedure cloning and CPS we can make it SLR(1): on grammophone. We clone the flattened (m|n)* list L into L1 and L2 and replace each use of L with a unique list. This allows us to apply CPS without losing any information related to semantic actions.
We can make the grammar SLR(1) by using a left-recursive list: on grammophone. However, if we add a level of indirection to the left-recursive list, the grammar loses its SLR(1) property and is not even LR(1): on grammophone. The right-recursive version is not susceptible to this level-of-indirection-induced inadequacy: on grammophone. The topic of unit rule elimination might be relevant here: https://github.com/mdaines/grammophone/issues/26#issuecomment-1497843024
Terence Parr and Russell Quong discuss the need for LL(k) & LR(k) where k > 1 in
"LL and LR Translators Need k > 1 Lookahead".
Example 1: We illustrate the advantage of an LL(2) grammar over an LL(1) grammar for recognizing a fragment of the C language. Consider distinguishing between C labels "ID :" and C assignment statements "ID = ...", where ID represents an identifier token. In the following grammar fragment, rule stat requires two lookahead symbols and is easily expressed with an LL(2) grammar. This common language feature is hard to express with an LL(1) grammar because the ID token is matched in many different grammar locations. Left-factoring the common ID prefix in stat and expr would result in a much less natural grammar.
Although manual left-factoring can sometimes solve the problem, it is not ideal for two reasons. First, even if left-factoring yields a reasonable grammar, an LL(k) grammar for k > 1 is simply more convenient, which tends to reduce development time. Second, while left-factoring might be theoretically possible, it can be practically implausible, as in this example, where expr occurs throughout the grammar.
The next example, example 2, discusses the issue from the following comment in this issue https://github.com/mdaines/grammophone/issues/26#issuecomment-1777186742.
The other examples don't appear to add anything new to this issue. A lot of that article discusses how actions can reduce the expressivity of LR parsers to that of an LL parser, but that's a big can of worms that I'm not going to try to investigate in this issue.
The CPS paper appears to only consider basic EBNF operations such as * and + in its flattening scheme and CPS transformation.
We can go further. One might need multiple base cases (such as when we want to parse lists with interspersed commas and optional trailing commas, see: https://github.com/mdaines/grammophone/issues/26#issuecomment-1817510191). The paper only considers flattenings with one base case. We can extend CPS to flattening procedures that introduce multiple base cases. I'm going to call this Extended CPS (ECPS).
Here's an example of a basic ECPS application with multiple base cases:
That simple grammar is a snippet from the Dart grammar that describes a reduced version of function expressions and record patterns.
Notice how the version without CPS is not LR(1), but the version with CPS is LR(0).
The CPS paper describes a very basic form of detecting which nonterminals are eligible for CPS. It can only be applied to EBNF symbols that were flattened before. I think we can do better. A better CPS transformation should not have to depend on the flattening scheme and should be able to find CPS-eligible symbols by examining the structure of the CFG, for example, by looking for cycles in it. (Maybe it's enough to consider SCCs (via Tarjan) that consist of a single fundamental cycle (via Johnson) with a single entry point to determine CPS eligibility; if there are multiple entry points, procedure cloning could help. If the recursion tails on the right, it should be eligible for left-CPS, and if it tails on the left, it should be eligible for right-CPS.) (Also, see: https://github.com/jzimmerman/langcc/issues/48#issue-1826975806 for the same idea but for optionals)
This link contains a tutorial from Cornell on how to turn a Java grammar LALR(1). There might be something interesting there.
TODO implement the examples in grammophone
This link contains a tutorial from one of the only LL(k) parser generators (SLK) on LL(k). I'm not focusing on LL parsing, and most of the stuff here is covering LR-parsing, but that page is very detailed so it deserves to be mentioned.
TODO implement the examples in grammophone
Consider: on grammophone
S -> A | B.
A -> B1 a.
B -> B2 b.
B1 -> B1 b | b.
B2 -> B2 b | b.
(From: A Methodology for Removing LALR(k) Conflicts, right after Figure 1)
This is not LALR(k) for any k because unbounded lookahead is needed to resolve whether to reduce B1 or B2. However, the language itself is regular.
I don't know of any strategies for dealing with this. It would be interesting to find a way for detecting conflicts of such kind and dealing with them.
Here's a simple grammar that is LALR(1) but not SLR(1).
S -> a A c | a B d | B c.
A -> z.
B -> z.
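One way to see why this is LALR(1) but not SLR(1): after reading `a z`, the parser must choose between reducing A -> z and B -> z. SLR(1) consults the global FOLLOW sets, which overlap on c, while LALR(1) uses per-state lookaheads ({c} for A, {d} for B in that state), which are disjoint. A sketch of the FOLLOW computation (my own code, assuming no nullable nonterminals, which holds for this grammar):

```python
def follow_sets(grammar, start):
    """FOLLOW computation for a grammar given as a dict from
    nonterminal to lists of right-hand sides (lists of symbols).
    Symbols not in the dict are terminals; '$' marks end of input.
    Assumes no nullable nonterminals (true for the grammar above)."""
    # FIRST sets, simplified for non-nullable grammars: the FIRST of a
    # rule is the FIRST of its leading symbol.
    first = {nt: set() for nt in grammar}
    changed = True
    while changed:
        changed = False
        for nt, rhss in grammar.items():
            for rhs in rhss:
                if not rhs:
                    continue
                lead = rhs[0]
                add = first[lead] if lead in grammar else {lead}
                if not add <= first[nt]:
                    first[nt] |= add
                    changed = True
    follow = {nt: set() for nt in grammar}
    follow[start].add("$")
    changed = True
    while changed:
        changed = False
        for nt, rhss in grammar.items():
            for rhs in rhss:
                for i, sym in enumerate(rhs):
                    if sym not in grammar:
                        continue
                    if i + 1 < len(rhs):
                        nxt = rhs[i + 1]
                        add = first[nxt] if nxt in grammar else {nxt}
                    else:
                        add = follow[nt]  # sym is rightmost: inherit FOLLOW(nt)
                    if not add <= follow[sym]:
                        follow[sym] |= add
                        changed = True
    return follow
```

On the grammar above this yields FOLLOW(A) = {c} and FOLLOW(B) = {c, d}; the overlap on c is exactly the SLR(1) reduce/reduce conflict.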
Hello @mdaines,
(I hope you don't mind another issue)
I think it would be awesome if there was a way to load sample grammars (and perhaps make it easy to submit them via PRs?). I'm thinking of something like the "Load Sample" feature in edotor, but with a short description for why the grammar is considered to be interesting.
"Mysterious conflict"
Here's one grammar that I think is interesting (source):
on grammophone
It demonstrates a grammar that is LR(1), but not LALR(1). The conflict is known as a "mysterious conflict".