masak opened 5 years ago
Missed a biggie.
```
statementlist → statement:block → block → statementlist
```

This means that the state graph alone either can't be reified in its entirety (because it's effectively infinite), or that it's not enough to describe the state of the parse.

Not sure how to address this one. We're not getting rid of the recursion, since that's part of the language. In a way it's good that 001 is already complex enough to have recursion; otherwise we'd just have hit it later and been thrown off by it then.
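For concreteness, the cycle can be found mechanically from a rule-reference graph. A sketch (the graph encoding is mine, not anything from the codebase):

```python
# Hypothetical rule-reference graph for the fragment above:
# each rule maps to the rules its body references.
RULES = {
    "statementlist": ["statement:block"],
    "statement:block": ["block"],
    "block": ["statementlist"],
}

def find_cycle(rules, start):
    """DFS from `start`; return the first cycle found as a list of rules."""
    path, seen = [], set()

    def dfs(rule):
        if rule in path:
            return path[path.index(rule):] + [rule]
        if rule in seen:
            return None
        seen.add(rule)
        path.append(rule)
        for callee in rules.get(rule, []):
            cycle = dfs(callee)
            if cycle:
                return cycle
        path.pop()
        return None

    return dfs(start)

cycle = find_cycle(RULES, "statementlist")
# A cycle means naive state-graph expansion would never terminate.
```

Any expansion strategy that inlines rules into states hits this cycle and diverges, which is the "effectively infinite" problem.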
I have two thoughts. One is that the parser handover will need to factor into all this somehow. Not saying those two things will interfere, but they'll need to work together.
Secondly, maybe we'll end up using the state graph not for all parsing, but just for LTM parsing. Something — a hope, maybe — tells me recursion doesn't become a problem if we limit the states to handle declarative prefixes. The rest is behind a "procedural event horizon", and weird things like recursion are allowed to happen there, outside of the pure NFA world.
001 requires a semicolon after a non-last blockish statement.
```
my x = 5;
if x {
    x = 0;
};
x;
```
As the grammar is specified, that `;` after the `}` is mandatory.

I think I'm fine with that. Technically, 001 is still a strict subset of 007 (in the sense of "it allows fewer productions, but not more"). Being allowed to skip that `;` feels like a "nice extra" — it's not in the charter of 001 to be ergonomic in that way. To be honest I'd rather it stay simple.
> Secondly, maybe we'll end up using the state graph not for all parsing, but just for LTM parsing.
Coming back to this; this shouldn't have been a big reveal, actually. If NFAs were as powerful as grammars, then grammars wouldn't have to make the declarative/procedural distinction in the first place.
Specifically, NFAs (and regular expressions) famously can't parse arbitrarily nested languages; even the stupidly simple 001 is one, as are 007 and Perl 6, and basically any language with a concept of blocks in it.
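A quick way to convince oneself of this: recognizing properly nested blocks takes unbounded memory (a counter or stack), which no fixed set of NFA states can supply. A sketch, not 001's actual parser:

```python
def balanced(src: str) -> bool:
    """Check that '{' and '}' nest properly; needs an unbounded counter."""
    depth = 0
    for ch in src:
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth < 0:
                return False  # closing brace with no opener
    return depth == 0

# Any NFA with N states can be fooled by nesting deeper than N,
# so the procedural side of the parser has to own this part.
```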
But, keeping our eyes on the ball: the winning condition here isn't "NFAs parse the entire language", it's "NFAs do dispatch into (procedural) rules" plus "NFAs can be extended during the parse". Having said that, I still don't feel I have a handle on exactly how that dispatch works, and how the interaction between parsers and the rules that want to extend them should work.
As I was thinking about this while falling asleep yesterday in a jet-lagged haze: there are two parsers, really. The declarative one and the procedural one.
```
statementlist → statement:block → block → statementlist
```

Forgot to mention: work is now underway in a branch. Current status: almost at the point of hitting the limits of the declarative parser.
I'm toying with the idea of maybe being able to create the NFA for the declarative parts from the s-expressions of the procedural parts. Execution paths of the world, unite!
Here, I tried to capture the thing with the NFA dispatch and the entry points in a set of diagrams. They are messy and probably mostly wrong, but here goes.
Preliminary conclusions:

* `expr` and `term` have the same entry point, since the `expr` rule unconditionally starts with `<term>`.
* Almost everything ends up in `TOP`. (The lower right diagram tries to depict the copying.)
* `statementlist` is special. Not just because the final `;` is optional, but because it matters whether the `statementlist` was called at the "top level" of the program, or inside of a block. In the former case, end-of-file is a valid thing to encounter but `}` is not; in the latter case, it's vice versa. In retrospect, this shouldn't be super-surprising, since we have two calls to `<statementlist>` in the grammar, and these happen in different "environments". But in some sense it means we need to have two slightly different copies of the NFA graph for `statementlist`.
* `expr` and `block` are recursive, and NFA dispatches can happen inside of them, but there's no sensible way to graph that, because the possibilities are literally countably infinite. The closest I got to showing this was to use slightly rounder nodes for "this is where we give control to the procedural parser".

The investigation continues.
Trying to figure out how to compute those entry points. Here's me doing it manually:
* `TOP` / `statementlist` / `statement` is one
* `statementlist` / `statement` from inside `block` (different stopper)
* `expr` / `term`
* `op`
Ok, so I think that the rule is this: (a) something else is calling the rule (another rule, or in the case of `TOP`, the parser itself); (b) after flattening, the "first thing" that happens is an alternation (including multi tokens) requiring dispatch — for example, `statement` is a proto with multis, and it gets flattened into `statementlist`, which is flattened into `TOP`; (c) all other things equal, two different calls result in two different entry points.
There are also three "variants" of the `expr` rule:

* `statement:expr` outside of any block (has eof as a stopper)
* `statement:expr` inside of a block (has `}` as a stopper)
* `xblock` (has `{` as a stopper)

I don't feel I'm very good at describing what's going on here, so I content myself with feeling parts of the elephant one at a time.
It feels to me one should be able to build a "continuation graph" from a grammar; for each rule, which rules can it dispatch into. Note that the distinction of "calling down" vs "returning up" has been erased from this graph — it's something that the procedural/stateful part of the parser cares about, but not the declarative/dispatching part.
Here's my attempt to compute this manually for 001:

* `TOP`/`statementlist`/`statement` → `expr`, `block`, `statement`, (eof)
* `TOP`/`statementlist`/`statement` (in `block`) → `expr`, `block`, `statement`
* `xblock` → `expr`, `block`, `statement`, (eof)
* `xblock` (in `block`) → `expr`, `block`, `statement`
* `expr`/`term` → `op`, `statement`, (eof)
* `expr`/`term` (in `block`) → `op`, `statement`
* `expr`/`term` (in `xblock`) → `op`, `block`
* `identifier`: identical to `expr`/`term`, even though it doesn't include `term:int`
* `op` → `term`
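The manual table above looks mechanizable as a FOLLOW-style fixed-point computation. Here's a sketch over a hand-simplified 001 (my own simplification; names not in the dict count as terminals). Note it computes just one continuation set per symbol, so it merges the top-level and "(in block)" rows, which is exactly the per-context splitting problem:

```python
# Toy version of 001: each rule maps to a list of alternative bodies.
GRAMMAR = {
    "TOP": [["statementlist"]],
    "statementlist": [["statement", ";", "statementlist"], ["statement"]],
    "statement": [["expr"], ["block"]],
    "block": [["{", "statementlist", "}"]],
    "expr": [["term"], ["term", "op", "expr"]],
}

def continuations(grammar, start="TOP"):
    """For each symbol, which symbols can come next once it finishes:
    like FOLLOW sets, but keeping the symbol names themselves, with
    "(eof)" for falling all the way out of `start`. The "calling down"
    vs. "returning up" distinction is erased, as in the graph above."""
    cont = {sym: set() for bodies in grammar.values()
            for body in bodies for sym in body}
    cont.setdefault(start, set()).add("(eof)")
    changed = True
    while changed:  # iterate to a fixed point
        changed = False
        for rule, bodies in grammar.items():
            for body in bodies:
                for i, sym in enumerate(body):
                    # mid-body: the next symbol; at the end of a body:
                    # whatever can follow the enclosing rule
                    new = (set(body[i + 1:i + 2]) if i + 1 < len(body)
                           else cont.get(rule, set()))
                    if not new <= cont[sym]:
                        cont[sym] |= new
                        changed = True
    return cont

cont = continuations(GRAMMAR)
```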
Mentally, the algorithm feels a bit like this: `(eof)` is a possible continuation when we fall all the way out of `TOP`.

Yes, I think this can be mechanized. One thing I'm not sure of is whether it would also make sense (or be necessary, even) to split the rules up...
(getting off the train, thinking of it some more)
...yes, we need to do that. It feels quite similar to CPS-transforming a bit of `async`/`await` code. The result will be quite a lot like my doodle above, but with more intermediate states. It's like we're breaking up each rule (multiplied with each context) into "basic blocks": small rule-fragments that eventually end up in an NFA dispatch that leads to another rule-fragment.
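That breaking-up step can itself be sketched mechanically. A toy (the encoding and the DISPATCH marker are mine):

```python
# Split a rule body into "basic block"-like fragments at every point
# where an NFA dispatch is needed (marked here with a sentinel).
DISPATCH = "<dispatch>"

def fragment(body):
    """Break one rule body into fragments; each fragment ends either
    at a dispatch point or at the end of the rule."""
    fragments, current = [], []
    for step in body:
        current.append(step)
        if step == DISPATCH:
            fragments.append(current)
            current = []
    if current:
        fragments.append(current)
    return fragments

# E.g. a block, roughly: '{', dispatch into statementlist, '}'.
frags = fragment(["'{'", DISPATCH, "'}'"])
```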
I think it'd be possible to put together a small proof-of-concept of this algorithm. Like a small mini-spike within the spike.
There's some very neat, strange consistency going on between multi-method dispatch (figuring out which method to call based on the arguments sent in) and LTM (figuring out which subrule to end up in based on the part of the text currently being parsed).
Also, we could lean on this consistency by (for example) handling the `@Override` story similarly for both functions and `@parsed` macros.
The purpose here is to build an MVP which can demonstrate the act of `is parsed` switching parsers. It still feels like #293 or even #177 are really big steps to take — this is meant to be a smaller (if not small) step.

Think of a parser as a `Parser<T>` with the sole method `parse(Str): T`. The grammar is (informally) a specification of what the parser accepts, and what result it builds.

I suggest we name this spike "Trifle", because for once, this spike is to be trifled with.
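Just to pin the `Parser<T>` contract down, here it is in Python terms (the spike itself is in 007/Perl 6; the names below are mine):

```python
from typing import Protocol, TypeVar

T = TypeVar("T", covariant=True)

class Parser(Protocol[T]):
    """A parser is anything with the sole method parse(Str): T."""
    def parse(self, source: str) -> T: ...

class IntParser:
    """Trivial instance: accepts a non-negative integer literal.
    A failed parse simply throws a very boring exception."""
    def parse(self, source: str) -> int:
        if not source.isdigit():
            raise ValueError("parse error")
        return int(source)

result = IntParser().parse("42")
```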
Sublanguage
Let's think up a sublanguage of 007; call it 001. It has (non-negative) integers, identifiers, `my` declarations, `infix:<+>`, `infix:<=>`, expression statements, block statements, `if` statements, and `while` statements (but no block parameters). Just large enough to be interesting to extend.

Here's a (Perl 6) grammar describing 001's syntax:
(Cute, huh? Perl 6 grammars really shine here. I haven't tested this, so it's probably correct modulo whitespace.)
Although not shown here, it's implied that the parse also gives back a Qtree; same one 007 would. Error handling is not our focus here; a parse that fails simply throws a very boring exception.
Where it gets interesting
Oh, and import statements are also supported in 001. (To 007's slight annoyance.) The syntax allowed is `import * from x.y.z;` — if the `x.y.z` is `syntax.stmt.until`, then an `until` statement is injected. (If it's anything else, it's a syntax error.)

Importing `syntax.stmt.until` installs a new statement type `statement:until`, analogously to `statement:while`. This is where a parser handover happens, to a new parser with this new grammar rule. The new parser stays in control until block end, at which point it hands control back to the original parser.

(If this works out well, could also try the same, in different orders, with `syntax.stmt.unless`.)

NFAs
This next section assumes familiarity with the ideas in the regexp1 article; specifically NFAs.
The big job is to have a parser running in 007 that will (a) parse 001 and (b) be extensible. I'd rather generate this 007 code than write it by hand — and besides, there needs to be something that's dynamic enough in the code itself for (b).
So we need to generate an NFA. Does the grammar above easily translate to an NFA? Let's go through things feature by feature:
* **Quantifiers (`+` and `*`).** Also covered by the article. These are just some arrows back.

But then there are some Perl 6-specific tricks that we need to think about separately:

* **Separating quantifiers (`+%` and `*%`).** Not so bad. For example, `<statement>* % ';'` is just sugar for `[<statement> [';' <statement>]*]?`. Similarly for `+%`. (*Edit*: Actually, it's `*%%`, allowing for an extra `;` after the last statement. The desugar for that would be `[<statement> [';' <statement>]* ';'?]?`.)
* You simply can't know if you have an `if` statement if you've only seen `if` in your input; the next character can be a 0x20 space (`if` statement) or another `f` (identifier). But as soon as the NFA sees that next character, it knows to pick between `if` statement and identifier. So we're good.
* `::` discards other alternative rules that could have matched. I'll be honest and say I'll need to implement this in order to get a feel for it. In the best case, we'll just nuke all the other "current states" except for the present one.
* `'{' ~ '}' <statementlist>` is just syntactic sugar for `'{' <statementlist> '}'`, at least when we disregard better error messages and such luxuries.
* Side effects are needed (a) when `my` registers a new variable, (b) when we enter and exit blocks, and (c) when we import `syntax.stmt.unless`.

So it looks like we're fine expressing all this as an NFA. The side effects are bending the rules a bit; specifically, it requires that we're no longer in the declarative prefix (and so it's a kind of implicit `::`) to have a side effect.
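The `if`-vs-identifier decision described above can be mimicked with plain longest-token matching over the declarative prefixes. A toy sketch (the candidate set and all names are mine):

```python
import re

# Candidate declarative prefixes: the `if` keyword vs. an identifier.
# The keyword pattern refuses to match when an identifier character
# follows, mirroring "the next character decides".
CANDIDATES = {
    "statement:if": re.compile(r"if(?![A-Za-z0-9_])"),
    "identifier": re.compile(r"[A-Za-z_][A-Za-z0-9_]*"),
}

def ltm_dispatch(source: str) -> str:
    """Pick the alternative whose declarative prefix matches longest;
    ties go to the alternative listed first (the keyword)."""
    best, best_len = None, -1
    for name, pattern in CANDIDATES.items():
        m = pattern.match(source)
        if m and len(m.group()) > best_len:
            best, best_len = name, len(m.group())
    if best is None:
        raise SyntaxError("no alternative matched")
    return best

# "if x" dispatches to the if statement; "iffy" to identifier. The
# decision falls out as soon as the character after "if" is seen.
```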