masak / alma

ALgoloid with MAcros -- a language with Algol-family syntax where macros take center stage

Implement a spike solution for `is parsed` #485

Open masak opened 5 years ago

masak commented 5 years ago

The purpose here is to build an MVP that can demonstrate the act of `is parsed` switching parsers. #293 or even #177 still feel like really big steps to take; this is meant to be a smaller (if not small) step.

I suggest we name this spike "Trifle", because for once, this spike is to be trifled with.

Sublanguage

Let's think up a sublanguage of 007; call it 001. It has (non-negative) integers, identifiers, my declarations, infix +, infix =, expression statements, block statements, if statements, and while statements (but no block parameters). Just large enough to be interesting to extend.

Here's a (Perl 6) grammar describing 001's syntax:

grammar {
    rule TOP { <statementlist> }
    rule statementlist { <statement>* %% ';' }

    rule statement:expr { <expr> }
    rule statement:block { <block> }
    rule statement:if { 'if'» :: <xblock> }
    rule statement:while { 'while'» :: <xblock> }

    rule xblock { <expr> <block> }
    rule block { '{' ~ '}' <statementlist> }
    rule expr { <term>+ % <op> }

    rule term:int { \d+ }
    rule term:identifier { <identifier> }
    rule term:my { 'my'» :: <identifier> }
    rule identifier { <[a..zA..Z]>\w* }

    rule op:add { '+' }
    rule op:assign { '=' }
}

(Cute, huh? Perl 6 grammars really shine here. I haven't tested this, so it's probably correct modulo whitespace.)

Although not shown here, it's implied that the parse also gives back a Qtree; same one 007 would. Error handling is not our focus here; a parse that fails simply throws a very boring exception.
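
For concreteness, here's a small made-up program that exercises every 001 construct: integers, identifiers, my, infix + and =, expression statements, block statements, if, and while.

my x = 3;
my y = 4;
my sum = x + y;
{
    my temp = 0;
    temp = temp + x;
};
if sum {
    sum = sum + 1;
};
while 0 {
    sum = 0;
};
sum;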

Where it gets interesting

Oh, and import statements are also supported in 001. (To 007's slight annoyance.) The syntax allowed is import * from x.y.z; — if the x.y.z is syntax.stmt.until, then an until statement is injected. (If it's anything else, it's a syntax error.)

Importing syntax.stmt.until installs a new statement type statement:until, analogously to statement:while. This is where a parser handover happens, to a new parser with this new grammar rule. The new parser stays in control until block end, at which point it hands control back to the original parser.
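
In grammar terms, the handed-over parser is the original one plus exactly one rule, directly analogous to statement:while. As a sketch:

rule statement:until { 'until'» :: <xblock> }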

(If this works out well, could also try the same, in different orders, with syntax.stmt.unless.)

NFAs

This next section assumes familiarity with the ideas in the regexp1 article; specifically NFAs.

The big job is to have a parser running in 007 that will (a) parse 001 and (b) be extensible. I'd rather generate this 007 code than write it by hand — and besides, there needs to be something that's dynamic enough in the code itself for (b).

So we need to generate an NFA. Does the grammar above easily translate to an NFA? Let's go through things feature by feature:

But then there are some Perl 6-specific tricks that we need to think about separately:

So it looks like we're fine expressing all this as an NFA. The side effects bend the rules a bit; specifically, they require that we're no longer in the declarative prefix (so having a side effect acts as a kind of implicit ::).

masak commented 5 years ago

Missed a biggie.

Not sure how to address this one. We're not getting rid of the recursion, since that's part of the language. In a way it's good that 001 is already complex enough to have recursion; otherwise we'd just have hit it later and been thrown off by it then.

I have two thoughts. One is that the parser handover will need to factor into all this somehow. Not saying those two things will interfere, but they'll need to work together.

Secondly, maybe we'll end up using the state graph not for all parsing, but just for LTM parsing. Something — a hope, maybe — tells me recursion doesn't become a problem if we limit the states to handle declarative prefixes. The rest is behind a "procedural event horizon", and weird things like recursion are allowed to happen there, outside of the pure NFA world.

masak commented 5 years ago

001 requires a semicolon after a non-last blockish statement.

my x = 5;
if x {
    x = 0;
};
x;

As the grammar is specified, that ; after the } is mandatory.

I think I'm fine with that. Technically, 001 is still a strict subset of 007 (in the sense of "it allows fewer productions, but not more"). Being allowed to skip that ; feels like a "nice extra" — it's not in the charter of 001 to be ergonomic in that way. To be honest I'd rather it stay simple.

masak commented 5 years ago

Secondly, maybe we'll end up using the state graph not for all parsing, but just for LTM parsing.

Coming back to this: it shouldn't have been a big reveal, actually. If NFAs were as powerful as grammars, then grammars wouldn't have to make the declarative/procedural distinction in the first place.

Specifically, NFAs (and regular expressions) famously can't parse arbitrarily nested languages, of which even the stupidly simple 001 is one, as are certainly 007 and Perl 6, and basically any language with the concept of blocks in it.

But, keeping our eyes on the ball: the winning condition here isn't "NFAs parse the entire language", it's "NFAs do dispatch into (procedural) rules" plus "NFAs can be extended during the parse". Having said that, I still don't feel I have a handle on exactly how that dispatch works, and how the interaction between parsers and the rules that want to extend them should work.
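
Here's a minimal Perl 6 sketch of just those two properties (not code from the branch; the names add-candidate and ltm-dispatch are made up): an NFA over the literal declarative prefixes, a dispatch that runs all NFA paths in parallel and keeps the longest accepted candidate, and extension-during-parse as nothing more than another add-candidate call. Word boundaries and character classes are left out to keep it small.

# States are integers; state 0 is the shared start state.
my @edges;          # @edges[$state] = list of pairs: character => target state
my %accepting;      # accepting state => candidate name
my $next-state = 1;

sub add-candidate(Str $prefix, Str $name) {
    my $state = 0;
    for $prefix.comb -> $char {
        my $to = $next-state++;
        @edges[$state].push: $char => $to;
        $state = $to;
    }
    %accepting{$state} = $name;
}

add-candidate('if',    'statement:if');
add-candidate('while', 'statement:while');

# Run all NFA paths in parallel; the last accepting state seen wins,
# which is exactly the longest-token-matching candidate.
sub ltm-dispatch(Str $text) {
    my @current = 0;
    my $winner;
    for $text.comb -> $char {
        my @next;
        for @current -> $state {
            for @(@edges[$state] // []) -> $edge {
                @next.push: $edge.value if $edge.key eq $char;
            }
        }
        last unless @next;
        @current = @next;
        $winner = %accepting{$_} // $winner for @current;
    }
    return $winner;
}

say ltm-dispatch('while x { }');    # statement:while

# "Extended during the parse" is then just more data:
add-candidate('until', 'statement:until');
say ltm-dispatch('until x { }');    # statement:until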

masak commented 5 years ago

As I was thinking about this while falling asleep yesterday in a jet-lagged haze: there are two parsers, really. The declarative one and the procedural one.

masak commented 5 years ago

Forgot to mention: work is now underway in a branch. Current status: almost at the point of hitting the limits of the declarative parser.

I'm toying with the idea of maybe being able to create the NFA for the declarative parts from the s-expressions of the procedural parts. Execution paths of the world, unite!
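
As a sketch of that idea, with a made-up s-expression encoding (not the branch's actual representation): walk a rule body in order, keep what the NFA can express, and stop at the first procedural thing.

# Hypothetical encoding: a rule body is a list of pairs; the body of
# statement:while reduces to lit => 'while', cut => True, call => 'xblock'
# (the » word boundary is left out to keep the sketch small).
sub declarative-prefix(@body) {
    my @prefix;
    for @body -> $node {
        if $node.key eq 'lit' {
            @prefix.push: $node.value;   # literal text: the NFA can express this
        }
        else {
            last;   # '::', subrule calls, side effects: the procedural
                    # event horizon starts here
        }
    }
    return @prefix.join;
}

my @while-body = (lit => 'while'), (cut => True), (call => 'xblock');
say declarative-prefix(@while-body);     # while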

masak commented 5 years ago

Here, I tried to capture the thing with the NFA dispatch and the entry points in a set of diagrams. They are messy and probably mostly wrong, but here goes.

(photo of hand-drawn diagrams of the NFA dispatch and entry points)

Preliminary conclusions:

The investigation continues.

masak commented 5 years ago

Trying to figure out how to compute those entry points. Here's me doing it manually:

Ok, so I think that the rule is this: (a) something else is calling the rule (another rule, or in the case of TOP, the parser itself), (b) after flattening, the "first thing" that happens is an alternation (including multi tokens) requiring dispatch — for example, statement is a proto with multis, and it gets flattened into statementlist which is flattened into TOP; (c) all other things equal, two different calls result in two different entry points.
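
Here's one way that rule might be mechanized, as a sketch (one reading of (a) through (c), with the "first thing" relation written out by hand from the 001 grammar): from a call site, follow "first thing" positions until hitting a proto with several candidates; distinct call sites then give distinct entry points.

my %first-thing =                # rule => the subrule in "first thing" position
    TOP              => 'statementlist',
    statementlist    => 'statement',
    'statement:expr' => 'expr',
    xblock           => 'expr',
    block            => 'statementlist',
    expr             => 'term',
;
my %is-proto = statement => True, term => True, op => True;

sub entry-point(Str $caller, Str $callee) {
    my $rule = $callee;
    $rule = %first-thing{$rule} while %first-thing{$rule}.defined && !%is-proto{$rule};
    return %is-proto{$rule} ?? "$caller dispatches into the $rule candidates" !! Nil;
}

say entry-point('(the parser itself)', 'TOP');    # ... dispatches into the statement candidates
say entry-point('statement:if', 'xblock');        # ... dispatches into the term candidates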

masak commented 5 years ago

There are also three "variants" of the expr rule:

I don't feel I'm very good at describing what's going on here, so I content myself with feeling parts of the elephant one at a time.

masak commented 5 years ago

It feels to me that one should be able to build a "continuation graph" from a grammar: for each rule, which rules it can dispatch into. Note that the distinction between "calling down" and "returning up" has been erased from this graph; it's something that the procedural/stateful part of the parser cares about, but not the declarative/dispatching part.

Here's my attempt to compute this manually for 001:

Mentally, the algorithm feels a bit like this:

Yes, I think this can be mechanized. One thing I'm not sure of is whether it would also make sense (or be necessary, even) to split the rules up...

(getting off the train, thinking of it some more)

...yes, we need to do that. It feels quite similar to CPS-transforming a bit of async/await code. The result will be quite a lot like my doodle above, but with more intermediate states. It's like we're breaking up each rule (multiplied with each context) into "basic blocks", small rule-fragments that eventually end up in an NFA dispatch that leads to another rule-fragment.
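
To make the "basic blocks" picture a bit more concrete, here's a guess (made-up names, not anything from the branch) at how statement:if might fragment: each fragment is the procedural bit between two dispatch points, reduced here to "match this literal, then do that NFA dispatch".

my %fragments =
    'statement:if@start'      => { match => 'if', then-dispatch => 'term' },
    'statement:if@after-expr' => { match => '{',  then-dispatch => 'statement' },
    'statement:if@after-body' => { match => '}',  then-dispatch => Nil },
;

(The expr part hides its own term/op dispatch cycle; those fragments would belong to expr, not to statement:if.)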

masak commented 5 years ago

I think it'd be possible to put together a small proof-of-concept of this algorithm. Like a small mini-spike within the spike.

masak commented 5 years ago

There's some neat, strange consistency going on between multi-method dispatch (figuring out which method to call based on the arguments sent in) and LTM (figuring out which subrule to end up in based on the part of the text currently being parsed).

Also, we could lean on this consistency by (for example) handling the @Override story similarly for both functions and @parsed macros.