masak opened 5 years ago
Missed a biggie.
```
statementlist → statement:block → block → statementlist
```

This means that the state graph alone either can't be reified in its entirety (because it's effectively infinite), or that it's not enough to describe the state of the parse.

Not sure how to address this one. We're not getting rid of the recursion, since that's part of the language. In a way it's good that 001 is already complex enough to have recursion; otherwise we'd just have hit it later and been thrown off by it then.
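For concreteness, the cycle can be found mechanically from a rule-reference graph. A sketch (the graph encoding is mine, not anything from the codebase):

```python
# Hypothetical rule-reference graph for the fragment above:
# each rule maps to the rules its body references.
RULES = {
    "statementlist": ["statement:block"],
    "statement:block": ["block"],
    "block": ["statementlist"],
}

def find_cycle(rules, start):
    """DFS from `start`; return the first cycle found as a list of rules."""
    path, seen = [], set()

    def dfs(rule):
        if rule in path:
            return path[path.index(rule):] + [rule]
        if rule in seen:
            return None
        seen.add(rule)
        path.append(rule)
        for callee in rules.get(rule, []):
            cycle = dfs(callee)
            if cycle:
                return cycle
        path.pop()
        return None

    return dfs(start)

cycle = find_cycle(RULES, "statementlist")
# A cycle means naive state-graph expansion would never terminate.
```

Any expansion strategy that inlines rules into states hits this cycle and diverges, which is the "effectively infinite" problem.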
I have two thoughts. One is that the parser handover will need to factor into all this somehow. Not saying those two things will interfere, but they'll need to work together.
Secondly, maybe we'll end up using the state graph not for all parsing, but just for LTM parsing. Something — a hope, maybe — tells me recursion doesn't become a problem if we limit the states to handle declarative prefixes. The rest is behind a "procedural event horizon", and weird things like recursion are allowed to happen there, outside of the pure NFA world.
001 requires a semicolon after a non-last blockish statement.
```
my x = 5;
if x {
    x = 0;
};
x;
```
As the grammar is specified, that `;` after the `}` is mandatory.

I think I'm fine with that. Technically, 001 is still a strict subset of 007 (in the sense of "it allows fewer productions, but not more"). Being allowed to skip that `;` feels like a "nice extra" — it's not in the charter of 001 to be ergonomic in that way. To be honest I'd rather it stay simple.
> Secondly, maybe we'll end up using the state graph not for all parsing, but just for LTM parsing.
Coming back to this; this shouldn't have been a big reveal, actually. If NFAs were as powerful as grammars, then grammars wouldn't have to make the declarative/procedural distinction in the first place.
Specifically, NFAs (and regular expressions) famously can't parse arbitrarily nested languages; even the stupidly simple 001 is one, as are 007 and Perl 6, and basically any language with a concept of blocks in it.
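A quick way to convince oneself of this: recognizing properly nested blocks takes unbounded memory (a counter or stack), which no fixed set of NFA states can supply. A sketch, not 001's actual parser:

```python
def balanced(src: str) -> bool:
    """Check that '{' and '}' nest properly; needs an unbounded counter."""
    depth = 0
    for ch in src:
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth < 0:
                return False  # closing brace with no opener
    return depth == 0

# Any NFA with N states can be fooled by nesting deeper than N,
# so the procedural side of the parser has to own this part.
```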
But, keeping our eyes on the ball: the winning condition here isn't "NFAs parse the entire language", it's "NFAs do dispatch into (procedural) rules" plus "NFAs can be extended during the parse". Having said that, I still don't feel I have a handle on exactly how that dispatch works, and how the interaction between parsers and the rules that want to extend them should work.
As I was thinking about this while falling asleep yesterday in a jet-lagged haze: there are two parsers, really. The declarative one and the procedural one.
```
statementlist → statement:block → block → statementlist
```

Forgot to mention: work is now underway in a branch. Current status: almost at the point of hitting the limits of the declarative parser.
I'm toying with the idea of maybe being able to create the NFA for the declarative parts from the s-expressions of the procedural parts. Execution paths of the world, unite!
Here, I tried to capture the thing with the NFA dispatch and the entry points in a set of diagrams. They are messy and probably mostly wrong, but here goes.
Preliminary conclusions:

* `expr` and `term` have the same entry point, since the `expr` rule unconditionally starts with `<term>`.
* Almost everything ends up in `TOP`. (The lower right diagram tries to depict the copying.)
* `statementlist` is special. Not just because the final `;` is optional, but because it matters whether the `statementlist` was called at the "top level" of the program, or inside of a block. In the former case, end-of-file is a valid thing to encounter but `}` is not; in the latter case, it's vice versa. In retrospect, this shouldn't be super-surprising, since we have two calls to `<statementlist>` in the grammar, and these happen in different "environments". But in some sense it means we need to have two slightly different copies of the NFA graph for `statementlist`.
* `expr` and `block` are recursive, and NFA dispatches can happen inside of them, but there's no sensible way to graph that, because the possibilities are literally countably infinite. The closest I got to showing this was to use slightly rounder nodes for "this is where we give control to the procedural parser".

The investigation continues.
Trying to figure out how to compute those entry points. Here's me doing it manually:
* `TOP` / `statementlist` / `statement` is one
* `statementlist` / `statement` from inside `block` (different stopper)
* `expr` / `term`
* `op`
Ok, so I think that the rule is this: (a) something else is calling the rule (another rule, or in the case of `TOP`, the parser itself); (b) after flattening, the "first thing" that happens is an alternation (including multi tokens) requiring dispatch — for example, `statement` is a proto with multis, and it gets flattened into `statementlist`, which is flattened into `TOP`; (c) all other things equal, two different calls result in two different entry points.
There are also three "variants" of the `expr` rule:

* `statement:expr` outside of any block (has eof as a stopper)
* `statement:expr` inside of a block (has `}` as a stopper)
* `xblock` (has `{` as a stopper)

I don't feel I'm very good at describing what's going on here, so I content myself with feeling parts of the elephant one at a time.
It feels to me one should be able to build a "continuation graph" from a grammar; for each rule, which rules can it dispatch into. Note that the distinction of "calling down" vs "returning up" has been erased from this graph — it's something that the procedural/stateful part of the parser cares about, but not the declarative/dispatching part.
Here's my attempt to compute this manually for 001:

* `TOP`/`statementlist`/`statement` → `expr`, `block`, `statement`, (eof)
* `TOP`/`statementlist`/`statement` (in `block`) → `expr`, `block`, `statement`
* `xblock` → `expr`, `block`, `statement`, (eof)
* `xblock` (in `block`) → `expr`, `block`, `statement`
* `expr`/`term` → `op`, `statement`, (eof)
* `expr`/`term` (in `block`) → `op`, `statement`
* `expr`/`term` (in `xblock`) → `op`, `block`
* `identifier`: identical to `expr`/`term`, even though it doesn't include `term:int`
* `op` → `term`
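The manual table above looks mechanizable as a FOLLOW-style fixed-point computation. Here's a sketch over a hand-simplified 001 (my own simplification; names not in the dict count as terminals). Note it computes just one continuation set per symbol, so it merges the top-level and "(in block)" rows, which is exactly the per-context splitting problem:

```python
# Toy version of 001: each rule maps to a list of alternative bodies.
GRAMMAR = {
    "TOP": [["statementlist"]],
    "statementlist": [["statement", ";", "statementlist"], ["statement"]],
    "statement": [["expr"], ["block"]],
    "block": [["{", "statementlist", "}"]],
    "expr": [["term"], ["term", "op", "expr"]],
}

def continuations(grammar, start="TOP"):
    """For each symbol, which symbols can come next once it finishes:
    like FOLLOW sets, but keeping the symbol names themselves, with
    "(eof)" for falling all the way out of `start`. The "calling down"
    vs. "returning up" distinction is erased, as in the graph above."""
    cont = {sym: set() for bodies in grammar.values()
            for body in bodies for sym in body}
    cont.setdefault(start, set()).add("(eof)")
    changed = True
    while changed:  # iterate to a fixed point
        changed = False
        for rule, bodies in grammar.items():
            for body in bodies:
                for i, sym in enumerate(body):
                    # mid-body: the next symbol; at the end of a body:
                    # whatever can follow the enclosing rule
                    new = (set(body[i + 1:i + 2]) if i + 1 < len(body)
                           else cont.get(rule, set()))
                    if not new <= cont[sym]:
                        cont[sym] |= new
                        changed = True
    return cont

cont = continuations(GRAMMAR)
```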
Mentally, the algorithm feels a bit like this: `(eof)` is a possible continuation when we fall all the way out of `TOP`.

Yes, I think this can be mechanized. One thing I'm not sure of is whether it would also make sense (or be necessary, even) to split the rules up...
(getting off the train, thinking of it some more)
...yes, we need to do that. It feels quite similar to CPS-transforming a bit of `async`/`await` code. The result will be quite a lot like my doodle above, but with more intermediate states. It's like we're breaking up each rule (multiplied with each context) into "basic blocks": small rule-fragments that eventually end up in an NFA dispatch that leads to another rule-fragment.
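That breaking-up step can itself be sketched mechanically. A toy (the encoding and the DISPATCH marker are mine):

```python
# Split a rule body into "basic block"-like fragments at every point
# where an NFA dispatch is needed (marked here with a sentinel).
DISPATCH = "<dispatch>"

def fragment(body):
    """Break one rule body into fragments; each fragment ends either
    at a dispatch point or at the end of the rule."""
    fragments, current = [], []
    for step in body:
        current.append(step)
        if step == DISPATCH:
            fragments.append(current)
            current = []
    if current:
        fragments.append(current)
    return fragments

# E.g. a block, roughly: '{', dispatch into statementlist, '}'.
frags = fragment(["'{'", DISPATCH, "'}'"])
```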
I think it'd be possible to put together a small proof-of-concept of this algorithm. Like a small mini-spike within the spike.
There's some very neat, strange consistency going on between multi-method dispatch (figuring out which method to call based on the arguments sent in) and LTM (figuring out which subrule to end up in based on the part of the text currently being parsed).
Also, we could lean on this consistency by (for example) handling the `@Override` story similarly for both functions and `@parsed` macros.
The purpose here is to build an MVP which can demonstrate the act of `is parsed` switching parsers. It still feels like #293 or even #177 are really big steps to take — this is meant to be a smaller (if not small) step.

Think of a parser as a `Parser<T>` with the sole method `parse(Str): T`. The grammar is (informally) a specification of what the parser accepts, and what result it builds.

I suggest we name this spike "Trifle", because for once, this spike is to be trifled with.
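Just to pin the `Parser<T>` contract down, here it is in Python terms (the spike itself is in 007/Perl 6; the names below are mine):

```python
from typing import Protocol, TypeVar

T = TypeVar("T", covariant=True)

class Parser(Protocol[T]):
    """A parser is anything with the sole method parse(Str): T."""
    def parse(self, source: str) -> T: ...

class IntParser:
    """Trivial instance: accepts a non-negative integer literal.
    A failed parse simply throws a very boring exception."""
    def parse(self, source: str) -> int:
        if not source.isdigit():
            raise ValueError("parse error")
        return int(source)

result = IntParser().parse("42")
```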
Sublanguage
Let's think up a sublanguage of 007; call it 001. It has (non-negative) integers, identifiers, `my` declarations, `infix:<+>`, `infix:<=>`, expression statements, block statements, `if` statements, and `while` statements (but no block parameters). Just large enough to be interesting to extend.

Here's a (Perl 6) grammar describing 001's syntax:
(Cute, huh? Perl 6 grammars really shine here. I haven't tested this, so it's probably correct modulo whitespace.)
Although not shown here, it's implied that the parse also gives back a Qtree; same one 007 would. Error handling is not our focus here; a parse that fails simply throws a very boring exception.
Where it gets interesting
Oh, and import statements are also supported in 001. (To 007's slight annoyance.) The syntax allowed is `import * from x.y.z;` — if the `x.y.z` is `syntax.stmt.until`, then an `until` statement is injected. (If it's anything else, it's a syntax error.)

Importing `syntax.stmt.until` installs a new statement type `statement:until`, analogously to `statement:while`. This is where a parser handover happens, to a new parser with this new grammar rule. The new parser stays in control until block end, at which point it hands control back to the original parser.

(If this works out well, could also try the same, in different orders, with `syntax.stmt.unless`.)

NFAs
This next section assumes familiarity with the ideas in the regexp1 article; specifically NFAs.
The big job is to have a parser running in 007 that will (a) parse 001 and (b) be extensible. I'd rather generate this 007 code than write it by hand — and besides, there needs to be something that's dynamic enough in the code itself for (b).
So we need to generate an NFA. Does the grammar above easily translate to an NFA? Let's go through things feature by feature:
* **Quantifiers (`+` and `*`).** Also covered by the article. These are just some arrows back.

But then there are some Perl 6-specific tricks that we need to think about separately:

* **Separating quantifiers (`+%` and `*%`).** Not so bad. For example, `<statement>* % ';'` is just sugar for `[<statement> [';' <statement>]*]?`. Similarly for `+%`. (*Edit*: Actually, it's `*%%`, allowing for an extra `;` after the last statement. The desugar for that would be `[<statement> [';' <statement>]* ';'?]?`.)
* You simply can't know if you have an `if` statement if you've only seen `if` in your input; the next character can be a 0x20 space (`if` statement) or another `f` (identifier). But as soon as the NFA sees that next character, it knows to pick between `if` statement and identifier. So we're good.
* `::` discards other alternative rules that could have matched. I'll be honest and say I'll need to implement this in order to get a feel for it. In the best case, we'll just nuke all the other "current states" except for the present one.
* `'{' ~ '}' <statementlist>` is just syntactic sugar for `'{' <statementlist> '}'`, at least when we disregard better error messages and such luxuries.
* Side effects are needed (a) when `my` registers a new variable, (b) when we enter and exit blocks, and (c) when we import `syntax.stmt.unless`.

So it looks like we're fine expressing all this as an NFA. The side effects are bending the rules a bit; specifically, it requires that we're no longer in the declarative prefix (and so it's a kind of implicit `::`) to have a side effect.
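The `if`-vs-identifier decision described above can be mimicked with plain longest-token matching over the declarative prefixes. A toy sketch (the candidate set and all names are mine):

```python
import re

# Candidate declarative prefixes: the `if` keyword vs. an identifier.
# The keyword pattern refuses to match when an identifier character
# follows, mirroring "the next character decides".
CANDIDATES = {
    "statement:if": re.compile(r"if(?![A-Za-z0-9_])"),
    "identifier": re.compile(r"[A-Za-z_][A-Za-z0-9_]*"),
}

def ltm_dispatch(source: str) -> str:
    """Pick the alternative whose declarative prefix matches longest;
    ties go to the alternative listed first (the keyword)."""
    best, best_len = None, -1
    for name, pattern in CANDIDATES.items():
        m = pattern.match(source)
        if m and len(m.group()) > best_len:
            best, best_len = name, len(m.group())
    if best is None:
        raise SyntaxError("no alternative matched")
    return best

# "if x" dispatches to the if statement; "iffy" to identifier. The
# decision falls out as soon as the character after "if" is seen.
```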