jimbaker opened 1 year ago
The specific part of the AST is actually interesting here. What could we actually substitute?
As is, we have
```python
Call(func=Name(id=str(args[6].getvalue()), ctx=Load()),
...
```
but this can be further generalized to an expression, given Python's grammar:

```
Call(expr func, expr* args, keyword* keywords)
```

(per https://docs.python.org/3/library/ast.html#abstract-grammar)
So this "widening" could come up if `args[6].getvalue()` evaluated to an `Expr` (using a recursive marker class) vs, say, a `str`. So for completeness, the runtime (as I'm calling it) should accommodate that case. On the other hand, if the interpolation were just a list of statements, that wouldn't work and should fail at the interpolation.
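To make that concrete, a hedged sketch (the helper name `as_func_expr` is mine, not part of any runtime):

```python
import ast

def as_func_expr(value):
    # The "widening": an interpolation may already be an AST expression
    # (e.g. Attribute, Lambda), which we splice in directly...
    if isinstance(value, ast.expr):
        return value
    # ...or a plain string naming the callee, as in the Name node above.
    if isinstance(value, str):
        return ast.Name(id=value, ctx=ast.Load())
    raise TypeError(f"cannot use {value!r} as a callable expression")
```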
Python's syntax is complicated. But in general, we should be able to compose in this fashion.
Somewhat related: what libraries can be used to turn a modified AST into Python source code? (The roundtripping problem.) This would be useful for caching and debugging.
I found https://github.com/berkerpeksag/astor, which references a code-generation library (codegen.py, https://github.com/berkerpeksag/astor#id2) that might be used in Jinja. It's unclear which versions of Python astor supports, but the repo has been updated within the last year.
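For illustration, a minimal round trip; on Python 3.9+ the stdlib's `ast.unparse` covers this, and `astor.to_source(tree)` is astor's equivalent:

```python
import ast

tree = ast.parse("x = 1 + 2")
# Modify the AST: replace the right-hand side with a constant.
tree.body[0].value = ast.Constant(value=3)

print(ast.unparse(tree))   # x = 3
# With astor instead: print(astor.to_source(tree))
```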
Multiple good points in here.
First, the whole thing smells a little like AoT (ahead-of-time): doing extra work at parse time. From a jargon perspective, am I wrong?
The part about "emit to some sort of mini VM written in Rust or C" is really fascinating. Some parts might be implemented as SQL query operations too. (And those might use extension functions in Rust.)
The roundtrip problem is one I haven't understood as well. Particularly: meaningful tracebacks. I had presumed the answer would be either:
You're right, a roundtrip might help. But how would you get a traceback that correlated the line back to the original?
> But how would you get a traceback that correlated the line back to the original?
Good question. `co_filename` and `co_firstlineno` are available (the latter can presumably be set with https://docs.python.org/3/library/types.html#types.CodeType.replace). In principle, we could index from the corresponding bytecode to the source file with `co_lnotab`, but that does seem rather difficult.
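For instance, a minimal sketch of re-homing a code object (the filename and line number here are made up):

```python
# Compile generated code, then rewrite its metadata so tracebacks point
# back at the original template source.
source = "1 / 0"   # stand-in for code generated from a template
code = compile(source, "<generated>", "exec")
code = code.replace(co_filename="my_template.py", co_firstlineno=42)

exec(code)   # the ZeroDivisionError traceback now cites my_template.py, line 42
```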
> First, the whole thing smells a little like AoT (ahead-of-time): doing extra work at parse time. From a jargon perspective, am I wrong?
It is doing extra work upfront, so correct. This extra work also enables greater precision in our semantics, because we can use more of the available context. Example: if I see `<{tagname}>...</{tagname}>`, it's possible for me to add the constraint that these are the same tag name, given that tags open/close each other. To do this constraint inference, I need to have constructed this parse.
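A minimal sketch of that constraint, with hypothetical names; at parse time, the close tag's interpolation expression must match the open tag's:

```python
# Sketch: enforce that the interpolation expressions in <{tagname}> and
# </{tagname}> are the same, since tags open/close each other.
def check_matched_tag(open_expr: str, close_expr: str) -> None:
    if open_expr != close_expr:
        raise SyntaxError(
            f"close tag {{{close_expr}}} does not match open tag {{{open_expr}}}"
        )

check_matched_tag("tagname", "tagname")   # ok; a mismatch raises SyntaxError
```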
Is it also correct that we've lowered the bar for letting people do custom AoT in their DSLs?

Sure, one has access to the Jinja2 AST. But I've worked with it. Not fun. Barely documented, really uncommon. It feels like the "here be dragons" path -- I'm not even sure if it's a stable API.

The layers that you're discussing -- starting from tagstr, but into your fdom machinery and thinking -- seem to really change the equation. If people want to bake more into the parse step's outputs, they can. Markdown parsing and the local content mode, for example.
> Is it also correct that we've lowered the bar for letting people do custom AoT in their DSLs?
>
> Sure, one has access to the Jinja2 AST. But I've worked with it. Not fun. Barely documented, really uncommon. It feels like the "here be dragons" path -- I'm not even sure if it's a stable API.
>
> The layers that you're discussing -- starting from tagstr, but into your fdom machinery and thinking -- seem to really change the equation. If people want to bake more into the parse step's outputs, they can. Markdown parsing and the local content mode, for example.
Yes, this is all true, and it's a good goal of this work.
> ... but that does seem rather difficult.
But it can be made easier: we should be able to use Python to report these bytecode offsets for chunks. (For thunks, we can report the interpolation failure, etc., more directly.) This is because we can capture what the Python compiler did for the VDOM, or other compiled representation, then use the placeholders to capture instrumentation. (Think of the placeholders as acting like trail signs on a hiking trail, navigating with respect to a map; then do a `sys.settrace` to capture what is going on.)

So instead of trying to figure out what a specific version of Python would do in its bytecode compilation, we just ask it, and then map those indexes in `co_lnotab` back to the original source code.

So it's still a reasonably challenging problem.
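A hedged sketch of "just asking" the interpreter, using `sys.settrace` to record which generated lines actually execute:

```python
import sys

events = []

def tracer(frame, event, arg):
    # Record each line event as (code object's filename, line number).
    if event == "line":
        events.append((frame.f_code.co_filename, frame.f_lineno))
    return tracer

sys.settrace(tracer)
exec(compile("x = 1\ny = x + 1", "<generated>", "exec"))
sys.settrace(None)

print(events)   # [('<generated>', 1), ('<generated>', 2)]
```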
Change the placeholder to `x_Nx`, where `N` is the arg index from the tag string. Consider the following Python template:
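As a hypothetical stand-in for such a template, take one with two interpolations (arg indices 0 and 1):

```python
# Hypothetical Python template: {func} and {arg} are the tag string's
# interpolations, at arg indices 0 and 1.
template = "{func}({arg})"
```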
This then translates to the placeholder equivalent:
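Continuing the hypothetical example, each interpolation becomes its positional `x_Nx` placeholder, so the result parses as ordinary Python:

```python
code = "x_0x(x_1x)"
```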
which renders to this AST, with

```python
import ast
print(ast.dump(ast.parse(code), indent=4))
```
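With the hypothetical `code` above, that prints (output format as of Python 3.9-3.12; newer versions may omit empty fields):

```
Module(
    body=[
        Expr(
            value=Call(
                func=Name(id='x_0x', ctx=Load()),
                args=[
                    Name(id='x_1x', ctx=Load())],
                keywords=[]))],
    type_ignores=[])
```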
Now traverse this AST, such that we build the AST in code. That's easy; it's the same thing, we just need to replace our placeholders with calls to `getvalue()`:
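A sketch of that traversal, assuming the hypothetical placeholders above and an `args` sequence whose items supply `getvalue()`:

```python
import ast
import re

PLACEHOLDER = re.compile(r"x_(\d+)x")

class SubstituteInterpolations(ast.NodeTransformer):
    """Rewrite each placeholder Name x_Nx into the AST for args[N].getvalue()."""

    def visit_Name(self, node):
        m = PLACEHOLDER.fullmatch(node.id)
        if m is None:
            return node
        n = int(m.group(1))
        # Build the AST for: args[N].getvalue()
        call = ast.Call(
            func=ast.Attribute(
                value=ast.Subscript(
                    value=ast.Name(id="args", ctx=ast.Load()),
                    slice=ast.Constant(value=n),
                    ctx=ast.Load(),
                ),
                attr="getvalue",
                ctx=ast.Load(),
            ),
            args=[],
            keywords=[],
        )
        return ast.copy_location(call, node)

tree = SubstituteInterpolations().visit(ast.parse("x_0x(x_1x)"))
ast.fix_missing_locations(tree)
print(ast.unparse(tree))   # args[0].getvalue()(args[1].getvalue())
```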
Now it's quite possible that this is still slower than regular Python parsing from text, which is super fast and doesn't need Python to build the AST. On the other hand, one could emit to some sort of mini VM written in Rust or C which, using the CPython API, builds out an AST as above; by the way, that mini VM would be roughly equivalent to, say, the cPickle mini VM...
And unlike substituting into text, building from pieces lets us guarantee that they substitute in correctly.