jimbaker opened 1 year ago
The specific part of the AST is actually interesting here. What could we actually substitute?
As is, we have
```python
Call(func=Name(id=str(args[6].getvalue()), ctx=Load()),
...
```
but this can be further generalized to an expression, given Python's grammar:

```
Call(expr func, expr* args, keyword* keywords)
```

(per https://docs.python.org/3/library/ast.html#abstract-grammar)
So this "widening" could come up if `args[6].getvalue()` evaluated to an `Expr` (using a recursive marker class) vs, say, a `str`. So for completeness, the runtime (as I'm calling it) should accommodate that case. On the other hand, if the interpolation were just a list of statements, that wouldn't work and should fail at the interpolation.
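To make that concrete, a hedged sketch (the helper name `as_func_expr` is mine, not part of any runtime):

```python
import ast

def as_func_expr(value):
    # The "widening": an interpolation may already be an AST expression
    # (e.g. Attribute, Lambda), which we splice in directly...
    if isinstance(value, ast.expr):
        return value
    # ...or a plain string naming the callee, as in the Name node above.
    if isinstance(value, str):
        return ast.Name(id=value, ctx=ast.Load())
    raise TypeError(f"cannot use {value!r} as a callable expression")
```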
Python's syntax is complicated. But in general, we should be able to compose in this fashion.
Somewhat related: what libraries can be used to turn a modified AST into Python source code? (The roundtripping problem.) This would be useful for caching and debugging.
I found https://github.com/berkerpeksag/astor, which references a code-generation library (codegen.py, https://github.com/berkerpeksag/astor#id2) that might be used in Jinja. It's unclear which versions of Python astor supports, but the repo has been updated within the last year.
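For illustration, a minimal round trip; on Python 3.9+ the stdlib's `ast.unparse` covers this, and `astor.to_source(tree)` is astor's equivalent:

```python
import ast

tree = ast.parse("x = 1 + 2")
# Modify the AST: replace the right-hand side with a constant.
tree.body[0].value = ast.Constant(value=3)

print(ast.unparse(tree))   # x = 3
# With astor instead: print(astor.to_source(tree))
```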
Multiple good points in here.
First, the whole thing smells a little like AoT (ahead-of-time): doing extra work at parse time. From a jargon perspective, am I wrong?
The part about "emit to some sort of mini VM written in Rust or C" is really fascinating. Some parts might be implemented as SQL query operations too. (And those might use extension functions in Rust.)
The roundtrip problem is one I haven't understood as well. Particularly: meaningful tracebacks. I had presumed the answer would be either:
You're right, a roundtrip might help. But how would you get a traceback that correlated the line back to the original?
> But how would you get a traceback that correlated the line back to the original?
Good question. `co_filename` and `co_firstlineno` are available (the latter can presumably be set with https://docs.python.org/3/library/types.html#types.CodeType.replace). In principle, we could index from the corresponding bytecode to the source file with `co_lnotab`, but that does seem rather difficult.
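For instance, a minimal sketch of re-homing a code object (the filename and line number here are made up):

```python
# Compile generated code, then rewrite its metadata so tracebacks point
# back at the original template source.
source = "1 / 0"   # stand-in for code generated from a template
code = compile(source, "<generated>", "exec")
code = code.replace(co_filename="my_template.py", co_firstlineno=42)

exec(code)   # the ZeroDivisionError traceback now cites my_template.py, line 42
```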
> First, the whole thing smells a little like AoT (ahead-of-time): doing extra work at parse time. From a jargon perspective, am I wrong?
It is doing extra work upfront, so correct. This extra work also enables greater precision in our semantics, because we can use more of the available context. Example: if I see `<{tagname}>...</{tagname}>`, it's possible for me to add the constraint that these are the same tag name, given that tags open/close each other. To do this constraint inference, I need to have constructed this parse.
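A minimal sketch of that constraint, with hypothetical names; at parse time, the close tag's interpolation expression must match the open tag's:

```python
# Sketch: enforce that the interpolation expressions in <{tagname}> and
# </{tagname}> are the same, since tags open/close each other.
def check_matched_tag(open_expr: str, close_expr: str) -> None:
    if open_expr != close_expr:
        raise SyntaxError(
            f"close tag {{{close_expr}}} does not match open tag {{{open_expr}}}"
        )

check_matched_tag("tagname", "tagname")   # ok; a mismatch raises SyntaxError
```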
Is it also correct that we've lowered the bar for letting people do custom AoT in their DSLs?

Sure, one has access to the Jinja2 AST. But I've worked with it. Not fun. Barely documented, really uncommon. It feels like the "here be dragons" path -- I'm not even sure if it's a stable API.

The layers that you're discussing -- starting from tagstr, but into your fdom machinery and thinking -- seem to really change the equation. If people want to bake more into the parse step's outputs, they can. Markdown parsing and the local content mode, for example.
> Is it also correct that we've lowered the bar for letting people do custom AoT in their DSLs?
>
> Sure, one has access to the Jinja2 AST. But I've worked with it. Not fun. Barely documented, really uncommon. It feels like the "here be dragons" path -- I'm not even sure if it's a stable API.
>
> The layers that you're discussing -- starting from tagstr, but into your fdom machinery and thinking -- seem to really change the equation. If people want to bake more into the parse step's outputs, they can. Markdown parsing and the local content mode, for example.
Yes, this is all true, and it's a good goal of this work.
> ... but that does seem rather difficult.
But it can be made easier: we should be able to use Python to report these bytecode offsets for chunks. (For thunks, we can report the interpolation failure, etc., more directly.) This is because we can capture what the Python compiler did for the VDOM, or other compiled representation, then use the placeholders to capture instrumentation. (Think of the placeholders as acting like trail signs on a hiking trail, navigating with respect to a map; then do a `sys.settrace` to capture what is going on.)

So instead of trying to figure out what a specific version of Python would do in its bytecode compilation, we just ask it, and then map those indexes in `co_lnotab` back to the original source code.

So it's still a reasonably challenging problem.
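A hedged sketch of "just asking" the interpreter, using `sys.settrace` to record which generated lines actually execute:

```python
import sys

events = []

def tracer(frame, event, arg):
    # Record each line event as (code object's filename, line number).
    if event == "line":
        events.append((frame.f_code.co_filename, frame.f_lineno))
    return tracer

sys.settrace(tracer)
exec(compile("x = 1\ny = x + 1", "<generated>", "exec"))
sys.settrace(None)

print(events)   # [('<generated>', 1), ('<generated>', 2)]
```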
Change the placeholder to `x_Nx`, where `N` is the arg index from the tag string. Consider the following Python template:
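As a hypothetical stand-in for such a template, take one with two interpolations (arg indices 0 and 1):

```python
# Hypothetical Python template: {func} and {arg} are the tag string's
# interpolations, at arg indices 0 and 1.
template = "{func}({arg})"
```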
This then translates to the placeholder equivalent:
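Continuing the hypothetical example, each interpolation becomes its positional `x_Nx` placeholder, so the result parses as ordinary Python:

```python
code = "x_0x(x_1x)"
```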
which renders to this AST, with

```python
import ast
print(ast.dump(ast.parse(code), indent=4))
```
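With the hypothetical `code` above, that prints (output format as of Python 3.9-3.12; newer versions may omit empty fields):

```
Module(
    body=[
        Expr(
            value=Call(
                func=Name(id='x_0x', ctx=Load()),
                args=[
                    Name(id='x_1x', ctx=Load())],
                keywords=[]))],
    type_ignores=[])
```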
Now traverse this AST, such that we build the AST in code. That's easy; it's the same thing, we just need to replace our placeholders with calls to `getvalue()`:
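A sketch of that traversal, assuming the hypothetical placeholders above and an `args` sequence whose items supply `getvalue()`:

```python
import ast
import re

PLACEHOLDER = re.compile(r"x_(\d+)x")

class SubstituteInterpolations(ast.NodeTransformer):
    """Rewrite each placeholder Name x_Nx into the AST for args[N].getvalue()."""

    def visit_Name(self, node):
        m = PLACEHOLDER.fullmatch(node.id)
        if m is None:
            return node
        n = int(m.group(1))
        # Build the AST for: args[N].getvalue()
        call = ast.Call(
            func=ast.Attribute(
                value=ast.Subscript(
                    value=ast.Name(id="args", ctx=ast.Load()),
                    slice=ast.Constant(value=n),
                    ctx=ast.Load(),
                ),
                attr="getvalue",
                ctx=ast.Load(),
            ),
            args=[],
            keywords=[],
        )
        return ast.copy_location(call, node)

tree = SubstituteInterpolations().visit(ast.parse("x_0x(x_1x)"))
ast.fix_missing_locations(tree)
print(ast.unparse(tree))   # args[0].getvalue()(args[1].getvalue())
```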
Now it's quite possible that this is still slower than regular Python parsing from text, which is super fast and doesn't need Python to build the AST. On the other hand, one could emit to some sort of mini VM written in Rust or C which, using the CPython API, builds out an AST as above; by the way, that mini VM would be roughly equivalent to, say, the cPickle mini VM...
And unlike substituting into text, building from pieces lets us guarantee that they substitute in correctly.