numba / numba-rvsdg

Numba compatible RVSDG (Regionalized Value State Dependence Graph) utilities.
https://numba-rvsdg.readthedocs.io/
BSD 2-Clause "Simplified" License

Write demonstrator of Python bytecode → Python AST #81

Closed jpivarski closed 6 months ago

jpivarski commented 1 year ago

From https://github.com/numba/numba-rvsdg/issues/80#issuecomment-1610043380, I noticed that it's possible to make RVSDG from Python bytecode already. (I hadn't been paying close attention to all the progress that was being made here, but I just noticed it now.)

This is a note to myself (which I will assign to myself) to write a demonstrator of Python bytecode → Python AST, at least for a subset of simple expressions, to get a sense of how it will go. I understand that the RVSDG development is very much in flux and that the demonstrator will get out of date quickly, but it would be good for me to do a first test-run, to get familiar with RVSDG as a concept and discover what the fundamental issues are.

jpivarski commented 1 year ago

(Oh, I can't assign it to myself, but feel free to assign it to me. I'll do the work in a fork and make it a draft PR, which will probably never get out of draft status or be merged.)

esc commented 1 year ago

@jpivarski I have assigned you.

BTW: the current numba-rvsdg 0.0.2 only supports transformation from Python bytecode to SCFG (structured control flow graph). While the documentation is still a bit sparse at present, we do have:

https://numba-rvsdg.readthedocs.io/en/latest/reference/datastructures.html#numba_rvsdg.core.datastructures.scfg.SCFG

Note, though, that the SCFG object mostly has methods for creating the SCFG from Python bytecode, rather than for introspecting it.

jpivarski commented 1 year ago

Thanks! I figured that at present, there would only be conversions from Python bytecode to something new, such as this SCFG class. Through this project, I'm hoping to learn about the new code representation, even if the API changes and I'll have to rewrite it later.

jpivarski commented 1 year ago

I started investigating it, and learned that not just any bytecode will produce a graph—only bytecode with control flow.

from numba_rvsdg.core.datastructures.byte_flow import ByteFlow

def foo(x, y):
    return x + y

byteflow = ByteFlow.from_bytecode(foo.__code__)

print(byteflow.scfg.to_dict())

produces

{'python_bytecode_block_0': {'jt': []}}

but

def foo(num, start):
    out = start
    for i in range(num):
        out += i**2
    return out

produces

{
    "python_bytecode_block_0": {"jt": ["python_bytecode_block_1"]},
    "python_bytecode_block_1": {"jt": ["python_bytecode_block_2", "python_bytecode_block_3"]},
    "python_bytecode_block_2": {"jt": ["python_bytecode_block_1"]},
    "python_bytecode_block_3": {"jt": []},
}

It mainly seems to be grouping bytecodes in a useful way, perhaps indicating data dependencies. Still going with that second example,

bcmap = byteflow.scfg.bcmap_from_bytecode(byteflow.bc)

for i in range(4):
    print(f"python_bytecode_block_{i}")
    python_bytecode_block = byteflow.scfg.graph[f"python_bytecode_block_{i}"]
    for instruction in python_bytecode_block.get_instructions(bcmap):
        print(f"    {instruction.opname} {instruction.arg} {instruction.argval}")

produces

python_bytecode_block_0
    LOAD_FAST 1 start
    STORE_FAST 2 out
    LOAD_GLOBAL 0 range
    LOAD_FAST 0 num
    CALL_FUNCTION 1 1
    GET_ITER None None
python_bytecode_block_1
    FOR_ITER 8 30
python_bytecode_block_2
    STORE_FAST 3 i
    LOAD_FAST 2 out
    LOAD_FAST 3 i
    LOAD_CONST 1 2
    BINARY_POWER None None
    INPLACE_ADD None None
    STORE_FAST 2 out
    JUMP_ABSOLUTE 6 12
python_bytecode_block_3
    LOAD_FAST 2 out
    RETURN_VALUE None None

while

import dis
print(dis.dis(foo))

produces

  6           0 LOAD_FAST                1 (start)
              2 STORE_FAST               2 (out)

  7           4 LOAD_GLOBAL              0 (range)
              6 LOAD_FAST                0 (num)
              8 CALL_FUNCTION            1
             10 GET_ITER
        >>   12 FOR_ITER                 8 (to 30)
             14 STORE_FAST               3 (i)

  8          16 LOAD_FAST                2 (out)
             18 LOAD_FAST                3 (i)
             20 LOAD_CONST               1 (2)
             22 BINARY_POWER
             24 INPLACE_ADD
             26 STORE_FAST               2 (out)
             28 JUMP_ABSOLUTE            6 (to 12)

  9     >>   30 LOAD_FAST                2 (out)
             32 RETURN_VALUE

Looking at it with

from numba_rvsdg.rendering.rendering import ByteFlowRenderer

bfr = ByteFlowRenderer()
bfr.bcmap_from_bytecode(byteflow.bc)
byteflow.scfg.view("scfg")

produces

I know that I can convert expressions from bytecode to AST; the control flow was always the hard part. This graph must be capturing the control flow in some way, though I'll need to play with it and maybe ask some specific questions to understand how.

Now I'm going to try running through a complete set of Python expressions (no statements) to see if this always gives me single-node graphs, as I would expect.

jpivarski commented 1 year ago

Hypothesis: all Python expressions produce an SCFG graph of length 1.

Result: yup, they do.

import ast

from numba_rvsdg.core.datastructures.byte_flow import ByteFlow

# all of the expression AST node types
expression_types = [
    x
    for x in dir(ast)
    if (
            isinstance(getattr(ast, x), type)
            and issubclass(getattr(ast, x), ast.expr)
            and x not in (
                "expr",

                "Num",           # deprecated since Python 3.8
                "Str",           #   |
                "Bytes",         #   |
                "NameConstant",  #   |
                "Ellipsis",      #   V
            )
    )
]

# remove an expression node type when we see one
class RemoveFromList(ast.NodeVisitor):
    def visit(self, node):
        if type(node).__name__ in expression_types:
            expression_types.remove(type(node).__name__)
        return super().visit(node)

remove_from_list = RemoveFromList()

# examples of all expression types in Python
for string_expression in [
        "x.y",                                      # Attribute
        "await f()",                                # Await
        "x + y",                                    # BinOp
        "x and y",                                  # BoolOp
        "f()",                                      # Call
        "x < y",                                    # Compare
        "123",                                      # Constant
        "{1: 2}",                                   # Dict
        "{x: x for x in collection}",               # DictComp
        "f'{x}'",                                   # FormattedValue
        "(x for x in collection)",                  # GeneratorExp
        "consequent if predicate else alternate",   # IfExp
        "f'x'",                                     # JoinedStr
        "lambda: None",                             # Lambda
        "[1, 2, 3]",                                # List
        "[x for x in collection]",                  # ListComp
        "x",                                        # Name
        "(x := 4)",                                 # NamedExpr
        "{1, 2, 3}",                                # Set
        "{x for x in collection}",                  # SetComp
        "x[1:2]",                                   # Slice
        "(x, *y)",                                  # Starred
        "x[1]",                                     # Subscript
        "(1, 2, 3)",                                # Tuple
        "~x",                                       # UnaryOp
        "yield x",                                  # Yield
        "yield from x",                             # YieldFrom
]:
    # has to be wrapped in a function so that Await is valid
    asyncy = "async " if "await " in string_expression else ""
    syntax_tree = ast.parse(f"{asyncy}def _(): {string_expression}")

    # remove from the list of expression types
    remove_from_list.visit(syntax_tree)

    code = compile(syntax_tree, "<string>", "exec")

    # hypothesis: SCFG for all bare expressions is a one-node graph
    byteflow = ByteFlow.from_bytecode(code)
    assert len(byteflow.scfg.graph) == 1

# did we miss any expression types?
assert len(expression_types) == 0

esc commented 1 year ago

@jpivarski what you have discovered is correct. Any linear sequence of bytecodes without if-else or for constructs will produce only a single PythonBytecodeBlock. As soon as control flow or looping is introduced, you get multiple blocks. You can assume that any PythonBytecodeBlock contains a linear sequence of Python bytecodes / expressions.
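
A quick way to check this, using the same ByteFlow API as in the snippets above (the block counts in the comments are expectations, not guarantees across versions):

from numba_rvsdg.core.datastructures.byte_flow import ByteFlow

def linear(x, y):
    return x * y + 1          # straight-line code, no control flow

def branchy(x):
    if x > 0:                 # a branch introduces extra blocks
        return x
    return -x

# expectation: 1 block for the linear function, several for the branchy one
print(len(ByteFlow.from_bytecode(linear.__code__).scfg.graph))
print(len(ByteFlow.from_bytecode(branchy.__code__).scfg.graph))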

esc commented 1 year ago

@jpivarski you can also run the restructure method from https://numba-rvsdg.readthedocs.io/en/latest/reference/datastructures.html#numba_rvsdg.core.datastructures.byte_flow.ByteFlow.restructure

This will apply the LOOP-RESTRUCTURE and BRANCH-RESTRUCTURE algorithms from sections 4.1 and 4.2 of:

https://dl.acm.org/doi/pdf/10.1145/2693261

and then render the graph to see what it looks like.
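
For example (a minimal sketch reusing the rendering calls from the earlier comment, and assuming restructure() returns a new ByteFlow, per the 0.0.2 docs):

from numba_rvsdg.core.datastructures.byte_flow import ByteFlow
from numba_rvsdg.rendering.rendering import ByteFlowRenderer

def foo(num, start):
    out = start
    for i in range(num):
        out += i**2
    return out

byteflow = ByteFlow.from_bytecode(foo.__code__)
restructured = byteflow.restructure()        # applies LOOP-RESTRUCTURE and BRANCH-RESTRUCTURE

bfr = ByteFlowRenderer()
bfr.bcmap_from_bytecode(restructured.bc)
restructured.scfg.view("restructured_scfg")  # writes and opens a graphviz rendering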

The main goal of the RVSDG package right now is to transform Python bytecode into an SCFG which can be restructured.

Note also that some of the top-level classes are scheduled for removal in:

https://github.com/numba/numba-rvsdg/issues/75

But it shouldn't be that big of a change.

jpivarski commented 1 year ago

Any linear sequence of bytecodes without if-else or for constructs will produce only a single PythonBytecodeBlock. As soon as control flow or looping are introduced you get multiple blocks.

If you get graph structures for if-else and for, I think there would also be graph structures for while.

For the following statement types, here's what I think about whether they'll split into multiple graph nodes. (YES = you said that they split, MAYBE = I think they would, and NO = I don't think they would; I think several of these can be part of the same graph node.)

The main goal of the RVSDG package right now is to transform Python bytecode into an SCFG which can be restructured.

It's probably the case that I'll only want to look at reconstructed graphs, then. Right now, I'm still getting a basic understanding of these things.

I wonder if I should first focus on the parts that aren't spread across graph nodes. The parsing rules from uncompyle6 would still be useful here, applied to each graph node as an individual sequence of instructions, since they still need to be turned into AST trees. It would be like this, which targets Python expressions only, but making ast.AST nodes instead of the rejig.syntaxtree thing in that old demo.
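
To make that per-block idea concrete, a toy stack-machine pass over a branch-free code object might look like the following (only a handful of opcodes handled, and note the opcode names differ between 3.10 and 3.11; this is just an illustration, not how uncompyle6 actually does it):

import ast
import dis

def block_to_ast(code):
    """Toy: rebuild an ast statement from a trivial, branch-free code object."""
    stack = []
    for ins in dis.get_instructions(code):
        if ins.opname == "RESUME":                        # 3.11 bookkeeping, no stack effect
            continue
        elif ins.opname == "LOAD_FAST":
            stack.append(ast.Name(id=ins.argval, ctx=ast.Load()))
        elif ins.opname == "LOAD_CONST":
            stack.append(ast.Constant(value=ins.argval))
        elif ins.opname in ("BINARY_ADD", "BINARY_OP"):   # <=3.10 vs >=3.11 spelling; assumes "+"
            right, left = stack.pop(), stack.pop()
            stack.append(ast.BinOp(left=left, op=ast.Add(), right=right))
        elif ins.opname == "RETURN_VALUE":
            return ast.Return(value=stack.pop())
        else:
            raise NotImplementedError(ins.opname)

def add(x, y):
    return x + y

print(ast.unparse(block_to_ast(add.__code__)))   # -> "return x + y"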

As for the control flow parts that are spread across graph nodes, they could be reconstructed into for, if, while, etc. from the restructured graph (somehow) instead of using uncompyle6.

And maybe I shouldn't be thinking of this project as using uncompyle6, but rather as being part of uncompyle6, as a way to help @rocky get that tool up-to-date with recent Pythons.

(Hi, @rocky, bringing you into this conversation for the first time. I've been thinking about the problem of converting bytecode → AST for recent versions of Python, which is also a problem for Numba and this RVSDG formalism promises to fix it. I've been doing these tests to see if the bytecode interpretation part can remain independent of LLVM/pure Python. Should we talk on https://github.com/rocky/python-uncompyle6/discussions about whether you'd want to fold that into your project? This is all very early/preliminary/just seeing what's possible.)

Note also that some of the top-level classes are scheduled for removal in:

#75

But it shouldn't be that big of a change.

I understand that it's in flux. I'll keep doing git pull as things change.

rocky commented 1 year ago

@jpivarski - I am interested, but it will take me some time to get up to speed.

rocky commented 1 year ago

Some initial comments. Overall I think the approach or idea is sound, although this is going to be a lot of work.

(Believe it or not, that the approach is sound is a big deal. I have read about other approaches that feel a little whacky and not that sound.)

A couple of observations though when I look at https://github.com/diana-hep/rejig/blob/master/reinterpreted-python/rejig/pybytecode.py

If this is to be able to work on more than one Python bytecode version, you will probably want to structure it in a way that segregates versions.

There are basically two approaches - one that treats each Python version as its own separate entity, and one that accommodates sharing between versions. For accommodating sharing between versions, the approach I have been using has been something like: Python 3.9 is like 3.8 except here and here and here. In other words, we say that for this kind of rule or action, change it in such and such a way.

Personally, I have been moving away from the shared approach on the grammar or parse side because it was just too complicated to keep track of things. So here I just cut and paste. If I then need to adjust, okay.

On the semantic action side though there is still sharing going on. See https://github.com/rocky/python-uncompyle6/blob/master/uncompyle6/semantics/customize3.py#L354-L376

One other important thing to understand and keep in mind is the difference between uncompyle6, decompyle3, and yet another, not-public fork called decompile-cfg.

In theory and ideally, there is no reason there could not be one code base for all versions. The problem has always been the amount of work needed to back-port stuff.

There is a lot of work still needed to backport decompyle3 code to uncompyle6. And although you can't see decompile-cfg, the same is true for backporting it to decompyle3. (And the decompile-cfg code is still changing a bit on its own.)

So for my own use such as in the trepan3k debugger, I import the right code base of these depending on version: for 3.7 and 3.8 I use decompyle3. Otherwise uncompyle6.

One recent and big change in the way I think about this is that I am now doing more in the way of abstracting the parse tree that I get from a parse of tokenized bytecode instructions. See https://github.com/rocky/python-uncompyle6/discussions/412#discussioncomment-5563422

This has a big and useful impact in that when we get to the semantic actions there is less junk in the tree, and that means things will be more the same across different Python versions (and Python bytecode versions). In the semantic actions we have to go by position in the tree - 0th item, 1st item, and so on. Again, reducing the noise or pseudo operations that don't have any real bytecode in them makes different versions or different derivations in the same Python version look more similar.

You will find a little bit of this in https://github.com/rocky/python-decompile3/blob/master/decompyle3/semantics/transform.py#L30-L67 and https://github.com/rocky/python-decompile3/blob/master/decompyle3/semantics/transform.py#L558

rocky commented 1 year ago

In the list above for branching ops there are:

jpivarski commented 1 year ago

Having looked around the uncompyle6 codebase (though not the other ones), I'm aware of how big of a project it is. What I'm hoping is that we can take advantage of the modularization of Numba to get some parts from shared effort, in particular the hardest parts. After all, Numba has to unravel the control structures, too, though Numba does not need to parse instructions into an AST. (I think Numba may be able to do a Python instruction → LLVM instruction mapping without knowing the tree structure of expressions.)[^1]

Separating by Python versions sounds like a good idea to me. Maybe the parsing code should be the same and each Python version should have its own set of grammar rules, copying where necessary? I say that because it's how I had thought that uncompyle6 works. I saw that there was a lot of separation between the Python versions already, and I had assumed that it went all the way down to the grammar rules.

In the list above for branching ops there are:

That's a good point. In https://github.com/numba/numba-rvsdg/issues/81#issuecomment-1636549934, these formed a single graph node, but they ought to be control flow (including and/or). Maybe that's still to be addressed.

[^1]: We should also keep in mind that the Numba team has a large project in front of them, and ensuring that Python bytecode → AST is not a goal of that project. I'm hoping to piggy-back this on it, since it seems so promising.

rocky commented 1 year ago

I think I am confused. What would you like to do, and how do you propose doing it? I see code, but perhaps I don't understand the bigger picture.

We should also keep in mind that the Numba team has a large project in front of them, and ensuring that Python bytecode → AST is not a goal of that project. I'm hoping to piggy-back this on it, since it seems so promising.

Python bytecode to Python AST (or any sort of equivalent AST) is about as hard as Python bytecode to source code - no easier and no more difficult.

The reality is that, with respect to source code, I have not been able to keep up. What I have been doing as a hobby is something that would keep a couple of people fully occupied full time as a paid activity.

So I am confused about what you expect to see. A proof of concept? Being satisfied with working in some limited context, e.g. for basic blocks only? (Then why RVSDG?) If a limited context, what limited context?

Maybe the parsing code should be the same and each Python version should have its own set of grammar rules, copying where necessary? I say that because it's how I had thought that uncompyle6 works.

Yes, that is how uncompyle6 works. But not in the newer, non-public decompiler that works for 3.8-3.10. There, there is no grammar sharing. I mention this as kind of an aside. decompyle3 is in the middle - I started moving in this direction.

The main thing on parsing that is important is that, over time, there is too much drift in bytecode to have one piece of code that has tests for all the variations, or a grammar that is a superset of all the versions. Instead, it has been conceptually easier to separate these into separate pieces of code, each for a single bytecode version.

jpivarski commented 1 year ago

The bigger picture isn't clear yet. As a first step, I want to see if this is going to be possible, so it's a proof of concept. Beyond that, if it does work, it could be

  1. integrated into uncompyle6 or decompyle3 (whichever is the one with the long-term future), and then uncompyle would depend on numba-rvsdg and not LLVM,
  2. a package that depends on numba-rvsdg and uncompyle,
  3. a package that depends on numba-rvsdg only,
  4. nothing at all, because maybe you've got another way to address Python bytecode changes that doesn't get any help from numba-rvsdg.

I think it would be better overall if there's only one package that provides Python bytecode → AST, though option (1) would have to be something you're interested in, too, if it's going to go forward.

Still, the first thing (for me) to do is the proof of concept, to know if this is even a good approach. What scope? Definitely more than basic blocks only, though I was thinking of starting with that and using uncompyle6 for it, exactly because it's an already-solved problem. Then, when addressing something like an if statement or a for loop, I'd have a framework to fit it in.

Python bytecode to Python AST (or any sort of equivalent AST) is about as hard as Python bytecode to source code - no easier and no more difficult.

Especially because ast.unparse is in the Python standard library.
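
For example:

import ast

tree = ast.parse("out = sum(i**2 for i in range(num)) + start")
print(ast.unparse(tree))   # regenerates equivalent source text (Python >= 3.9)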

not public decompiler that works for 3.8-3.10

That's interesting, and it's the reason I added option (4) above. I didn't know that you were ready to make this leap.

What I have been doing as a hobby is something that would keep a couple of people fully occupied full time as a paid activity.

That's why I thought that using numba-rvsdg was a good idea: it will be kept up-to-date with Python bytecode changes, and the Numba team is large enough that I know this will happen.

rocky commented 1 year ago

For a prototype, I'd start with https://github.com/rocky/python-decompile3 which handles 3.7 and 3.8 code only. The code style is generally more modern Python and it follows better the ideas in the branch that I believe addresses control flow issues.

I have written a bit about recent thoughts and work as to how to do decompilation in https://github.com/rocky/python-uncompyle6/discussions/412

esc commented 1 year ago

Any linear sequence of bytecodes without if-else or for constructs will produce only a single PythonBytecodeBlock. As soon as control flow or looping are introduced you get multiple blocks.

If you get graph structures for if-else and for, I think there would also be graph structures for while.

Oh, so sorry for the misunderstanding: for here means "loop", and that includes while.

esc commented 1 year ago

For a prototype, I'd start with https://github.com/rocky/python-decompile3 which handles 3.7 and 3.8 code only. The code style is generally more modern Python and it follows better the ideas in the branch that I believe addresses control flow issues.

numba-rvsdg supports 3.11 only.

rocky commented 1 year ago

I just cloned numba-rvsdg and built it. The project seems to be pretty new - less than a year old.

It seems similar to https://github.com/rocky/python-control-flow, although the end goal is to produce an RVSDG rather than a control-flow graph with dominator and reverse dominator information.

If the future of Python is like its past, then if this project is to be long-lived it will have to come up with a strategy for supporting more Python versions than 3.11. Initially, the path of least resistance is to assume that the next version's bytecode won't be that different from the last.

That generally happens. But periodically it doesn't. Python went from a variable 1- or 3-byte bytecode format to a fixed 2-byte "wordcode" format, I think between 3.5 and 3.6. Nowadays this kind of change wouldn't be a problem, because I think Python now has abstracted instructions from bytecode. But back in the 3.5-to-3.6 change, I don't believe that was the case. And although I haven't looked at the details all that much, Python 3.11 is a bit different from 3.10.

Assuming 3.12 will not be that different from 3.11, there is the question of whether you'll allow handling a different bytecode from the bytecode used by the Python interpreter running numba-rvsdg. If so, then I believe you will need to use something other than the Python dis module. And that's why I wrote and use xdis instead.

But as I said, this kind of complication and decision on how to handle more than one bytecode is definitely something that can be put off for later.

Now that I understand numba-rvsdg better, let me close with how its overall approach to understanding control flow is different from the decompiling approach in uncompyle6 and its descendants.

uncompyle6 leverages a couple of features of the high-level bytecode interpreter approach (also found in Ruby, Smalltalk, Lua, and various Lisps like Emacs Lisp). First, for all practical purposes, the translation to bytecode is done by one kind of parsing system for any given version. (What happens afterwards can vary with, say, pyston or pypy.) The fact that there is just one source-code-to-bytecode translator reduces variation in how a given construct might conceivably be represented in bytecode: the interpreter largely picks one way and uses that.

Given this, it becomes easier to use parsing technology that is similar to (but not the same as) that in conventional compilers. Because the translation is like DNA and has its quirks and leaves fingerprints all over the place, we can use those quirks to more often retrieve the exact Python construct that was used, rather than some other construct or sequence that is semantically equivalent.

The control-flow-to-higher-level-construct recovery used in numba-rvsdg, while more conventional and general purpose, does not make use of these kinds of clues that specific compilers put in. Uncompyle6 can distinguish if a and b: from if a: if b: in some bytecode versions, when the compiler produces different bytecode for these. (Nowadays this happens less often.)
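
That is easy to check directly with dis; on recent CPython the two spellings usually compile to the same jump pattern, for example:

import dis

def and_form(a, b):
    if a and b:
        return 1

def nested_form(a, b):
    if a:
        if b:
            return 1

# compare the jump instructions in the two disassemblies
dis.dis(and_form)
dis.dis(nested_form)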

Python, in particular, has this rich and whacky set of control-flow structures, such as the "else" branches on "for", "while", and "try" statements. I doubt numba-rvsdg would bother to go through the effort needed to look for these, as opposed to trying to express them along the lines of simpler and more basic control-flow structures involving ifs wrapped around something else. Should it try to tackle "try ... else ... finally", I think you'll find the control-flow algorithm getting a bit more complicated.

In sum, right now the two approaches to handling control flow, while each is valid, are a bit different from each other.

sklam commented 1 year ago

Thanks @rocky for the detailed insights.

The control-flow-to-higher-level-construct recovery used in numba-rvsdg, while more conventional and general purpose, does not make use of these kinds of clues that specific compilers put in. Uncompyle6 can distinguish if a and b: from if a: if b: in some bytecode versions, when the compiler produces different bytecode for these. (Nowadays this happens less often.)

Yes, unlike uncompyle6, the goal for numba-rvsdg is not to provide a syntactic equivalent (AST recovery) but instead to provide a semantic equivalent. For Python compilers (particularly Numba), syntactic information is not important. The semantic equivalent in RVSDG will provide compilers a nice data-centric view (where control flow is more or less implied) that makes compiler passes a lot easier to write. Also, we don't want to rely on the bytecode "clues" and "quirks" because they change over time. The heuristic-based approach is a huge burden in the current Numba bytecode frontend, and we anticipate more bytecode changes due to the faster-cpython effort, which will be adding more bytecode-level optimizations.

Python, in particular, has this rich and whacky set of control-flow structures, such as the "else" branches on "for", "while", and "try" statements. I doubt numba-rvsdg would bother to go through the effort needed to look for these, as opposed to trying to express them along the lines of simpler and more basic control-flow structures involving ifs wrapped around something else. Should it try to tackle "try ... else ... finally", I think you'll find the control-flow algorithm getting a bit more complicated.

We will need to tackle all the whacky Python control flow to reach parity with what the Numba bytecode frontend currently supports. The good thing is that for, while, and else are modeled by jumps/branches in the bytecode, so no additional work is needed. With Python>=3.11, try-except-finally is encoded in a side table in __code__.co_exceptiontable. The plan is to represent try/except as specific regions so the exception flow is again implied.
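
For example, the side table is visible directly on the code object (CPython >= 3.11):

import dis

def f(x):
    try:
        return 1 / x
    except ZeroDivisionError:
        return 0

print(f.__code__.co_exceptiontable)   # raw bytes of the exception side table
dis.dis(f)                            # dis also prints an "ExceptionTable:" section on 3.11+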

jpivarski commented 1 year ago

@gordonwatts

esc commented 6 months ago

FWIW, the following PR does go from SCFG -> AST -- but the SCFG must be generated from a Python AST not bytecode:

https://github.com/numba/numba-rvsdg/pull/127

I suppose however that the SCFG2ASTTransformer could be a good place to start going from bytecode back to an AST. Essentially the control-flow constructs are already generated, but the bytecode that describes assignments and expressions would also need to be translated into corresponding AST constructs. So anyway, my implementation listed above does about 50% of the work. Maybe it could be of benefit to someone else.

rocky commented 6 months ago

@esc I had forgotten about this PR. Since 2023, I have been thinking about and working on this, and after a few false starts, I have come to the belief that the way to handle control flow in high-level bytecode languages that are grammar-based or LLM-based is not by selecting the control flow, but instead by marking dominator regions in basic blocks and letting the grammar or LLM do the actual language-specific control-flow selection.

Basically, this is what https://github.com/rocky/python-control-flow does. Most of it is generic in terms of computing the basic blocks and dominator regions. There is a step at the end that is very specific to marking up the bytecode so that a grammar based approach can use it.

Looking again at the PR based on my current understanding, I am somewhat negatively disposed to the kind of approach in this PR. From a code size perspective alone, this is too small. It is like trying to fit 100 cubic meters of information in a 10 cubic-meter box.

rocky commented 6 months ago

Also, I forgot to mention that I recently gave a talk at BlackHat Asia 2024 that briefly, at the end, explained some of the thinking here. A link to the slide fragments and text can be found at https://rocky.github.io/blackhat-asia-2024-additional/all-notes-print.html, specifically https://rocky.github.io/blackhat-asia-2024-additional/all-notes-print.html#cfg . Eventually, the video of this talk should be publicly available. (Right now I think you can get it if you were registered for the conference or pay some fee to see videos.)

jpivarski commented 6 months ago

Thanks for pointers to follow-up, @esc and @rocky! Both of these are better ideas than the work I started here, so I'm going to close this issue and see how your work develops. If I find a way to get back into this, I'll see if I can help.