Closed jpivarski closed 6 months ago
(Oh, I can't assign it to myself, but feel free to assign it to me. I'll do the work in a fork and make it a draft PR, which will probably never get out of draft status or be merged.)
@jpivarski I have assigned you.
BTW: the current numba-rvsdg 0.0.2 only supports transformation from Python bytecode to SCFG (structured control flow graph). While the documentation is still a bit sparse at present, we do have:
Although the `SCFG` object has mostly methods for creating the SCFG from Python bytecode, rather than introspecting it.
Thanks! I figured that at present, there would only be conversions from Python bytecode to something new, such as this SCFG class. Through this project, I'm hoping to learn about the new code representation, even if the API changes and I'll have to rewrite it later.
I started investigating it, and learned that not just any bytecode will produce a graph—only bytecode with control flow.
```python
from numba_rvsdg.core.datastructures.byte_flow import ByteFlow

def foo(x, y):
    return x + y

byteflow = ByteFlow.from_bytecode(foo.__code__)
print(byteflow.scfg.to_dict())
```
produces
```python
{'python_bytecode_block_0': {'jt': []}}
```
but
```python
def foo(num, start):
    out = start
    for i in range(num):
        out += i**2
    return out
```
produces
```python
{
    "python_bytecode_block_0": {"jt": ["python_bytecode_block_1"]},
    "python_bytecode_block_1": {"jt": ["python_bytecode_block_2", "python_bytecode_block_3"]},
    "python_bytecode_block_2": {"jt": ["python_bytecode_block_1"]},
    "python_bytecode_block_3": {"jt": []},
}
```
It mainly seems to be grouping bytecodes in a useful way, perhaps indicating data dependencies. Still going with that second example,
```python
bcmap = byteflow.scfg.bcmap_from_bytecode(byteflow.bc)
for i in range(4):
    print(f"python_bytecode_block_{i}")
    python_bytecode_block = byteflow.scfg.graph[f"python_bytecode_block_{i}"]
    for instruction in python_bytecode_block.get_instructions(bcmap):
        print(f"    {instruction.opname} {instruction.arg} {instruction.argval}")
```
produces
```
python_bytecode_block_0
    LOAD_FAST 1 start
    STORE_FAST 2 out
    LOAD_GLOBAL 0 range
    LOAD_FAST 0 num
    CALL_FUNCTION 1 1
    GET_ITER None None
python_bytecode_block_1
    FOR_ITER 8 30
python_bytecode_block_2
    STORE_FAST 3 i
    LOAD_FAST 2 out
    LOAD_FAST 3 i
    LOAD_CONST 1 2
    BINARY_POWER None None
    INPLACE_ADD None None
    STORE_FAST 2 out
    JUMP_ABSOLUTE 6 12
python_bytecode_block_3
    LOAD_FAST 2 out
    RETURN_VALUE None None
```
while
```python
import dis
print(dis.dis(foo))
```
produces
```
  6           0 LOAD_FAST                1 (start)
              2 STORE_FAST               2 (out)

  7           4 LOAD_GLOBAL              0 (range)
              6 LOAD_FAST                0 (num)
              8 CALL_FUNCTION            1
             10 GET_ITER
        >>   12 FOR_ITER                 8 (to 30)
             14 STORE_FAST               3 (i)

  8          16 LOAD_FAST                2 (out)
             18 LOAD_FAST                3 (i)
             20 LOAD_CONST               1 (2)
             22 BINARY_POWER
             24 INPLACE_ADD
             26 STORE_FAST               2 (out)
             28 JUMP_ABSOLUTE            6 (to 12)

  9     >>   30 LOAD_FAST                2 (out)
             32 RETURN_VALUE
```
Looking at it with
```python
from numba_rvsdg.rendering.rendering import ByteFlowRenderer

bfr = ByteFlowRenderer()
bfr.bcmap_from_bytecode(byteflow.bc)
byteflow.scfg.view("scfg")
```
produces a Graphviz rendering of the blocks and their jump edges (image not reproduced here).
I know that I can convert expressions from bytecode to AST; the control flow was always the hard part. This graph must be capturing the control flow in some way, though I'll need to play with it and maybe ask some specific questions to understand how.
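To make the "easy part" concrete, here is a toy sketch of mine (not part of numba-rvsdg or uncompyle6) that rebuilds an `ast` expression from a linear, control-flow-free bytecode sequence by simulating the value stack. It handles only a handful of load and binary-arithmetic opcodes, with aliases so it works across recent CPython versions; everything else raises `NotImplementedError`.

```python
import ast
import dis

def expression_bytecode_to_ast(func):
    """Rebuild an ast expression from the straight-line bytecode of a lambda.

    Toy sketch: only names, constants, and binary +/* are handled.
    """
    stack = []
    for ins in dis.get_instructions(func):
        if ins.opname in ("LOAD_FAST", "LOAD_GLOBAL", "LOAD_NAME"):
            stack.append(ast.Name(id=ins.argval, ctx=ast.Load()))
        elif ins.opname == "LOAD_FAST_LOAD_FAST":  # fused opcode in newer CPythons
            for name in ins.argval:
                stack.append(ast.Name(id=name, ctx=ast.Load()))
        elif ins.opname in ("LOAD_CONST", "LOAD_SMALL_INT"):
            stack.append(ast.Constant(value=ins.argval))
        elif ins.opname in ("BINARY_ADD", "BINARY_MULTIPLY") or (
            ins.opname == "BINARY_OP" and ins.argrepr in ("+", "*")
        ):
            # binary opcodes pop right then left, push the result
            right, left = stack.pop(), stack.pop()
            op = ast.Add() if ("ADD" in ins.opname or ins.argrepr == "+") else ast.Mult()
            stack.append(ast.BinOp(left=left, op=op, right=right))
        elif ins.opname == "RETURN_VALUE":
            return stack.pop()
        elif ins.opname in ("RESUME", "PUSH_NULL", "COPY_FREE_VARS", "NOP"):
            continue  # bookkeeping opcodes that don't touch the expression
        else:
            raise NotImplementedError(ins.opname)

tree = expression_bytecode_to_ast(lambda x, y: x + y * 2)
print(ast.unparse(tree))  # x + y * 2
```

The hard part, as noted above, is exactly what this sketch cannot do: any jump instruction falls into the `NotImplementedError` branch.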
Now I'm going to try running through a complete set of Python expressions (no statements) to see if this always gives me single-node graphs, as I would expect.
Hypothesis: all Python expressions produce an SCFG graph of length 1.
Result: yup, they do.
```python
import ast

from numba_rvsdg.core.datastructures.byte_flow import ByteFlow

# all of the expression AST node types
expression_types = [
    x
    for x in dir(ast)
    if (
        isinstance(getattr(ast, x), type)
        and issubclass(getattr(ast, x), ast.expr)
        and x not in (
            "expr",
            "Num",           # deprecated since Python 3.8
            "Str",           # |
            "Bytes",         # |
            "NameConstant",  # |
            "Ellipsis",      # V
        )
    )
]

# remove an expression node type when we see one
class RemoveFromList(ast.NodeVisitor):
    def visit(self, node):
        if type(node).__name__ in expression_types:
            expression_types.remove(type(node).__name__)
        return super().visit(node)

remove_from_list = RemoveFromList()

# examples of all expression types in Python
for string_expression in [
    "x.y",                                      # Attribute
    "await f()",                                # Await
    "x + y",                                    # BinOp
    "x and y",                                  # BoolOp
    "f()",                                      # Call
    "x < y",                                    # Compare
    "123",                                      # Constant
    "{1: 2}",                                   # Dict
    "{x: x for x in collection}",               # DictComp
    "f'{x}'",                                   # FormattedValue
    "(x for x in collection)",                  # GeneratorExp
    "consequent if predicate else alternate",   # IfExp
    "f'x'",                                     # JoinedStr
    "lambda: None",                             # Lambda
    "[1, 2, 3]",                                # List
    "[x for x in collection]",                  # ListComp
    "x",                                        # Name
    "(x := 4)",                                 # NamedExpr
    "{1, 2, 3}",                                # Set
    "{x for x in collection}",                  # SetComp
    "x[1:2]",                                   # Slice
    "(x, *y)",                                  # Starred
    "x[1]",                                     # Subscript
    "(1, 2, 3)",                                # Tuple
    "~x",                                       # UnaryOp
    "yield x",                                  # Yield
    "yield from x",                             # YieldFrom
]:
    # has to be wrapped in a function so that Await is valid
    asyncy = "async " if "await " in string_expression else ""
    syntax_tree = ast.parse(f"{asyncy}def _(): {string_expression}")
    # remove from the list of expression types
    remove_from_list.visit(syntax_tree)
    code = compile(syntax_tree, "<string>", "exec")
    # hypothesis: SCFG for all bare expressions is a one-node graph
    byteflow = ByteFlow.from_bytecode(code)
    assert len(byteflow.scfg.graph) == 1

# did we miss any expression types?
assert len(expression_types) == 0
```
@jpivarski what you have discovered is correct. Any linear sequence of bytecodes without `if-else` or `for` constructs will produce only a single `PythonBytecodeBlock`. As soon as control flow or looping are introduced you get multiple blocks. You can assume that any `PythonBytecodeBlock` contains a linear sequence of Python bytecodes / expressions.
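This claim is easy to sanity-check with the stdlib `dis` module alone, independently of numba-rvsdg: a straight-line function contains no jump opcodes, so there is nothing to split on.

```python
import dis

def linear(x, y):
    return x * y + 1

# a straight-line function has no jump opcodes, hence a single basic block
has_jumps = any("JUMP" in ins.opname for ins in dis.get_instructions(linear))
print(has_jumps)  # False
```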
@jpivarski you can also run the `restructure` method from https://numba-rvsdg.readthedocs.io/en/latest/reference/datastructures.html#numba_rvsdg.core.datastructures.byte_flow.ByteFlow.restructure

This will apply the LOOP-RESTRUCTURE and BRANCH-RESTRUCTURE algorithms from sections 4.1 and 4.2 of https://dl.acm.org/doi/pdf/10.1145/2693261 and then render the graph to see what it looks like.
The main goal of the RVSDG package right now is to transform Python bytecode into an SCFG which can be restructured.
Note also, that some of the top-level classes are scheduled for removal in:
https://github.com/numba/numba-rvsdg/issues/75
But it shouldn't be that big of a change.
> Any linear sequence of bytecodes without `if-else` or `for` constructs will produce only a single `PythonBytecodeBlock`. As soon as control flow or looping are introduced you get multiple blocks.

If you get graph structures for `if`-`else` and `for`, I think there would also be graph structures for `while`.
For the following statement types, here's what I think about whether they'll split into multiple graph nodes. (YES = you said that they split, MAYBE = I think they would, and NO = I don't think they would; I think several of these can be part of the same graph node.)

- function, class, and `lambda` definitions: each compiles to its own `__code__` object
- `async for` (whether Numba supports `async for` or not is a different matter)
- `while`: `for` splits nodes, so `while` should, too
- `match`: `if` splits nodes, so `match` should, too
- `try`-`except` and `try`-`finally` (same comment about Numba supporting it, though)
- `break` and `continue`: part of the `for` and `while` logic

> The main goal of the RVSDG package right now is to transform Python bytecode into an SCFG which can be restructured.
It's probably the case that I'll only want to look at restructured graphs, then. Right now, I'm still getting a basic understanding of these things.
I wonder if I should first focus on the parts that aren't spread across graph nodes. The parsing rules from uncompyle6 would still be useful here, applied to each graph node as an individual sequence of instructions, since they still need to be turned into AST trees. It would be like this, which targets Python expressions only, but making `ast.AST` nodes instead of the `rejig.syntaxtree` thing in that old demo.
As for the control flow parts that are spread across graph nodes, they could be reconstructed into `for`, `if`, `while`, etc. from the restructured graph (somehow) instead of using uncompyle6.
And maybe I shouldn't be thinking of this project as using uncompyle6, but maybe being part of uncompyle6, as a way to help @rocky get that tool up-to-date with recent Pythons.
(Hi, @rocky, bringing you into this conversation for the first time. I've been thinking about the problem of converting bytecode → AST for recent versions of Python, which is also a problem for Numba and this RVSDG formalism promises to fix it. I've been doing these tests to see if the bytecode interpretation part can remain independent of LLVM/pure Python. Should we talk on https://github.com/rocky/python-uncompyle6/discussions about whether you'd want to fold that into your project? This is all very early/preliminary/just seeing what's possible.)
> Note also, that some of the top-level classes are scheduled for removal in https://github.com/numba/numba-rvsdg/issues/75. But it shouldn't be that big of a change.
I understand that it's in flux. I'll keep doing `git pull` as things change.
@jpivarski - I am interested, but it will take me some time for me to get up to speed.
Some initial comments. Overall I think the approach or idea is sound, although this is going to be a lot of work.
(Believe it or not, that the approach is sound is a big deal. I have read about other approaches that feel a little whacky and not that sound.)
A couple of observations though when I look at https://github.com/diana-hep/rejig/blob/master/reinterpreted-python/rejig/pybytecode.py
If this is to be able to work on more than one Python bytecode version, you will probably want to structure it in a way that segregates versions.
There are basically two approaches - one that treats each Python version as its own separate entity, and one that accommodates sharing between versions. For sharing between versions, the approach I have been using has been something like: Python 3.9 is like 3.8, except here and here and here. In other words, we say that this kind of rule or action changes in such and such a way.
Personally, I have been moving away from the shared approach on the grammar or parse side because it was just too complicated to keep track of things. So here I just cut and paste, and if I then need to adjust, okay.
On the semantic action side though there is still sharing going on. See https://github.com/rocky/python-uncompyle6/blob/master/uncompyle6/semantics/customize3.py#L354-L376
One other important thing to understand and keep in mind is the difference between uncompyle6, decompyle3, and yet another not-public fork called decompile-cfg.
In theory and ideally, there is no reason there could not be one code base for all versions. The problem has always been the amount of work needed to back-port stuff.
There is a lot of work needed still to backport decompyle3 code to uncompyle6. And although you can't see decompile-cfg, the same is true there to backport to decompyle3. (And the decompile-cfg code is still changing a bit on its own.)
So for my own use, such as in the trepan3k debugger, I import the right one of these code bases depending on version: for 3.7 and 3.8 I use decompyle3; otherwise, uncompyle6.
One recent and big change in the way I think about this is that I am now doing more in the way of abstracting the parse tree that I get from a parse of tokenized bytecode instructions. See https://github.com/rocky/python-uncompyle6/discussions/412#discussioncomment-5563422
This has a big and useful impact: when we get to the semantic actions there is less junk in the tree, which means things will be more the same across different Python versions (and Python bytecode versions). In the semantic actions we have to go by position in the tree - 0th item, 1st item, and so on. Again, reducing the noise or pseudo operations that don't have any real bytecode in them makes different versions, or different derivations in the same Python version, look more similar.
You will find a little bit of this in https://github.com/rocky/python-decompile3/blob/master/decompyle3/semantics/transform.py#L30-L67 and https://github.com/rocky/python-decompile3/blob/master/decompyle3/semantics/transform.py#L558
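The flavor of such a transform pass can be illustrated with the stdlib `ast` module (a hypothetical noise-removal pass of my own, not decompyle3's actual code): a `NodeTransformer` that strips docstring nodes so that documented and undocumented functions produce identically shaped trees for later, position-based processing.

```python
import ast

class StripDocstrings(ast.NodeTransformer):
    """Toy transform pass: drop docstring nodes so function bodies look
    the same whether or not they were documented."""

    def visit_FunctionDef(self, node):
        self.generic_visit(node)  # transform nested functions first
        if (
            node.body
            and isinstance(node.body[0], ast.Expr)
            and isinstance(node.body[0].value, ast.Constant)
            and isinstance(node.body[0].value.value, str)
        ):
            node.body = node.body[1:] or [ast.Pass()]
        return node

src = 'def f():\n    "docstring"\n    return 1\n'
tree = StripDocstrings().visit(ast.parse(src))
print(ast.unparse(tree))
```

After the pass, `f`'s body starts directly with the `return`, so a position-based semantic action sees the same 0th item in both variants.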
In the list above for branching ops there are:

- IfElse
- all of the Comprehensions.
- And/Or ?
Having looked around the uncompyle6 codebase (though not the other ones), I'm aware of how big of a project it is. What I'm hoping is that we can take advantage of the modularization of Numba to get some parts from shared effort, in particular the hardest parts. After all, Numba has to unravel the control structures, too, though Numba does not need to parse instructions into an AST. (I think Numba may be able to do a Python instruction → LLVM instruction mapping without knowing the tree structure of expressions.)[^1]
Separating by Python versions sounds like a good idea to me. Maybe the parsing code should be the same and each Python version should have its own set of grammar rules, copying where necessary? I say that because it's how I had thought that uncompyle6 works. I saw that there was a lot of separation between the Python versions already, and I had assumed that it went all the way down to the grammar rules.
> In the list above for branching ops there are:
>
> - IfElse
> - all of the Comprehensions.
> - And/Or ?
That's a good point. In https://github.com/numba/numba-rvsdg/issues/81#issuecomment-1636549934, these formed a single graph node, but they ought to be control flow (including and/or). Maybe that's still to be addressed.
[^1]: We should also keep in mind that the Numba team has a large project in front of them, and ensuring that Python bytecode → AST is not a goal of that project. I'm hoping to piggy-back this on it, since it seems so promising.
I think I am confused. What would you like to do, and how do you propose doing it? I see code, but perhaps I don't understand the bigger picture.
> We should also keep in mind that the Numba team has a large project in front of them, and ensuring that Python bytecode → AST is not a goal of that project. I'm hoping to piggy-back this on it, since it seems so promising.
Python bytecode to Python AST (or any sort of equivalent AST) is about as hard as Python bytecode to source code - no easier and no more difficult.
The reality is that, with respect to source code, I have not been able to keep up. What I have been doing as a hobby is something that would keep a couple of people fully occupied full time as a paid activity.
So I am confused about what you expect to see. A proof of concept? Being satisfied working in some limited context, e.g. for basic blocks only? (Then why RVSDG?) If a limited context, what limited context?
> Maybe the parsing code should be the same and each Python version should have its own set of grammar rules, copying where necessary? I say that because it's how I had thought that uncompyle6 works.
Yes, that is how uncompyle6 works. But not in the newer, not-public decompiler that works for 3.8-3.10: there, there is no grammar sharing. I mention this though as kind of an aside. decompyle3 is in the middle - I started moving in this direction.
The main thing that is important here on parsing is that, over time, there is too much drift in bytecode to have one piece of code that has tests for all the variations, or a grammar that is a superset of all the versions. Instead, it has been conceptually easier to separate these into separate pieces of code for a single bytecode version.
The bigger picture isn't clear yet. As a first step, I want to see if this is going to be possible, so it's a proof of concept. Beyond that, if it does work, it could be
I think it would be better overall if there's only one package that provides Python bytecode → AST, though option (1) would have to be something you're interested in, too, if it's going to go forward.
Still, the first thing (for me) to do is the proof of concept, to know if this is even a good approach. What scope? Definitely more than basic blocks only, though I was thinking of starting with that and using uncompyle6 for it, exactly because it's an already-solved problem. Then, when addressing something like an `if` statement or a `for` loop, I'd have a framework to fit it in.
> Python bytecode to Python AST (or any sort of equivalent AST) is about as hard as Python bytecode to source code - no easier and no more difficult.

Especially because `ast.unparse` is in the Python standard library.
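Indeed, the AST → source half of the problem has been a one-liner since Python 3.9:

```python
import ast

# once bytecode has been lifted to an AST, source recovery is one call away
tree = ast.parse("out = start\nfor i in range(num):\n    out += i ** 2\n")
print(ast.unparse(tree))
```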
> not public decompiler that works for 3.8-3.10
That's interesting, and it's the reason I added option (4) above. I didn't know that you were ready to make this leap.
> What I have been doing as a hobby is something that would keep a couple of people fully occupied full time as a paid activity.
That's why I thought that using numba-rvsdg was a good idea: it will be kept up-to-date with Python bytecode changes, and the Numba team is large enough that I know this will happen.
For a prototype, I'd start with https://github.com/rocky/python-decompile3 which handles 3.7 and 3.8 code only. The code style is generally more modern Python and it follows better the ideas in the branch that I believe addresses control flow issues.
I have written a bit about recent thoughts and work as to how to do decompilation in https://github.com/rocky/python-uncompyle6/discussions/412
> > Any linear sequence of bytecodes without `if-else` or `for` constructs will produce only a single `PythonBytecodeBlock`. As soon as control flow or looping are introduced you get multiple blocks.
>
> If you get graph structures for `if`-`else` and `for`, I think there would also be graph structures for `while`.

Oh, so sorry for the misunderstanding: `for` means "loop", and that includes `while`.
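That matches what `dis` shows: a `while` loop compiles to a conditional jump plus a backward jump, just like `for`, so it splits blocks the same way (a stdlib-only check of my own; exact opcode names vary by version):

```python
import dis

def count_down(n):
    out = 0
    while n > 0:   # conditional jump out of the loop, backward jump to the test
        out += n
        n -= 1
    return out

jump_ops = [ins.opname for ins in dis.get_instructions(count_down) if "JUMP" in ins.opname]
print(len(jump_ops) > 0)  # True
```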
> For a prototype, I'd start with https://github.com/rocky/python-decompile3 which handles 3.7 and 3.8 code only. The code style is generally more modern Python and it follows better the ideas in the branch that I believe addresses control flow issues.
numba-rvsdg supports 3.11 only.
I just cloned numba-rvsdg and built it. The project seems to be pretty new - less than a year old.
It seems similar to https://github.com/rocky/python-control-flow, although the end goal is to produce an RVSDG rather than a control-flow graph with dominator and reverse-dominator information.
If the future of Python is like its past, then if this project is to be long-lived it will have to come up with a strategy for supporting more Python versions than 3.11. Initially, the path of least resistance is to assume that the next version's bytecode won't be that different from the last.
That generally happens, but periodically it doesn't. Python went from a variable 1- or 3-byte code format to a fixed 2-byte "wordcode" format, I think between 3.5 and 3.6. Nowadays this kind of change wouldn't be a problem, because I think Python now has abstracted instructions from bytecode. But back in the 3.5-to-3.6 change, I don't believe that was the case. And although I haven't looked at the details all that much, Python 3.11 is a bit different from 3.10.
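The wordcode format is easy to observe from Python itself: since CPython 3.6 every instruction (including the inline cache entries of 3.11+) occupies a multiple of 2 bytes, so every instruction starts at an even offset.

```python
import dis

code = compile("x + y * 2", "<string>", "eval")
offsets = [ins.offset for ins in dis.get_instructions(code)]
# fixed-width "wordcode": every instruction starts at an even byte offset
print(all(off % 2 == 0 for off in offsets))  # True
```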
Assuming 3.12 will be not that different from 3.11, there is the question of whether you'll allow handling a different bytecode from the bytecode used by the Python interpreter running numba-rvsdg. If so, then I believe you will need to use something other than the Python `dis` module. And that's why I wrote and use `xdis` instead.
But as I said, this kind of complication and decision on how to handle more than one bytecode is definitely something that can be put off for later.
Now that I understand numba-rvsdg better, let me close with how its overall approach to understanding control flow is different from the decompiling approach in uncompyle6 and its descendants.
uncompyle6 leverages a couple of features of the high-level bytecode interpreter approach (also found in Ruby, Smalltalk, Lua, and various Lisps like Emacs Lisp). First, for all practical purposes, the translation to bytecode is done by one kind of parsing system for any given version. (What happens afterwards can vary with, say, pyston or pypy.) The fact that there is just one source-code-to-bytecode translator reduces variation in how a given construct might conceivably be represented in bytecode: the interpreter largely picks one way and uses that.
Given this, it becomes easier to use parsing technology that is similar to (but not the same as) that in conventional compilers. Because the translation is like DNA - it has its quirks and leaves fingerprints all over the place - we can use those quirks to retrieve, more often, the exact Python construct that was used, rather than some other construct or sequence that is semantically equivalent.
The control-flow-to-higher-level-construct approach used in numba-rvsdg, while more conventional and general purpose, does not make use of these kinds of clues that specific compilers put in. Uncompyle6 can distinguish `if a and b:` from `if a: if b:` in some bytecode versions when the compiler produces different bytecode for these. (Nowadays this happens less often.)
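This is straightforward to probe with `dis`: compile both spellings and compare the opcode sequences. On many recent CPython versions they coincide (which is exactly why the fingerprint clues have become less reliable), though the result varies by version, so no particular outcome is asserted here.

```python
import dis

def combined(a, b):
    if a and b:
        return 1

def nested(a, b):
    if a:
        if b:
            return 1

ops1 = [ins.opname for ins in dis.get_instructions(combined)]
ops2 = [ins.opname for ins in dis.get_instructions(nested)]
# on many recent versions the opname sequences coincide; jump targets may differ
print(ops1 == ops2)
```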
Python, in particular, has this rich and whacky set of control-flow structures, such as the "else" branches on "for", "while" and "try" statements. I doubt numba-rvsdg would bother to go through the effort needed to look for these, as opposed to trying to express them with simpler and more basic control-flow structures involving ifs wrapped around something else. Should it try to tackle "try ... else ... finally", I think you'll find the control-flow algorithm getting a bit more complicated.
In sum, right now the two approaches to handling control flow, while each is valid, are a bit different from each other.
Thanks @rocky for the detailed insights.
> The control-flow-to-higher-level-construct approach used in numba-rvsdg, while more conventional and general purpose, does not make use of these kinds of clues that specific compilers put in. Uncompyle6 can distinguish `if a and b:` from `if a: if b:` in some bytecode versions when the compiler produces different bytecode for these. (Nowadays this happens less often.)
Yes, unlike uncompyle6, the goal for numba-rvsdg is not to provide a syntactic equivalent (AST recovery) but instead to provide a semantic equivalent. For Python compilers (particularly Numba), syntactic information is not important. The semantic equivalent in RVSDG will provide compilers a nice data-centric view (where control flow is more or less implied) that makes compiler passes a lot easier to write. Also, we don't want to rely on the bytecode "clues" and "quirks" because they change over time. The heuristic-based approach is a huge burden in the current Numba bytecode frontend, and we anticipate more bytecode changes due to the faster-cpython effort, which will be adding more bytecode-level optimizations.
> Python, in particular, has this rich and whacky set of control-flow structures, such as the "else" branches on "for", "while" and "try" statements. I doubt numba-rvsdg would bother to go through the effort needed to look for these, as opposed to trying to express them with simpler and more basic control-flow structures involving ifs wrapped around something else. Should it try to tackle "try ... else ... finally", I think you'll find the control-flow algorithm getting a bit more complicated.
We will need to tackle all of the whacky Python control flow to reach parity with what the Numba bytecode frontend currently supports. The good thing is that `for`, `while`, and `else` are modeled by jumps/branches in the bytecode, so no additional work is needed. With Python>=3.11, try-except-finally is encoded in a side table in `__code__.co_exceptiontable`. The plan is to represent try/except as specific regions so the exception flow is again implied.
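The side-table encoding is visible from Python: on 3.11+ a `try`/`except` populates `co_exceptiontable` instead of emitting `SETUP_*` opcodes into the instruction stream, as a version-guarded check shows (example function is mine):

```python
import dis
import sys

def safe_div(a, b):
    try:
        return a / b
    except ZeroDivisionError:
        return float("inf")

if sys.version_info >= (3, 11):
    # exception handling lives in a side table, not in the instruction stream
    print(len(safe_div.__code__.co_exceptiontable) > 0)
else:
    # older versions push an explicit block with a SETUP_* opcode
    print(any(ins.opname.startswith("SETUP") for ins in dis.get_instructions(safe_div)))
```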
@gordonwatts
FWIW, the following PR does go from SCFG -> AST -- but the SCFG must be generated from a Python AST not bytecode:
https://github.com/numba/numba-rvsdg/pull/127
I suppose however that the `SCFG2ASTTransformer` could be a good place to start going from bytecode back to an AST. Essentially the control-flow constructs are already generated, but the bytecode that describes assignments and expressions would also need to be translated into corresponding AST constructs. So anyway, my implementation listed above does about 50% of the work. Maybe it could be of benefit to someone else.
@esc I had forgotten about this PR. Since 2023, I have been thinking and working on this, and after a few false starts, I have come to the belief that the way to handle control flow in high-level bytecode languages that are grammar-based or LLM-based is not by selecting the control flow, but instead by marking dominator regions in basic blocks and letting the grammar or LLM do the actual language-specific control-flow selection.
Basically, this is what https://github.com/rocky/python-control-flow does. Most of it is generic in terms of computing the basic blocks and dominator regions. There is a step at the end that is very specific to marking up the bytecode so that a grammar based approach can use it.
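The dominator-region idea can be sketched in a few lines with the classic iterative data-flow algorithm, run here on the four-block loop graph printed earlier in this thread (a generic sketch of the textbook algorithm, not python-control-flow's actual code):

```python
# iterative dominator computation over the loop example's block graph
graph = {
    "block_0": ["block_1"],             # entry
    "block_1": ["block_2", "block_3"],  # loop header: body or exit
    "block_2": ["block_1"],             # loop body, jumps back to the header
    "block_3": [],                      # exit
}

# invert the edges to get predecessors
preds = {n: [] for n in graph}
for n, succs in graph.items():
    for s in succs:
        preds[s].append(n)

nodes = list(graph)
entry = nodes[0]
dom = {n: set(nodes) for n in nodes}  # start from "everything dominates everything"
dom[entry] = {entry}
changed = True
while changed:
    changed = False
    for n in nodes[1:]:
        # a node is dominated by itself plus whatever dominates all predecessors
        new = {n} | set.intersection(*(dom[p] for p in preds[n]))
        if new != dom[n]:
            dom[n] = new
            changed = True

print(sorted(dom["block_3"]))  # ['block_0', 'block_1', 'block_3']
```

The loop header `block_1` dominates both the body and the exit, which is the kind of region information a grammar (or, per the comment above, an LLM) could then use to select the language-specific control-flow construct.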
Looking again at the PR based on my current understanding, I am somewhat negatively disposed to the kind of approach in this PR. From a code size perspective alone, this is too small. It is like trying to fit 100 cubic meters of information in a 10 cubic-meter box.
Also, I forgot to mention that I recently gave a talk at BlackHat Asia 2024 that briefly, at the end, explained some of the thinking here. A link to the slide fragments and text can be found at https://rocky.github.io/blackhat-asia-2024-additional/all-notes-print.html, specifically https://rocky.github.io/blackhat-asia-2024-additional/all-notes-print.html#cfg . Eventually, the video taken of this should be publicly available. (Right now I think you can get it if you were registered for the conference or pay some fee to see videos.)
Thanks for pointers to follow-up, @esc and @rocky! Both of these are better ideas than the work I started here, so I'm going to close this issue and see how your work develops. If I find a way to get back into this, I'll see if I can help.
From https://github.com/numba/numba-rvsdg/issues/80#issuecomment-1610043380, I noticed that it's possible to make an RVSDG from Python bytecode already. (I hadn't been paying close attention to all the progress that was being made here, but I just noticed it now.)
This is a note to myself (which I will assign to myself) to write a demonstrator of Python bytecode → Python AST, at least for a subset of simple expressions, to get a sense of how it will go. I understand that the RVSDG development is very much in flux and that the demonstrator will get out of date quickly, but it would be good for me to do a first test-run, to get familiar with RVSDG as a concept and discover what the fundamental issues are.