jython / jython

Python for the Java Platform
https://www.jython.org
Other
1.14k stars 185 forks source link

Is there any challenges when translating Python into JAVA IR? #311

Closed AGFACBNNR closed 4 months ago

AGFACBNNR commented 4 months ago

Recently I'm trying to translate Python3 code into JAVA IR to help statically analyze Python code.

But I'm not sure whether this idea is feasible.

Therefore, I'm wondering what's the main challenges during the translating process and whether these challenges can be overcome.

Thanks a lot for your help.

jeff5 commented 4 months ago

IR meaning ... JVM byte code?

That's quite a big undertaking. Python is pretty good at generating an AST (just in pure Python).

Parsing is performed by the PEG parser, which is built from actions in the grammar (fragments of C for CPython). It generates an AST that consists of Python objects, implemented in C (of course).

The CPython compiler generates Python byte code from that AST. Actually, it generates an intermediate representation (IR) that is then turned into an optimised Python byte code when it runs.

Surely if your aim is static analysis, the Python IR is a better target? The problem there is that it evolves quite quickly from one version to the next. Maybe that's the attraction of JVM byte code, which has been stable for decades?

The approach we are taking (experimentally so far) is to adopt the Python PEG parser, define AST classes in Java, and define actions in Java so as to obtain a parser in Java. We would then generate JVM bytecode from the AST, but haven't worked on that bit at all yet.

However, I have incomplete ideas about the JVM byte code equivalent to some basic Python IR. It would make heavy use of the invokedynamic opcode. That, I think, would be your problem: invokedynamic is effectively a hook for user-defined operations that extend the JVM language. Thus, I think, you have a similar problem to the shifting opcodes of CPython. But worse, because we can't generate even one example yet.

AGFACBNNR commented 4 months ago

Thank you very much for your reply.

Actually, due to its dynamic feature, there's no effective tools for Python static analysis. Therefore, I'm considering about translating Python into JVM bytecode or Soot IR. However, I am uncertain whether JVM bytecode can accurately represent Python code without any loss of information. My interest in this approach was piqued by the impressive work done with Jython2, which has led me to believe that such a translation may indeed be feasible.

I would greatly appreciate it if you could provide me with more details regarding the invokedynamic and shifting opcodes of CPython that you mentioned in your previous communication. Any additional information you could share on this topic would be immensely helpful in my research.

Thank you once again for your time and assistance.

jeff5 commented 4 months ago

The classic paper introducing invokedynamic is https://www.semanticscholar.org/paper/Bytecodes-meet-combinators%3A-invokedynamic-on-the-Rose/b4da83cebdbc0b1e5dea11cb151a61e9d8bf7112 . There was a rash of articles and presentations around 2011-2015 on the subject. See also.

A good way to explore Python byte code is with the dis module. The description begins with a warning about the way byte code changes from one minor version to the next and gives a few recent versions at which the change was radical.

Jython 2 was designed before invokedynamic so it suffers from the problems Rose's paper discusses and Jython 3 is an opportunity to benefit from the feature. My experiments with simple expression evaluation indicate there is a marginal performance gain, but actually Jython 2 is pretty good.

I am uncertain whether JVM bytecode can accurately represent Python code without any loss of information.

Information is always discarded in compilation. (For example, compilers for many languages are able to discard symbol information available from the text once they have assigned addresses and offsets to variables.) The aim is to create something with the run-time behaviour intended by the original program text.

due to its dynamic feature, there's no effective tools for Python static analysis

It seems to me this is in the nature of the language, so must be true of any implementation. Given the text:

def f(a, b):
    return a+b

you have no idea what will happen on a call and what side effects it may have, since the meaning of addition is entirely up to a and b to decide. This has to be the case because it is Python, whether the compiled form is a CPython BINARY_OP(ADD_OP) or a JVM dynamic call site with the binary operations bootstrap method.

AGFACBNNR commented 4 months ago

It seems to me this is in the nature of the language, so must be true of any implementation.

I cannot agree more. The addition example is just the one I used when I explained to my monitor the challenges in Python static analysis.

You are right. I’ve got the answer I need. Thanks a lot for your kind help.