adarshp opened this issue 5 years ago
@adarshp @pauldhein That's an interesting idea and we can certainly use that approach. I don't think there are any technical hurdles to doing this. (The type information for arguments and return values should be available from declarations in the original source code.)
There was one thing that I found really interesting about this. Based on my understanding of how eval is implemented, as well as various online posts (e.g., "Why eval/exec is bad", "Be careful with exec and eval in Python"), I had fully expected the code using eval and exec to be significantly slower than the straightforward function definition. When I did an experiment, though, the three versions turned out to have essentially identical performance. I was not expecting this. :) The code I used for my experiment is given below. The run times were:
T0 = 19.066645 [straightforward function]
T1 = 20.407631 [eval]
T2 = 20.169486 [exec]
Here is the timing experiment:
#!/usr/bin/env python3

import time

TMAX = 100.00
TMIN = 0.01
NITERS = 100000000

def get_time0(niters):
    n = 0
    time0 = time.time()
    for i in range(niters):
        n += PETPT__lambda__TD_0_0(TMAX, TMIN)
    time1 = time.time()
    return time1 - time0

def get_time1(niters):
    n = 0
    time0 = time.time()
    for i in range(niters):
        n += PETPT__lambda__TD_0_1(TMAX, TMIN)
    time1 = time.time()
    return time1 - time0

def get_time2(niters):
    n = 0
    time0 = time.time()
    for i in range(niters):
        n += PETPT__lambda__TD_0_2(TMAX, TMIN)
    time1 = time.time()
    return time1 - time0

# Version 0: straightforward function definition.
def PETPT__lambda__TD_0_0(TMAX, TMIN):
    TD = ((0.6*TMAX)+(0.4*TMIN))
    return TD

# Version 1: the same function constructed with eval.
PETPT__lambda__TD_0_1 = eval("lambda TMAX, TMIN: ((0.6*TMAX)+(0.4*TMIN))")

# Version 2: the same function constructed with exec, with type annotations.
exec("def PETPT__lambda__TD_0_2(TMAX: float, TMIN: float) -> float: return ((0.6*TMAX)+(0.4*TMIN))")

def main():
    t0 = get_time0(NITERS)
    print('T0 = {:f}'.format(t0))
    t1 = get_time1(NITERS)
    print('T1 = {:f}'.format(t1))
    t2 = get_time2(NITERS)
    print('T2 = {:f}'.format(t2))

main()
That's good news, thanks for doing that benchmarking! I would really like to go into the guts of the program analysis code sometime in the near future and understand it better as well...
@adarshp @cl4yton I think it would be useful for us to sit down and nail down our implementation decisions along with a priority order on them so that we're all in agreement.
@skdebray, from my understanding of why eval could be slow, the prohibition applies not to defining functions with eval, but to doing calculations inside eval. If, as in your benchmark, you eval a function definition, then the resulting function is equivalent to the function defined in literal code. I think you will see a performance difference with:
# parsing and interpreting is slooooow
# (x += 1 is a statement, so it needs exec rather than eval)
x = 0
for i in range(1000):
    exec("x += 1")
print(x)

# faster because we don't have to parse and interpret the "x += 1" every time
x = 0
for i in range(1000):
    x += 1
print(x)
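One way to quantify the difference with the standard library (a quick sketch; exact numbers will vary by machine):

import timeit

# The exec() version re-parses and re-compiles the string "x += 1" on
# every iteration, so compilation cost dominates the actual addition.
slow = timeit.timeit('exec("x += 1")', setup="x = 0", number=100000)
fast = timeit.timeit("x += 1", setup="x = 0", number=100000)
print("exec loop: {:.3f}s  plain loop: {:.3f}s".format(slow, fast))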
I think that using a pickle-specific format to represent the code really commits you to being Python-specific. For example, how would you write an interpreter for GrFNs in another language if the functions are stored as pickles of Python functions?

Having a language-neutral representation of the functions would be helpful. For example,
def PETPT__lambda__TD_0(TMAX, TMIN):
    TD = ((0.6*TMAX)+(0.4*TMIN))
    return TD
could be represented as
(def PETPT__lambda__TD_0
  (TMAX, TMIN)
  (body (assign TD ((0.6*TMAX)+(0.4*TMIN)))
        (return TD)))
which is a lispy way of writing this JSON value:
{ "funcdef": {"name": "PETPT__lambda__TD_0",
"args": ["TMAX", "TMIN"],
"body": [ "assign TD ((0.6*TMAX)+(0.4*TMIN))",
"return TD"]
}}
The particular syntax for expressing the functions doesn't matter but I think that whatever it is, it should be something that can be interpreted by multiple languages. That way someone can come along and write a GrFN to L compiler where L is some awesome programming language.
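To make that concrete, here is a toy GrFN-to-Python "compiler" for the JSON shape above (a sketch only: it assumes body statements are stored as "assign <var> <expr>" / "return <expr>" strings, whereas a real representation would encode expressions structurally rather than as opaque strings):

import json

FUNCDEF = json.loads("""
{ "funcdef": {"name": "PETPT__lambda__TD_0",
              "args": ["TMAX", "TMIN"],
              "body": [ "assign TD ((0.6*TMAX)+(0.4*TMIN))",
                        "return TD"]
}}
""")

def compile_to_python(funcdef):
    # Emit Python source for a funcdef of the shape shown above.
    fd = funcdef["funcdef"]
    lines = ["def {}({}):".format(fd["name"], ", ".join(fd["args"]))]
    for stmt in fd["body"]:
        kind, _, rest = stmt.partition(" ")
        if kind == "assign":
            var, _, expr = rest.partition(" ")
            lines.append("    {} = {}".format(var, expr))
        elif kind == "return":
            lines.append("    return {}".format(rest))
    return "\n".join(lines)

# Recovers the original Python definition from the neutral form.
print(compile_to_python(FUNCDEF))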
@jpfairbanks You're absolutely correct about the eval'd code's performance in this situation -- my initial misconception was just one of my un-thought-through prejudices ("All Xs are Y! Grrr!"). Once I saw the numbers and looked at the code, the reason immediately became clear.
Re: "using a pickle specific format to represent the code really commits you to being python specific" -- your point is well taken, and was raised also by @pratikbhd (also working on this project). We front-end folks are happy to provide @adarshp whatever representation allows his team to move forward quickly. But I'd be happy to discuss this further.
Yeah, I agree that "just pickle it" is really easy to get going with early, but I think in the long run it limits generalizability. I think there are functions that cannot be pickled. This post https://medium.com/@emlynoregan/serialising-all-the-functions-in-python-cd880a63b591 shows that dill cannot handle mutually recursive functions. I think this restriction also applies to libraries that use numpy/scipy, because they link against particular C/Fortran code and are generally pretty complicated.
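For a concrete example of the baseline limitation (standard-library pickle serializes functions by reference - module plus qualified name - so even a module-level lambda fails; dill serializes by value but, per the post above, still has gaps):

import pickle

def f(x):
    return x + 1

# pickle stores f by reference: the bytes contain the module and
# qualified name, not the function body.
g = pickle.loads(pickle.dumps(f))
print(g(1))  # 2

# A lambda has no importable qualified name, so pickling it fails.
try:
    pickle.dumps(lambda x: x + 1)
except pickle.PicklingError as e:
    print("cannot pickle lambda:", e)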
Is there some document/code that describes how the Fortran-to-Python translation is done? I would like to look there to see what the intermediate representation looks like. Without reading the code, I assume it looks something like the representation sketched above.
Hi @jpfairbanks, thanks for pointing that out, I was unaware of those limitations of pickling. Looks like we have some thinking to do on our end about the IR. I believe the IR (as opposed to the IR') that you referred to in #132 would be what the variable outputDict points to here: https://github.com/ml4ai/delphi/blob/f9c498d03870f7a537c0f6f32bfe85bc51f74b59/delphi/program_analysis/autoTranslate/scripts/translate.py#L440
The script translate.py contains the code that transforms the XML output of the OpenFortranParser into a dict, which then gets transformed to Python code by pyTranslate; that Python code is in turn parsed by the Python ast module to create an AST, which genPGM.py then processes to produce the GrFN object.
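A minimal illustration of that ast step with the running example (this is just the standard-library parse that genPGM.py builds on, not the actual genPGM.py code):

import ast

source = """
def PETPT__lambda__TD_0(TMAX, TMIN):
    TD = ((0.6*TMAX)+(0.4*TMIN))
    return TD
"""

# ast.parse yields the tree that genPGM.py walks to emit the GrFN.
tree = ast.parse(source)
for node in ast.walk(tree):
    if isinstance(node, ast.FunctionDef):
        print(node.name, [a.arg for a in node.args.args])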
An example of the pipeline can be found in the test module test_program_analysis.py, in these lines: https://github.com/ml4ai/delphi/blob/f9c498d03870f7a537c0f6f32bfe85bc51f74b59/tests/test_program_analysis.py#L22-L53
Also, while the idea of moving to a less language-specific representation does sound attractive to me, I'm hesitant to commit to it with a concrete timeline, since it would involve a non-trivial refactoring of our codebase. I am not sure we have the requisite person-hours to implement this change immediately (or more specifically, that @skdebray and @pratikbhd do, since they will be doing the actual heavy lifting here, and I don't want to be cavalier about committing their time), especially given the short 18-month timeline and tight schedule of deliverables negotiated in our OT contract with DARPA. Besides, for the ASKE program we are restricting ourselves to analyzing Fortran programs, so the numpy/scipy issue is, I think, a bit beyond the scope of this iteration of the AutoMATES project.
I say 'this iteration' because I am optimistic that if all the ASKE performer groups do well on this program, it will help make the case for a four- or five-year full DARPA program with the same themes as ASKE, which would really be required to take the tools we are making from prototypes to more full-fledged products, and would give us the breathing room to prioritize generalization rather than quick prototyping.
That being said, I do think it is important to generalize the representation, and I would like to revisit this issue down the road, so I'm going to leave this issue open.
Also, we welcome pull requests. If this less language-specific IR would really help boost the work you are doing in the ASKE program by giving you access to semantically enriched computation graphs from Fortran programs (which would make sense, since they comprise a not-insignificant amount of existing scientific code), and you believe you can implement the change in a timeframe consistent with meeting your own schedule of deliverables, you are welcome to try, and I would be happy to help as much as possible.
Yeah I totally understand about the time. The schedule is crazy compressed.
It looks like swapping out implementations of the following functions could yield an implementation that generates any kind of code (this is a note to future implementors; see the sketch after the listing).
self.printFn = {
    "subroutine": self.printSubroutine,
    "program": self.printProgram,
    "call": self.printCall,
    "arg": self.printArg,
    "variable": self.printVariable,
    "do": self.printDo,
    "index": self.printIndex,
    "if": self.printIf,
    "op": self.printOp,
    "literal": self.printLiteral,
    "ref": self.printRef,
    "assignment": self.printAssignment,
    "exit": self.printExit,
    "return": self.printReturn,
    "function": self.printFunction,
    "ret": self.printFuncReturn,
}
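For instance, a backend for another target language might look roughly like this (a purely hypothetical sketch: PythonCodeGenerator and the node fields are stand-ins for whatever the real printer class in pyTranslate looks like; the point is only that the dispatch table makes the target language pluggable):

# Hypothetical: the base class name and node fields are illustrative,
# not the actual pyTranslate API.
class JuliaCodeGenerator(PythonCodeGenerator):
    def __init__(self):
        super().__init__()
        # Override only the entries whose syntax differs in Julia.
        self.printFn["assignment"] = self.printJuliaAssignment
        self.printFn["do"] = self.printJuliaDo

    def printJuliaAssignment(self, node):
        return "{} = {}".format(node.target, node.value)

    def printJuliaDo(self, node):
        # Julia range syntax: for i in lo:hi ... end
        return "for {} in {}:{}".format(node.var, node.start, node.stop)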
I don't expect to be working with Fortran programs yet. We are starting with programs written in Julia as a source language, because that will let us focus on the metamodeling, reasoning, and transformation tasks without having to solve the very difficult problem of information extraction from lower-level code that you guys are tackling. We also found the epirecipes cookbook, which gives us some "real world" models written by scientists, with explanatory text and references to the original papers.
Of course we will be restricted with regard to established applications, which are often written in Fortran. But when we get to that bridge (in year 3 of our 5-year program 😉) we will cross it by writing a Julia backend for AutoMATES and pressing "the easy button".
@pauldhein and I were discussing this a while ago - it occurred to us that since the output of the program analysis pipeline is a pickled Python object, there is no reason, in principle, why the lambda functions couldn't be pickled alongside the rest of the output. For example, the following function in PETPT_lambdas.py:
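def PETPT__lambda__TD_0(TMAX, TMIN):
    TD = ((0.6*TMAX)+(0.4*TMIN))
    return TD

could be constructed as follows:

PETPT__lambda__TD_0 = eval("lambda TMAX, TMIN: ((0.6*TMAX)+(0.4*TMIN))")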
Here, the string argument to eval could be constructed in the same way the second line of each of the existing lambda functions in the lambdas.py files is built up from parsing the XML AST output of OFP. Alternatively (and this seems to me to be the right way), one could take advantage of type annotations and use the more powerful def syntax for declaring functions (assuming we can get these types - can we?):
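exec("""
def PETPT__lambda__TD_0(TMAX: float, TMIN: float) -> float:
    return ((0.6*TMAX)+(0.4*TMIN))
""")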
Later, the PETPT__lambda__TD_0 object can be used as a value in the dict object produced by genPGM.py. Since functions are first-class objects in Python, you can actually set attributes on functions as well - perhaps this might make it easier to keep track of things like the function type (assign/lambda/condition, etc.), the reference, and so on:
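Something along these lines, say (the attribute names here are just illustrative, not an existing schema):

# Illustrative attribute names; any metadata the GrFN needs could go here.
PETPT__lambda__TD_0.type = "assign"
PETPT__lambda__TD_0.reference = "PETPT__TD_0"
PETPT__lambda__TD_0.source = "lambda TMAX, TMIN: ((0.6*TMAX)+(0.4*TMIN))"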
And then, if someone wants to serialize the GrFN object to a JSON file, we could define a function along the following lines:
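A minimal sketch, assuming the illustrative attributes set above; the body is represented by its source string, since the function object itself is not JSON-serializable:

import json

def to_json(fn):
    # Serialize the function's metadata; the "source" string stands in
    # for the code object.
    return json.dumps({
        "name": fn.__name__,
        "type": fn.type,
        "reference": fn.reference,
        "source": fn.source,
    })

print(to_json(PETPT__lambda__TD_0))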
Not super urgent, but I do think it might be an investment worth making to simplify things in the long run...