ml4ai / delphi

Framework for assembling causal probabilistic models from text and software.
http://ml4ai.github.io/delphi
Apache License 2.0

Pickling lambda functions #130

Open adarshp opened 5 years ago

adarshp commented 5 years ago

@pauldhein and I were discussing this a while ago - it occurred to us that since the output of the program analysis pipeline is a pickled Python object, there is no reason, in principle, why the lambda functions couldn't be pickled alongside the rest of the output. For example, the following function in PETPT_lambdas.py:

def PETPT__lambda__TD_0(TMAX, TMIN):
    TD = ((0.6*TMAX)+(0.4*TMIN))
    return TD

Could be constructed as follows:

PETPT__lambda__TD_0 = eval("lambda TMAX, TMIN: ((0.6*TMAX)+(0.4*TMIN))")

Here, the string argument to eval could be constructed the same way that the second line of each existing lambda function in the lambdas.py files is built up from parsing the XML AST output of OFP.
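For instance (a minimal sketch; the variable names below are illustrative, not the names actually used in the pipeline), the string could be assembled from pieces the translator already extracts:

name = "PETPT__lambda__TD_0"
args = ["TMAX", "TMIN"]
expr = "((0.6*TMAX)+(0.4*TMIN))"

# build the lambda source from its parts and evaluate it into a function object
lambdas = {}
lambdas[name] = eval(f"lambda {', '.join(args)}: {expr}")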

Alternatively (and this seems to me to be the right way), one could take advantage of type annotations and use the more powerful def syntax for declaring functions -

exec("def PETPT__lambda__TD_0(TMAX: float, TMIN: float) -> float: return ((0.6*TMAX)+(0.4*TMIN))")

(assuming we can get these types - can we?)

and later the PETPT__lambda__TD_0 object can be used as a value in the dict object produced by genPGM.py.
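One wrinkle worth noting (a sketch, not delphi code): exec populates a namespace rather than returning the function, so the generated def would be exec'd into a dict and the function object retrieved from it before being stored in the GrFN:

# exec the generated source into a fresh namespace, then pull the function out
namespace = {}
exec(
    "def PETPT__lambda__TD_0(TMAX: float, TMIN: float) -> float:"
    " return ((0.6*TMAX)+(0.4*TMIN))",
    namespace,
)
PETPT__lambda__TD_0 = namespace["PETPT__lambda__TD_0"]
print(PETPT__lambda__TD_0(100.0, 0.01))  # 60.004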

Since functions are first class objects in Python, you can actually set attributes for functions as well - perhaps this might make it easier to keep track of things like the function type (assign/lambda/condition, etc.), the reference, and so on:

PETPT__lambda__TD_0.fn_type = "lambda"
PETPT__lambda__TD_0.reference = 9
PETPT__lambda__TD_0.target = "TD"

And then if someone wants to serialize the GrFN object to a JSON file, we could define the following function:

import inspect

def to_json_serialized_dict(function):
    return {
        "name": function.__name__,
        "type": function.fn_type,
        "target": function.target,
        # massage the signature into a JSON-serializable list of argument names
        "sources": list(inspect.signature(function).parameters),
    }
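With the attributes set as above, the call would produce something like:

print(to_json_serialized_dict(PETPT__lambda__TD_0))
# {'name': 'PETPT__lambda__TD_0', 'type': 'lambda', 'target': 'TD',
#  'sources': ['TMAX', 'TMIN']}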

Not super urgent, but I do think it might be an investment worth making to simplify things in the long run...

skdebray commented 5 years ago

@adarshp @pauldhein That's an interesting idea and we can certainly use that approach. I don't think there are any technical hurdles to doing this. (The type information for arguments and return values should be available from declarations in the original source code.)

There was one thing about this that I found really interesting. Based on my understanding of how eval is implemented, as well as various online posts (e.g., "Why eval/exec is bad", "Be careful with exec and eval in Python"), I had fully expected the code using eval and exec to be significantly slower than the straightforward function definition. When I ran an experiment, though, the three versions turned out to have essentially identical performance. I was not expecting this. :) The code I used for my experiment is given below. The run times were:

T0 = 19.066645 [straightforward function]
T1 = 20.407631 [eval]
T2 = 20.169486 [exec]

Here is the timing experiment:

#!/usr/bin/env python3

import time

TMAX = 100.00
TMIN = 0.01

NITERS = 100000000

def get_time0(niters):
    n = 0
    time0 = time.time()
    for i in range(niters):
        n += PETPT__lambda__TD_0_0(TMAX, TMIN)
    time1 = time.time()
    return time1 - time0

def get_time1(niters):
    n = 0
    time0 = time.time()
    for i in range(niters):
        n += PETPT__lambda__TD_0_1(TMAX, TMIN)
    time1 = time.time()
    return time1 - time0

def get_time2(niters):
    n = 0
    time0 = time.time()
    for i in range(niters):
        n += PETPT__lambda__TD_0_2(TMAX, TMIN)
    time1 = time.time()
    return time1 - time0

def PETPT__lambda__TD_0_0(TMAX, TMIN):
    TD = ((0.6*TMAX)+(0.4*TMIN))
    return TD

PETPT__lambda__TD_0_1 = eval("lambda TMAX, TMIN: ((0.6*TMAX)+(0.4*TMIN))")

exec("def PETPT__lambda__TD_0_2(TMAX: float, TMIN: float) -> float: return ((0.6*TMAX)+(0.4*TMIN))")

def main():
    t0 = get_time0(NITERS)
    print('T0 = {:f}'.format(t0))

    t1 = get_time1(NITERS)
    print('T1 = {:f}'.format(t1))

    t2 = get_time2(NITERS)
    print('T2 = {:f}'.format(t2))

main()
adarshp commented 5 years ago

That's good news, thanks for doing that benchmarking! I would really like to go into the guts of the program analysis code sometime in the near future and understand it better as well...

skdebray commented 5 years ago

@adarshp @cl4yton I think it would be useful for us to sit down and nail down our implementation decisions along with a priority order on them so that we're all in agreement.

jpfairbanks commented 5 years ago

@skdebray, from my understanding of why eval can be slow, the concern applies not to defining functions with eval, but to doing calculations inside eval. If, as in your benchmark, you eval a function definition once, the resulting function is equivalent to one defined in literal code. I think you will see a performance difference with:

# slow: "x += 1" is parsed and compiled on every iteration
# (note: an augmented assignment is a statement, so it needs exec rather than eval)
x = 0
for i in range(1000):
    exec("x += 1")
print(x)

# faster: no repeated parsing and compiling of "x += 1"
x = 0
for i in range(1000):
    x += 1
print(x)
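(A middle ground, as a generic Python aside rather than anything from delphi: compiling the string once and exec'ing the cached code object in the loop removes the repeated parsing cost, which is exactly why the eval'd function definition above performs like the literal one.)

# parse/compile once, run the cached code object many times
code = compile("x += 1", "<string>", "exec")
x = 0
for i in range(1000):
    exec(code)
print(x)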

I think that using a pickle-specific format to represent the code really commits you to being Python-specific. For example, how would you write an interpreter for GrFNs in another language if the functions are stored as pickled Python functions?

Having a language-neutral representation of the functions would be helpful. For example,

def PETPT__lambda__TD_0(TMAX, TMIN):
    TD = ((0.6*TMAX)+(0.4*TMIN))
    return TD

could be represented as

(def PETPT__lambda__TD_0
        (TMAX, TMIN)
        (body (assign TD ((0.6*TMAX)+(0.4*TMIN)))
              (return TD))
)

which is a lispy way of writing this JSON value:

{ "funcdef": {"name": "PETPT__lambda__TD_0",
  "args": ["TMAX", "TMIN"],
  "body": [ "assign TD ((0.6*TMAX)+(0.4*TMIN))",
              "return TD"]
}}

The particular syntax for expressing the functions doesn't matter but I think that whatever it is, it should be something that can be interpreted by multiple languages. That way someone can come along and write a GrFN to L compiler where L is some awesome programming language.
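To make that concrete (a minimal sketch against the JSON encoding above; interpret_funcdef is a hypothetical helper, not part of delphi), an interpreter for such a funcdef could look like:

def interpret_funcdef(funcdef, *args):
    # bind the actual arguments to the declared argument names
    env = dict(zip(funcdef["args"], args))
    for stmt in funcdef["body"]:
        op, rest = stmt.split(" ", 1)
        if op == "assign":
            target, expr = rest.split(" ", 1)
            env[target] = eval(expr, {}, env)  # any expression evaluator would do here
        elif op == "return":
            return env[rest]

funcdef = {
    "name": "PETPT__lambda__TD_0",
    "args": ["TMAX", "TMIN"],
    "body": ["assign TD ((0.6*TMAX)+(0.4*TMIN))", "return TD"],
}
print(interpret_funcdef(funcdef, 100.0, 0.01))  # 60.004

An equivalent interpreter (or compiler) in Julia, or any other language, could consume the same JSON.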

skdebray commented 5 years ago

@jpfairbanks You're absolutely correct about the eval'd code's performance in this situation -- my initial misconception was just one of my un-thought-through prejudices ("All Xs are Y! Grrr!"). Once I saw the numbers and looked at the code, the reason immediately became clear.

Re: "using a pickle specific format to represent the code really commits you to being python specific" -- your point is well taken, and was raised also by @pratikbhd (also working on this project). We front-end folks are happy to provide @adarshp whatever representation allows his team to move forward quickly. But I'd be happy to discuss this further.

jpfairbanks commented 5 years ago

Yeah, I agree that "just pickle it" is really easy to get going with early, but I think it limits generalizability in the long run. I think there are functions that cannot be pickled. This post https://medium.com/@emlynoregan/serialising-all-the-functions-in-python-cd880a63b591 shows that dill cannot handle mutually recursive functions. I think this restriction also applies to libraries that use numpy/scipy, because they link against particular C/Fortran code and are generally pretty complicated.
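Even before dill's corner cases, the baseline limitation is easy to demonstrate (generic Python, not delphi code): the standard library pickle serializes functions by reference to a module-level name, so an eval-created lambda cannot be pickled at all:

import pickle

PETPT__lambda__TD_0 = eval("lambda TMAX, TMIN: ((0.6*TMAX)+(0.4*TMIN))")

try:
    pickle.dumps(PETPT__lambda__TD_0)
except (pickle.PicklingError, AttributeError) as e:
    # pickle looks the function up by its qualified name, which here is '<lambda>'
    print(e)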

Is there some document/code that describes how the Fortran-to-Python translation is done? I would like to look there to see what the intermediate representation looks like. Without reading the code, I assume it looks something like this:

  1. read fortran source code into a tree (xml, json, in memory objects)
  2. transform that tree according to some logic
  3. render a set of python code snippets by walking the tree
adarshp commented 5 years ago

Hi @jpfairbanks, thanks for pointing that out, I was unaware of those limitations of pickling. Looks like we have some thinking to do on our end about the IR. I believe the IR (as opposed to the IR') that you referred to in #132 would be what the variable outputDict points to here: https://github.com/ml4ai/delphi/blob/f9c498d03870f7a537c0f6f32bfe85bc51f74b59/delphi/program_analysis/autoTranslate/scripts/translate.py#L440

The script translate.py contains the code that transforms the XML output of the OpenFortranParser into a dict; pyTranslate then turns that dict into Python code; the Python ast module parses that code into an AST; and genPGM.py walks that AST to produce the GrFN object.
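As an illustration of just the last two stages (standard library only, no delphi APIs): the generated Python source is parsed with the ast module, and the resulting FunctionDef nodes are what genPGM.py walks to emit the GrFN dict.

import ast

generated_src = (
    "def PETPT__lambda__TD_0(TMAX, TMIN):\n"
    "    TD = ((0.6*TMAX)+(0.4*TMIN))\n"
    "    return TD\n"
)
tree = ast.parse(generated_src)
print(ast.dump(tree.body[0]))  # the FunctionDef node that genPGM.py would traverse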

An example of the pipeline can be found in the test module, test_program_analysis.py in these lines: https://github.com/ml4ai/delphi/blob/f9c498d03870f7a537c0f6f32bfe85bc51f74b59/tests/test_program_analysis.py#L22-L53

Also, while the idea of moving to a less language-specific representation does sound attractive to me, I'm hesitant to commit to it with a concrete timeline, because it would involve a non-trivial refactoring of our codebase, and I am not sure we (or, more specifically, @skdebray and @pratikbhd, who will be doing the actual heavy lifting here - I don't want to be cavalier about committing their time) have the requisite person-hours to implement this change immediately, especially given the short 18-month timeline and tight schedule of deliverables negotiated in our OT contract with DARPA. Besides, for the ASKE program we are restricting ourselves to analyzing Fortran programs, so the numpy/scipy issue is, I think, a bit beyond the scope of this iteration of the AutoMATES project.

I say 'this iteration' because I am optimistic that if all the ASKE performer groups do well on this program, it will help make the case for a four- or five-year full DARPA program with the same themes as ASKE, which would really be required to take the tools we are making from prototypes to full-fledged products, and would give us the breathing room to prioritize generalization over quick prototyping.

That being said, I do think it is important to generalize the representation, and I would like to revisit this issue down the road, so I'm going to leave this issue open.

Also, we welcome pull requests. If this language-neutral IR would really help boost the work you are doing in the ASKE program by giving you access to semantically enriched computation graphs from Fortran programs (which would make sense, since they comprise a not-insignificant amount of existing scientific code), and you believe you can implement the change in a timeframe consistent with your own schedule of deliverables, you are welcome to try, and I would be happy to help as much as possible.

jpfairbanks commented 5 years ago

Yeah I totally understand about the time. The schedule is crazy compressed.

It looks like swapping out the implementations of the following functions would let the translator generate any kind of code (this is a note to future implementers).

        self.printFn = {
            "subroutine": self.printSubroutine,
            "program": self.printProgram,
            "call": self.printCall,
            "arg": self.printArg,
            "variable": self.printVariable,
            "do": self.printDo,
            "index": self.printIndex,
            "if": self.printIf,
            "op": self.printOp,
            "literal": self.printLiteral,
            "ref": self.printRef,
            "assignment": self.printAssignment,
            "exit": self.printExit,
            "return": self.printReturn,
            "function": self.printFunction,
            "ret": self.printFuncReturn,
        }
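For instance (a generic sketch of the dispatch-table pattern, not delphi's actual printer classes or signatures), a different backend would just supply a different table of printer functions:

def print_assignment_python(node):
    return f"{node['target']} = {node['value']}"

def print_assignment_c(node):
    return f"double {node['target']} = {node['value']};"

def render(node, printers):
    # dispatch on the node tag, exactly as printFn does above
    return printers[node["tag"]](node)

node = {"tag": "assignment", "target": "TD", "value": "0.6*TMAX + 0.4*TMIN"}
print(render(node, {"assignment": print_assignment_python}))  # TD = 0.6*TMAX + 0.4*TMIN
print(render(node, {"assignment": print_assignment_c}))       # double TD = 0.6*TMAX + 0.4*TMIN;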

I don't expect to be working with Fortran programs yet. We are starting with programs written in Julia as the source language, because that lets us focus on the metamodeling, reasoning, and transformation tasks without having to solve the very difficult problem of information extraction from lower-level code that you guys are tackling. We also found the epirecipes cookbook, which gives us some "real world" models written by scientists, with explanatory text and references to the original papers.

Of course, we will be restricted with regard to established applications, which are often written in Fortran. But when we get to that bridge (in year 3 of our 5-year program 😉) we will cross it by writing a Julia backend for AutoMATES and pressing "the easy button".