bartvm opened this issue 6 years ago
If a free variable is undefined, it must be defined exactly once after its capture. If a variable is defined, it cannot be redefined after its capture.
I would also support the simpler "a variable which is captured can only be defined once."
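For illustration, a hypothetical snippet showing what the stricter rule would accept and reject:

def h():
    def g():
        return x  # x is captured while still undefined
    x = 1         # fine: defined exactly once after capture
    y = g()
    x = 2         # rejected under the rule: redefinition after capture
    return y + g()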
Regarding closure conversion:
def f(env, g):
    return g(env)

def g1(env):
    return env.x

def g2(env):
    return env.x + env.y

def h(c, x, y):
    env = makeenv(x=x, y=y)
    return f(env, g1) + f(env, g2)
That's not typically how closure conversion works. You are supposed to pack the environment and the function together:
def f(g):
    return g()  # the runtime interprets this as get_function(g)(get_env(g))

def g1(env):
    return env.x

def g2(env):
    return env.x + env.y

def h(c, x, y):
    env = makeenv(x=x, y=y)
    g1c = makeclosure(g1, env)
    g2c = makeclosure(g2, env)
    return f(g1c) + f(g2c)
It might be a good thing to do closure conversion before doing automatic differentiation: each function's environment is now made explicit, and y = f(env, x) will have a derivative denv, dx = df(dy), so there's a nice symmetry to it.
Well, in general you're not going to call y = f(env, x), you're going to call y = get_closure(f)(get_env(f), x), so you need to derive through get_closure and get_env (and make_closure etc.). I don't think it's necessarily a nicer symmetry than gf, gx = df(gy) either.
However, I think it's worthwhile to optimize f and df jointly before closure converting the generated df.
If there are substantial benefits to having direct pointers to free variables as in Thorin (and I've expressed mild skepticism about that), you probably don't want to optimize some half-closure-converted representation.
My suggestion would be to optimize, then closure convert, then AD, then optimize again, then closure convert again. Presumably the first optimization pass would have exploited any optimization opportunities in closures in the primals.
I do think it might be important for backpropagators to have direct pointers to the intermediate variables in the primals. It allows you to optimize the forward and backward pass together as a single graph.
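For instance (a hypothetical sketch, not from the discussion above): the backpropagator below holds a direct pointer to the primal's intermediate t, so an optimizer that sees both passes as one graph can simplify across them.

from math import exp

def f(x):
    t = exp(x)         # intermediate in the primal
    def df(dy):
        return t * dy  # direct pointer to t: d(exp(x))/dx = exp(x)
    return t, df

If t were only reachable through an opaque environment, that kind of cross-pass simplification would be blocked.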
That's not typically how closure conversion works. You are supposed to pack the environment and the function together.
Sure, that's the definition of a closure. My notation wasn't very clear here--I was trying to make it explicit at which points the environment gets constructed, mutated, and passed around without muddling it with packing and unpacking notation--but I don't believe it changes anything about the implementation details I summarized.
(As a sidenote, you don't need to do the packing and unpacking if you are calling a function in the scope it was defined in.)
I would also support the simpler "a variable which is captured can only be defined once."
I think that would actually complicate the implementation rather than simplify it, because doing name resolution by accessing the latest definition in scope is the default. Backtracking to see whether there was more than one definition at the point a variable gets captured is just extra work.
Well, in general you're not going to call y = f(env, x), you're going to call y = get_closure(f)(get_env(f), x), so you need to derive through get_closure and get_env (and make_closure etc.) I don't think it's necessarily a nicer symmetry than gf, gx = df(gy) either.
The get_closure and get_env adjoints seem trivial; it's just packing and unpacking stuff.

Intuitively I like y = f(env, x) / denv, dx = df(dy) because there is less magic going on: denv is simply the derivative of an argument like any other. When you use y = f(x) and denv, dx = df(dy), there is a certain kind of asymmetry that bugs me. Although f(x) can read from its environment, df(dy) cannot write to it. So whereas f(x) accesses things implicitly, we have to make df(dy)'s modification of the environment explicit by constructing denv. From there on we make use of the fact that a function, as a non-numerical object, doesn't have a derivative, so we can use its partials as a way to smuggle environment sensitivities back to the adjoint of the closure definition.

I don't know how much this matters in practice, possibly little, but I do wonder if AD on closure-converted code isn't more natural because there are more invariants, i.e. type(env) == type(denv), whereas type(denv) != type(f).
Here's a worked out example.
def f(g):
    return g()

def h(x):
    def g():
        return x
    y = f(g)
    return y
This will be closure converted to something like this.
def g(env):
    y = env.x
    return y

def h(x):
    env = make_env()
    env.x = x
    g_ = (g, env)  # a closure
    y = f(g_)
    return y

def f(g):
    y = g[0](g[1])  # calling a closure
    return y
Now the backpropagator approach seems natural to me, because the environment sensitivities have just become the gradients of the environments, which were passed as explicit arguments. This means you don't really need to treat them in any special way.
def g(env):
    y = env.x
    def dg(dy):
        denv = make_env()
        denv.x = dy  # adjoint of y = env.x
        return denv
    return y, dg

def h(x):
    env = make_env()
    env.x = x
    g_ = (g, env)
    y, df = f(g_)
    def dh(dy):
        dg_ = df(dy)    # normal adjoint of a function call
        dg, denv = dg_  # adjoint of make_closure
        dx = denv.x     # adjoint of writing x into the environment
        return dx
    return y, dh

def f(g):
    y, dg = g[0](g[1])
    def df(dy):
        denv = dg(dy)    # normal adjoint of the environment argument
        dg_ = (0, denv)  # non-numeric objects always have gradient 0
        return dg_
    return y, df
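As a sanity check, the example above actually runs if we stand in types.SimpleNamespace for make_env (a hypothetical driver, not part of the example itself):

from types import SimpleNamespace as make_env  # stand-in for make_env

y, dh = h(3.0)
assert y == 3.0        # h is the identity on x,
assert dh(1.0) == 1.0  # so dh/dx == 1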
My suggestion would be to optimize, then closure convert, then AD, then optimize again, then closure convert again. Presumably the first optimization pass would have exploited any optimization opportunities in closures in the primals.
But there may be optimization opportunities between the AD-generated code and free variables, e.g.
def f(x):
    a = x * x
    def g():
        return a * log(x)
    return g()
dg/dx is a/x. With a direct pointer to a, we can easily see that this is (x*x)/x = x, but after closure conversion of g, a is opaque -- we don't see what it is. So we would miss that optimization.
I think that would actually complicate the implementation rather than simplify it, because doing name resolution by accessing the latest definition in scope is the default. Backtracking to see whether there was more than one definition at the point a variable gets captured is just extra work.
Ah, okay, that makes sense. If that's simpler, then let's do that.
I don't know how much this matters in practice, possibly little, but I do wonder if AD on closure-converted code isn't more natural because there are more invariants, i.e. type(env) == type(denv), whereas type(denv) != type(f).
I think closure conversion is actually mildly problematic from a type system point of view, because the environment's type isn't supposed to be part of the closure's type: it's not supposed to leak. If you have a HOF F(f), all that matters are the types of f's inputs and outputs. The environment of f could be literally anything. So if you do closure conversion, generally speaking, the type of closure_env(f) is going to be top/any. closure_env is not typable in the general case and this could be a problem (it could limit expressiveness).

That's why when you closure-convert in a typed language you'd lift functions and insert make_closure calls, but you wouldn't change call sites from g(x) to g[0](g[1], x), because the type system will choke on this -- you either leave them as they are (the runtime can deal with closures on its own) or have something like run_closure(g, x).
This might be a reason why the grad transform is difficult to integrate in a type system, since it is potentially equivalent to having a closure_env primitive. So we may very well have that problem anyway.
(Also, nothing actually stops type(denv) == type(f) from being true. At some point in master I did exactly that. The semantics are a bit dubious, so I wouldn't necessarily recommend it, but it works just fine.)
Actually, I suppose an explicit closure could be typed as (e, e -> a -> b), but it does require e to be a type variable in several situations.
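A minimal sketch of that typing in Python annotations (all names here are hypothetical, not Myia API). Note how the environment type E shows up in the closure's type, which is exactly the leak described above unless E stays a quantified type variable:

from typing import Callable, Generic, TypeVar

E = TypeVar("E")  # environment type: must remain a type variable
A = TypeVar("A")
B = TypeVar("B")

class Closure(Generic[E, A, B]):
    """An explicit closure typed as (e, e -> a -> b)."""
    def __init__(self, env: E, fn: Callable[[E, A], B]):
        self.env = env
        self.fn = fn

def run_closure(c: Closure[E, A, B], x: A) -> B:
    return c.fn(c.env, x)  # E never escapes into the caller's types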
But there may be optimization opportunities between the AD-generated code and free variables [...]
That's a good example.
Actually, I suppose an explicit closure could be typed as (e, e -> a -> b), but it does require e to be a type variable in several situations.
The environment sensitivities returned by a backpropagator must have a type, no? The symmetry I am thinking about is that the type of f would be env -> a -> b, which makes the backpropagator have type b -> (a, env). If you make the closure transparent then you have type a -> b for the primal and b -> (a, ?) for the backpropagator, so we need to have an explicit type for environments anyway.
(Also, nothing actually stops type(denv) == type(f) from being true. At some point in master I did exactly that. The semantics are a bit dubious, so I wouldn't necessarily recommend it, but it works just fine.)
Letting type(denv) == type(f) seems dubious indeed, because now you have a -> b for the primal, and b -> (a, (a -> b)) for the backpropagator...
So here's another way of thinking about it, which maybe clarifies my thinking a bit: If primals read from their environment, adjoints should really be writing to an adjoint environment. For example, take this function
def f():
    return x ** 2  # x is a free variable read from the environment
then its transformation looks like
def f():
    def df(dy):
        dx = x * 2 * dy  # dx is an environment sensitivity
    return x ** 2, df
and if you write this out more explicitly, it looks something like this, with line by line adjoints whose types work out
env: Env = make_env()
env.x = x  # : float
f_: Callable[[], float] = make_closure(f, env)
y, df_ = f_()  # : Tuple[float, Callable[[float], None]]
...
df_(dy)  # no return value
df, denv = unmake_closure(df_)  # : Tuple[Callable[[float], None], Env]
dx: float = denv.dx
denv = make_env()  # type-specific zero gradient
Although this is nicely symmetric, it shows why this transformation is awkward: The adjoint of reading from an environment is writing to one, but that requires an impure function.
The backpropagator approach in the paper sidesteps this problem by making the environment sensitivities explicit return values. However, this means exposing them to the type system and higher-order gradients, which prompted my question: Why not just make environments of the primals explicit as well i.e. do closure conversion? But I agree that this is not ideal e.g. for optimization.
A third option would be a representation in which the backpropagator can explicitly write to the scope of other functions, like in Python with the nonlocal keyword. I don't think this is really possible though, since you would have to write into the scope of the backpropagator of the function that your primal's free variables were bound in, which can only be determined at runtime.
def g(x):
    def f():
        def df(dy):
            nonlocal dx  # should write into the scope of dg somehow
            dx += x * 2 * dy
        return x ** 2, df
    y, df = f()
    def dg(dy):
        dx = 0
        df(dy)
        return dx
    return y, dg
So I guess that if we don't want to do closure conversion before AD, the only option is to accept the slightly awkward asymmetry and return environments explicitly (explicitly typed as an environment, with an accompanying type map that we can use for type inference).
Yeah, I see what you mean, and the gradient transform's certainly going to be simpler if it doesn't need all the special machinery to deal with free variables. This being said, I think I may already have a working implementation of Grad in #22 that uses our representation directly, so unless we determine the advantages of pointing to free variables directly are not worth it, I say we just proceed as planned, but keep this option in mind if we need to reevaluate later.
I did a bit of research on environments, closures, lambda lifting, etc. etc. for Myia and thought I would write a quick summary. I'll go over some basic definitions to ground the discussion, even though people might be familiar with them already:
Static vs runtime name resolution
Python's name resolution is done at runtime, but not for the local scope.
It's worth noting that the Python documentation says:
Note that closures are always runtime data structures. The runtime aspect of dynamic name resolution refers specifically to the fact that the namespace can be manipulated at runtime, and that no compile-time errors will be thrown for undefined variables. The actual name resolution rules can be the same for dynamic and static resolution.
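For instance, the local-scope exception (an illustrative snippet):

x = 1
def f():
    print(x)  # raises UnboundLocalError: the assignment below makes x
    x = 2     # local to f, and that classification happens at compile time
f()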
Forward references
Python's support for forward references comes from the fact that it does runtime name resolution, but the two are separate i.e. forward references can be supported in static name resolution (although it generally requires backtracking/multiple passes).
Referencing names or values
Forward references make some situations ambiguous, because we need to decide whether to capture values at the point of definition or at the call site.
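A small example of the ambiguity (reconstructing the x = 1 / x = 2 case discussed below):

x = 1
def g():
    return x  # which x does this capture?
x = 2
print(g())  # Python prints 2 (latest binding); capturing at definition would give 1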
If your intermediate representation references variables by name, it is possible to deal with this ambiguity, i.e. one can say "return the variable named x", and the variable x can change between calls to g. However, Myia references variables by value. That is, return x is parsed to a graph with a direct edge to the variable x in its enclosing scope; we have to draw an edge to either x = 1 or x = 2.

The advantage of having these direct edges to the outer graph is that it simplifies optimization, because the outer and inner scope can be optimized jointly (which is particularly important in the case of backpropagators). The downside is that we cannot apply the same name resolution rules as Python. To avoid discrepancies between the two approaches, the following rule can be enforced: if a free variable is undefined, it must be defined exactly once after its capture; if a variable is defined, it cannot be redefined after its capture.
Lambda lifting vs. closure conversion
Our intermediate representation can probably be executed directly using something like a spaghetti stack (as the current VM does). However, for code generation we will probably want to do lambda lifting and/or closure conversion at some point during the lowering process.
Lambda lifting involves the removal of free variables by turning them into parameters. This requires changing the arguments passed at the call site. The problem with lambda lifting is that it cannot generally be done when you have higher-order functions.
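A small illustrative sketch (hypothetical names):

# Before lambda lifting: x is free in g.
def h(x):
    def g(y):
        return x + y
    return g(1)

# After lambda lifting: x becomes a parameter and every call site of g
# must be rewritten; that is impossible when g escapes into a
# higher-order function whose call sites we don't control.
def g_lifted(x, y):
    return x + y

def h_lifted(x):
    return g_lifted(x, 1)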
Closure conversion removes free variables by packing them into environments that the function is called with. These are runtime data structures.
Implementing closures
If the nested code is executed directly, closures only need to be created when a function is returned. However, if the goal is to make every function top-level, all nested functions must have their free variables removed and turned into environments.
There are roughly three methods for creating environments for closures. With each of these approaches you can make the environments mutable or immutable. A mutable environment means that a closure will be called with the environment as it is at call time (like Python), whereas an immutable environment means that a closure will be called with the environment as it was at definition time.
1. Copy
Each time a closure is created we create a new environment (heap allocated) using the current locals (stack allocated).
The downside of this approach is that local variables need to be copied into the environment. The upside is that if the makeenv call is not reached, no environment will be constructed.
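For instance (illustrative, reusing the makeenv/makeclosure names from the examples above):

def h(x, y):
    if x > 0:
        env = makeenv(x=x, y=y)      # locals copied into a heap env here
        return makeclosure(g1, env)  # closure over the fresh env
    return None                      # this branch never allocates an env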
2.a. Rewrite with shared environments
The rewriting approach rewrites functions that might create closures to store and load local variables in an environment which can then be used directly to create a closure.
The shared-environment approach means that each environment contains only the variables created in that function's scope, plus a pointer to its parent environment. This saves memory, but might require a bit of pointer hopping to resolve environment variables from outer scopes at runtime.
2.b. Rewrite with flat environments
Flat environments optimize lookup performance by making sure that all free variables are stored directly in the environment. The downside is that this costs more memory, because the environment needs to be copied again for each nested function.
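A hypothetical sketch contrasting the two rewrites, with dicts standing in for environment records:

# For nested scopes f(a) -> g(b) -> h() returning a + b:

# Shared environments: each env holds its own locals plus a parent
# pointer; resolving a from h takes two hops.
f_env = {"a": 1, "parent": None}
g_env = {"b": 2, "parent": f_env}
a = g_env["parent"]["a"]

# Flat environments: every free variable is copied into h's env;
# lookup is direct, at the cost of copying per nested function.
h_env = {"a": f_env["a"], "b": g_env["b"]}
a = h_env["a"]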
Corner cases
A few noteworthy corner cases are recursion and mutual recursion. If we use flat environments, we have to make sure that we don't create new ones each time, but instead re-use the original ones the function was called with. For mutual recursion this gets trickier: All the functions involved in the mutual recursion must share a single flat environment.
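An illustrative sketch of the mutual-recursion case (hypothetical, again with a dict standing in for the single shared flat environment):

shared = {}  # one flat environment for the whole recursive group

def even(n):
    return True if n == 0 else shared["odd"](n - 1)

def odd(n):
    return False if n == 0 else shared["even"](n - 1)

shared["even"], shared["odd"] = even, odd
assert even(4) and odd(3)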
When using shared environments, the closure environments simply become cyclical for any form of recursion (i.e. f inside of f is a closure with the environment that f was first called with).

Random thoughts
It might be a good thing to do closure conversion before doing automatic differentiation: each function's environment is now made explicit, and y = f(env, x) will have a derivative denv, dx = df(dy), so there's a nice symmetry to it. However, I think it's worthwhile to optimize f and df jointly before closure converting the generated df.

Globals can be handled as global_env = env.globals; sum = global_env.sum, i.e. one variable in the environment will be another environment containing the globals (so accessing Python globals will require two lookups).