sradc / SmallPebble

Minimal deep learning library written from scratch in Python, using NumPy/CuPy.
Apache License 2.0

Why enclose a Variable inside a lambda instead of using the variable only? #3

Closed. Islam0mar closed this issue 3 years ago.

Islam0mar commented 3 years ago

First of all thanks a lot for your amazing autodiff post.

For Nth order derivatives you wrote: "2. To change local_gradients to contain functions, instead of local gradient values". I think that storing only a Variable holding the weight/local gradient would be sufficient, since the lambda function (multiply_by_locgrad) only multiplies the path value by a constant factor; get_gradients would then use path_value * local_gradient, where both are of Variable type.

Side note: I have combined that with a tape/array for storing variables to save memory, but the combinatorial explosion in the number of possible paths through the graph causes too much allocation, especially for Nth derivatives.
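
Roughly, the tape idea is something like the following first-order sketch (the names Var, mul and backward are purely illustrative, not SmallPebble's API or the code I actually used):

# Illustrative first-order tape: the forward pass appends records to a flat
# list; the backward pass walks the list in reverse, accumulating gradients.
tape = []

class Var:
    _next_id = 0
    def __init__(self, value):
        self.value = value
        self.id = Var._next_id
        Var._next_id += 1

def mul(a, b):
    out = Var(a.value * b.value)
    # record: (output id, [(input id, local gradient value), ...])
    tape.append((out.id, [(a.id, b.value), (b.id, a.value)]))
    return out

def backward(output):
    grads = {output.id: 1.0}
    for out_id, inputs in reversed(tape):
        g = grads.get(out_id, 0.0)
        for in_id, local_grad in inputs:
            grads[in_id] = grads.get(in_id, 0.0) + g * local_grad
    return grads

x, y = Var(2.0), Var(3.0)
z = mul(x, y)
print(backward(z)[x.id])  # 3.0, i.e. dz/dx == y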

sradc commented 3 years ago

Hey, nice to hear from you; glad the post was helpful.

The lambda is used because local derivatives are being computed using Variable operations. It adds laziness, which prevents infinite recursion. For example:

def mul(a, b):
    value = a.value * b.value
    local_gradients = (
        (a, lambda path_value: path_value * b),
        # ^ the * operator calls this very function, `mul`;
        # the lambda delays that call, preventing infinite recursion / stack overflow
        (b, lambda path_value: path_value * a)
    )
    return Variable(value, local_gradients)

mul uses mul to compute its local gradient, which in turn uses mul to compute its local gradient, and so on. This would recurse until stack overflow.

You can always give it a try without the lambda, to see what happens.


Nice that you added a gradient tape! That's what the lower-level libraries do. As for saving memory, variables in the article aren't duplicated, but referenced, so memory usage isn't too bad.

The combinatorial explosion is a shame :D. Originally I had Nth order derivatives in this repo (SmallPebble), but it made things too slow, so I removed it.

Islam0mar commented 3 years ago

I mean using the same old definition, only changing the type of the stored local gradient to a Variable:

def mul(a, b):
    value = a.value * b.value
    local_gradients = (
        (a, b),
        # Now path_value * b is computed eagerly in get_gradients, as a Variable
        (b, a)
    )
    return Variable(value, local_gradients)

def compute_gradients(variable, path_value):
    for child_variable, local_gradient in variable.local_gradients:
        # "Multiply the edges of a path":
        value_of_path_to_child = path_value * local_gradient
        # "Add together the different paths":
        gradients[child_variable] += value_of_path_to_child
        # recurse through graph:
        compute_gradients(child_variable, value_of_path_to_child)
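
To make this concrete, here is a minimal self-contained sketch of the eager, Variable-valued local-gradient scheme (class and function names follow the blog post; it's an illustration, not SmallPebble's code):

# Illustrative sketch: Variable/get_gradients as in the blog post, but with
# local gradients stored eagerly as Variables instead of lambdas.
from collections import defaultdict

class Variable:
    def __init__(self, value, local_gradients=()):
        self.value = value
        self.local_gradients = local_gradients

    def __add__(self, other):
        return add(self, other)

    def __mul__(self, other):
        return mul(self, other)

def add(a, b):
    value = a.value + b.value
    local_gradients = ((a, Variable(1)), (b, Variable(1)))
    return Variable(value, local_gradients)

def mul(a, b):
    value = a.value * b.value
    local_gradients = ((a, b), (b, a))
    return Variable(value, local_gradients)

def get_gradients(variable):
    gradients = defaultdict(lambda: Variable(0))

    def compute_gradients(variable, path_value):
        for child_variable, local_gradient in variable.local_gradients:
            # path_value * local_gradient is itself a Variable,
            # so the result can be differentiated again.
            value_of_path_to_child = path_value * local_gradient
            gradients[child_variable] += value_of_path_to_child
            compute_gradients(child_variable, value_of_path_to_child)

    compute_gradients(variable, path_value=Variable(1))
    return gradients

x, y = Variable(2.0), Variable(3.0)
first = get_gradients(x * y)
print(first[x].value)                    # 3.0, i.e. d(x*y)/dx == y
print(get_gradients(first[x])[y].value)  # 1.0, i.e. d2(x*y)/(dx dy)

Because each accumulated gradient is itself a Variable with its own graph, calling get_gradients on it again yields higher-order derivatives, which is what this lambda-free scheme is aiming for.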

I am searching for ways to factor paths, as the Factoring Paths section mentions. I can think of optimizing x*x to x.val*x.val with the single gradient x.grads = ((x, var(2)*x),), and the same for +, but I don't know any more general rule without symbolic manipulation.

Thanks for your help.

sradc commented 3 years ago

Ah, I could have picked a better example than mul; you are correct that we don't need lambdas for mul to work.

Here's div:

def div(a, b):
    value = a.value / b.value
    local_gradients = (
        (a, lambda path_value: path_value * ONE/b),
        # ^ ONE/b calls div
        (b, lambda path_value: path_value * NEG_ONE * a/(b*b))
    )
    return Variable(value, local_gradients)

Here the weights are computed using div. So if we compute them immediately, we get infinite recursion.
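
To make that concrete, the eager version would look like this (illustrative only; ONE, NEG_ONE and Variable are the constants/class from the post), and constructing any quotient then recurses forever:

# Illustrative only: eager (non-lambda) local gradients for div.
def div(a, b):
    value = a.value / b.value
    local_gradients = (
        (a, ONE / b),                # ONE / b calls div again...
        (b, NEG_ONE * a / (b * b)),  # ...which builds ONE / b again, and so on
    )
    return Variable(value, local_gradients)

Calling this div on anything never returns; Python hits a RecursionError while building the graph.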


Thanks for sharing the Factoring Paths link, that's a great post.

Note that we do bypass the combinatorial explosion of paths (discussed here in the post), simply by using reverse-mode autodiff.

[“Forward-mode differentiation” and “reverse-mode differentiation” are] algorithms for efficiently computing the sum by factoring the paths. Instead of summing over all of the paths explicitly, they compute the same sum more efficiently by merging paths back together at every node.

What you are attempting is a step even further than this. At first glance, I'm not sure you will see enormous benefits from doing it. You could, for example, define a pow operator to make backprop of operations like a*a*a*a*a more efficient.
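
For instance, a pow operation in the style of the post (a sketch, not SmallPebble's actual implementation; n is assumed to be a plain Python integer >= 1) could collapse the chain of mul nodes into a single node whose local gradient is n * a**(n-1):

def power(a, n):
    # One graph node for a**n; the lambda keeps the local gradient lazy, as with mul/div.
    value = a.value ** n
    local_gradients = (
        (a, lambda path_value: path_value * Variable(n, ()) * power(a, n - 1)),
    )
    return Variable(value, local_gradients)

With this, a*a*a*a*a becomes power(a, 5): one node and one gradient path back to a, instead of four mul nodes.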

Islam0mar commented 3 years ago

Alright, I see that, but you could make a new Variable for any recursive case, like this: (a, Variable(1/b.value, ())) and (b, Variable(-a.value / b.value / b.value, ())) (I don't know whether that would miss some partial derivatives, or whether the grads shouldn't be empty!). I have ported this to Common Lisp, and a lambda (capturing a variable) was more expensive than one new variable.


Aha, I got it. I was thinking of only one node. Thanks.

sradc commented 3 years ago

I think that creating new Variables will lead to errors in the higher order gradients, because these are computed using the local gradients of the weights.

If you are planning to do gradient-based optimisation, one way you could keep things fast is to only compute first derivatives, and estimate higher-order derivatives numerically (this is the intuition behind optimisation methods such as Adam).
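
As a toy illustration of the numerical route (hypothetical helper, not SmallPebble code; note that Adam itself tracks running moments of the gradient rather than true second derivatives):

def estimate_second_derivative(grad_f, x, eps=1e-5):
    # Central difference applied to a first-derivative function grad_f: R -> R.
    return (grad_f(x + eps) - grad_f(x - eps)) / (2 * eps)

# Example: f(x) = x**3, so f'(x) = 3*x**2 and f''(x) = 6*x.
print(estimate_second_derivative(lambda x: 3 * x**2, 2.0))  # ~12.0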

Cool that you are implementing this in Lisp!

Islam0mar commented 3 years ago

Thanks for your time. I will look further into these.