Closed: slavakung closed this issue 7 years ago.
Please see the discussion on Stack Overflow:
in particular I received the response: "When you write gradsb = self.get_gradients(loss, [tf.Variable(a) for a in params]) you are defining a new tf.Variable for each a in params. Because the loss does not depend on these new variables, your gradients are None.
If you want to compute a second gradient you need to make sure that you're computing it with respect to Tensors that the objective does depend on."
and then I wrote:
"Apparently even replacing the current vector of parameters is not OK!! If I type this in the code:
grads = self.get_gradients(loss, params)
tempparam = [tf.Variable(a) for a in params]
params = [tf.add(a,a) for a in params]
gradsn = self.get_gradients(loss, params)
for a in gradsn:
    print(a)
params = [tf.Variable(a) for a in tempparam]
The result is still that None is printed!!
I know you understand what I am trying to do: at each iteration of get_updates, I would like to compute the gradients at a (slightly) different value of the parameter tensors, and use that to construct the update to the parameters for optimization and training. Is there any way to do this within the keras package?"
Perhaps it is entirely impossible to compute a gradient of the loss function with respect to its parameters at a value other than the current one; in some sense get_updates is bound to consider only the current value of the parameters. If that is not the case, please assist me in computing a second gradient tensor in my optimization code. If it is the case, then I suggest this be investigated and corrected in a future release: many optimization methods involve more than one gradient computation per iteration, and this restriction effectively blocks all such methods from being tested on training models, a serious impediment to research and practice involving Keras.
@slavakung there is a fundamental shift when you are working with the GPU. Your python code constructs a computation graph and then calls it. Please read through the Theano/TF docs as much as you can or ask on those forums as this is not a Keras issue. GPU programming is kind of like meta-programming. You are writing python code to make GPU code to do what you want.
There is only one iteration of get_updates ever. That one iteration constructs a graph of computations that is run on the GPU many times.
Keras creates a list of parameters. `get_updates` returns a mapping of changes to those parameters. That data is all built into a program that is run on the GPU.
The `loss` in your code is not a number. It is a computation graph that calculates the loss based on some inputs. If you try to calculate the gradient wrt some other inputs, then what do you expect it to do? You're basically saying "y = 3x, so why is dy/dz None?"
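As a minimal sketch of this point in current TensorFlow (using the eager `GradientTape` API rather than the graph-mode code discussed in this thread; the variable names here are illustrative):

```python
import tensorflow as tf

x = tf.Variable(5.0)
z = tf.Variable(2.0)  # a variable the loss never touches

with tf.GradientTape(persistent=True) as tape:
    y = 3.0 * x  # y is computed from x only

print(tape.gradient(y, x))  # a real gradient tensor (3.0)
print(tape.gradient(y, z))  # None -- y does not depend on z
```

The `None` here is exactly the `None` the gradients call in the optimizer was returning: the freshly created variables were simply not part of the loss graph.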
Good luck. I'd suggest reading the theano docs as some background.
Cheers
@slavakung I see what you're getting at in the last paragraph but it was the wrong approach.
`loss` is a function graph based on `param1`. You can't calculate `dloss/dparam2`. However, you can create `loss2` by replacing every instance of `param1` in the `loss1` graph with `param2`. That would result in a new loss value because the variables have been replaced. You could then calculate `dloss2/dparam2`.
This is the kind of thing you have to do for some types of momentum or whatever you're doing. If you want to take the gradient at several points, you need to calculate the loss at several points. The fastest way to do so is to create the loss graph once, and use variable replacement to calculate the loss given certain substitutions.
Check the Theano docs for `clone` using the `replace` parameter.
http://deeplearning.net/software/theano/library/
Here is an example of unrolled optimization using this kind of replacement technique.
https://github.com/bstriner/keras-adversarial/blob/master/keras_adversarial/unrolled_optimizer.py https://github.com/bstriner/keras-adversarial/blob/master/keras_adversarial/backend/theano_backend.py
Cheers
Thank you for the response. Note I am using the tensorflow backend. I get the idea that everything is a graph that is constructed and then everything is computed once a session is run. What is preventing me from constructing get_updates in such a way that, for instance, it computes gradients of loss with respect to params, then changes params with some computation, then computes the gradients again, then restores params to what it was originally? Or how would you recommend doing two evaluations just working inside the optimizer class function get_updates? Or is there no way to do so using keras, and I have to create two graphs with tensorflow and code the training and everything "manually" (directly with tensorflow instead of through the keras wrapping)? It does seem rather inefficient to create two entire graphs just to compute two gradients in each iteration of training, no?
This works well in keras but you need to use TF or theano specific code and be a little creative.
tf.contrib.graph_editor.graph_replace(f, replace) is roughly what you are looking for. I haven't tested tf as much as theano so can't really speak to specifics.
Keras layers have their weights as attributes. The weights are put into the graph as you build up your computation. By the time you are in your optimizer, the weights are already stuffed into your loss. You can't just take the gradient of that loss wrt some other parameters.
There are two approaches, depending on exactly what you are doing: substitute the new parameters into the loss and then differentiate, or differentiate first and then substitute into the gradient graph. To understand the difference:
import numpy as np
import theano
import theano.tensor as T

x = theano.shared(np.float32(7))
y1 = T.sum(x ** 3, axis=None)
g1 = T.grad(y1, x)                      # gradient at x
x2 = x * 2
y2 = theano.clone(y1, replace={x: x2})  # substitute 2x into the loss, then differentiate
g2 = T.grad(y2, x)
g3 = theano.clone(g1, replace={x: x2})  # differentiate first, then substitute 2x
_g1 = theano.function([], g1)
_g2 = theano.function([], g2)
_g3 = theano.function([], g3)
print("G1: {}, G2: {}, G3: {}".format(_g1(), _g2(), _g3()))

G1: 147.0, G2: 1176.0, G3: 588.0
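The three numbers can be checked by hand by evaluating the analytic derivatives at x = 7:

```python
# analytic derivatives of the three graphs above, evaluated at x = 7
x = 7.0
g1 = 3 * x ** 2        # d(x^3)/dx                    -> 147.0
g2 = 24 * x ** 2       # d((2x)^3)/dx = 24 x^2        -> 1176.0
g3 = 3 * (2 * x) ** 2  # the g1 formula evaluated at 2x -> 588.0
print(g1, g2, g3)
```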
Replacing `x` with `x*2` and taking the gradient wrt `x` gives 1176. This is the gradient at x=14 (namely 588) times 2, the extra factor being d(2x)/dx from the chain rule. So in your optimizer, maybe you want to calculate the gradients at some other values:
def get_updates(self, params, constraints, loss):
    newparams = [p + 1 for p in params]  # new parameters are old parameters + 1
    replace = {p: pn for p, pn in zip(params, newparams)}  # replacement dict
    grads = self.get_gradients(loss, params)  # old gradients
    newgrads = [theano.clone(g, replace=replace) for g in grads]  # gradients at the new parameters
    newupdates = [K.update(p, p + g) for p, g in zip(params, newgrads)]  # calculate updates
    return newupdates
As I mentioned, you need `graph_replace` in tensorflow, but I don't have a tensorflow install handy, so you're on your own for the tensorflow specifics.
You would have the same issue if you were using raw theano or raw tensorflow. If you want to calculate the loss at two different sets of parameters, you either have to build the loss graph twice or build it once and use variable replacement.
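For readers on a current TensorFlow version, the replace-then-differentiate idea can be sketched without `graph_replace` by recomputing the loss as a function of shifted parameters under a `GradientTape`. This is a minimal illustration, not Keras optimizer code: `loss_fn` and `w` are stand-ins, and the cubic loss mirrors the Theano example above.

```python
import tensorflow as tf

w = tf.Variable(7.0)

def loss_fn(p):
    return p ** 3  # stand-in for the model's loss as a function of one parameter

# gradient at the current value of w
with tf.GradientTape() as tape:
    loss = loss_fn(w)
g = tape.gradient(loss, w)  # 3 * w^2 = 147.0

# gradient of the loss evaluated at 2*w, taken wrt w (like g2 above)
with tf.GradientTape() as tape:
    shifted = loss_fn(2.0 * w)
g_shifted = tape.gradient(shifted, w)  # 24 * w^2 = 1176.0
```

Because the tape records the computation of `loss_fn(2.0 * w)` directly, there is no need to edit an already-built graph: the substitution happens by construction.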
Cheers
Thank you very much. Good that this replace feature exists, as otherwise it seems one would have to reconstruct the whole graph in the process of training, which would be rather excessive.
I attempted to test this as,
def get_updates(self, params, constraints, loss):
    grads = self.get_gradients(loss, params)
    tempparams = [tf.add(a, a) for a in params]
    replace = {p: npm for p, npm in zip(params, tempparams)}
    gradsn = [tf.contrib.graph_editor.graph_replace(g, replace) for g in grads]
Unfortunately I got an error:

gradsn = [tf.contrib.graph_editor.graph_replace(g, replace) for g in grads]
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/contrib/graph_editor/transform.py", line 701, in graph_replace
    control_ios=control_ios)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/contrib/graph_editor/select.py", line 557, in get_walks_intersection_ops
    control_outputs=control_outputs)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/contrib/graph_editor/select.py", line 415, in get_forward_walk_ops
    seed_ops = util.make_list_of_op(seed_ops, allow_graph=False)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/contrib/graph_editor/util.py", line 233, in make_list_of_op
    get_unique_graph(ops, check_types=check_types)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/contrib/graph_editor/util.py", line 195, in get_unique_graph
    t) for t in check_types]), type(op)))
TypeError: Expected a type in (<class 'tensorflow.python.framework.ops.Operation'>), got: <class 'tensorflow.python.ops.variables.Variable'>
It seems I am using the function as intended (https://www.tensorflow.org/api_docs/python/tf/contrib/graph_editor/graph_replace), and from the pull requests here it appears this was made to be a replacement for Theano's replace in TF.
Tensorflow has this weird distinction between ops and variables. Every variable has an `op` attribute. Try using `g.op` or something similar. You're close but probably need something small like that. Just keep debugging and it will work eventually.
Try the tensorflow github or forums for more details on getting replace to work.
More complicated example: https://github.com/bstriner/keras-adversarial/blob/master/keras_adversarial/backend/tensorflow_backend.py
If you want to start a feature request for pulling replace into the Keras backend it wouldn't be a terrible idea.
Cheers, Ben
TF variables are not part of the computation graph. Every variable is associated with a read operation that is part of the graph. So you aren't replacing variables, you're replacing the read operations associated with variables. Try `print model.trainable_weights[0]` vs `print model.trainable_weights[0].op`.
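A quick way to see this distinction is to build a variable in graph mode and inspect it. This is a minimal sketch using TF's v1-style graph mode (the variable name `w` is illustrative):

```python
import tensorflow as tf

tf.compat.v1.disable_eager_execution()  # build a graph, as in TF 1.x

v = tf.Variable(1.0, name="w")
print(type(v).__name__)     # a Variable wrapper, not a graph node itself
print(type(v.op).__name__)  # Operation: the node that actually lives in the graph
print(v.value())            # a Tensor: the read of the variable that the graph consumes
```

Graph-editing utilities operate on `Operation` and `Tensor` objects, which is why passing the `Variable` wrapper itself triggers the `TypeError` above.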
Hope that helps. TF syntax is a lot worse than theano for manipulating the graph.
Cheers
@bstriner Hi, I found you said "Use substitution to get the loss given new parameters. Then calculate the gradient of the new loss wrt the parameters." But none of the `_g1`, `_g2`, `_g3` examples shows how to substitute into the loss function and calculate the loss value at given new parameters. Could you show a toy example? Thanks.
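As a toy illustration of "substitute, then differentiate" outside Theano/TF entirely, here is the same computation in SymPy (assuming SymPy is installed; this mirrors the `y2`/`g2` example above, with x = 7):

```python
import sympy as sp

x = sp.Symbol('x')
loss = x ** 3                # the "loss" graph
loss2 = loss.subs(x, 2 * x)  # substitute new parameters (2x) into the loss
g2 = sp.diff(loss2, x)       # differentiate the substituted loss wrt x

print(loss2.subs(x, 7))  # loss at the new parameters: (2*7)**3 = 2744
print(g2.subs(x, 7))     # gradient of the new loss:   24*7**2 = 1176
```

The `subs` step plays the role of `theano.clone(..., replace=...)`: it produces a new expression with the parameter swapped, and only then is the derivative taken.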
Was anyone able to solve this problem?
I even copied the function 'clone_replace' from @bstriner 's code and utilized it like this:
grads = self.get_gradients(loss, params)
tempparams = [tf.add(a, a) for a in params]
replace = {p: npm for p, npm in zip(params, tempparams)}
gradsn = [clone_replace(g, replace) for g in grads]
This give the error:
TypeError: Expected a type in (<class 'tensorflow.python.framework.ops.Operation'>), got: <class 'tensorflow.python.ops.variables.Variable'>
If I use the `op` attribute, like
gradsn = [clone_replace(g.op, replace) for g in grads]
I get the error:
TypeError: Expected a type in (<class 'tensorflow.python.framework.ops.Tensor'>), got: <class 'tensorflow.python.framework.ops.Operation'>
Utterly confusing...
Hello,
I am a researcher in optimization and I am trying to write a custom optimizer. I have come across a problem.
Take any optimizer code, say just copy SGD. In the beginning of get_updates, you see

grads = self.get_gradients(loss, params)

Now add the following line right after this one:

gradsb = self.get_gradients(loss, [tf.Variable(a) for a in params])

This should compute the gradients at a new tensor, with all the values the same as before. Now try to see what you get:

for a in gradsb:
    print(a)

You get a list of Nones (but if you print the list grads you see that they are still Tensors).
Why? And how to circumvent this problem? This is important as I'd like to compute the gradients at another point for my algorithm.
On a perhaps related level, when is batch_size used? get_gradients just goes to the gradients backend. So in some sense does the keras wrapper modify the loss function to just include the batch at the current iteration? Where is this done? I would be interested in adaptively modifying the batch_size in the optimizer, eventually.