Closed: slavakung closed this issue 7 years ago.
Please see the discussion on Stack Overflow:
in particular I received the response: "When you write gradsb = self.get_gradients(loss, [tf.Variable(a) for a in params]) you are defining a new tf.Variable for each a in params. Because the loss does not depend on these new variables, your gradients are None.
If you want to compute a second gradient you need to make sure that you're computing it with respect to Tensors that the objective does depend on."
and then I wrote:
"Apparently even replacing the current vector of parameters is not OK!! If I type this in the code:
grads = self.get_gradients(loss, params)
tempparam = [tf.Variable(a) for a in params]
params = [tf.add(a,a) for a in params]
gradsn = self.get_gradients(loss, params)
for a in gradsn:
    print(a)
params = [tf.Variable(a) for a in tempparam]
The result is still that None is printed!!
I know you understand what I am trying to do: at each iteration of get_updates, I would like to compute the gradients at a (slightly) different value of the parameter tensors, and use that to construct the update to the parameters for optimization and training. Is there any way to do this within the keras package?"
Perhaps it is entirely impossible to compute a gradient of the loss function with respect to its parameters at a value other than the current one; in some sense get_updates is bound to consider only the current value of the parameters. If that is not the case, please assist me in computing a second gradient tensor in my optimization code. If it is the case, then I suggest this be investigated and corrected in a future release: many optimization methods involve more than one gradient computation per iteration, and this restriction effectively blocks all such methods from being tested on training models, a serious impediment to research and practice involving Keras.
@slavakung there is a fundamental shift when you are working with the GPU. Your python code constructs a computation graph and then calls it. Please read through the Theano/TF docs as much as you can or ask on those forums as this is not a Keras issue. GPU programming is kind of like meta-programming. You are writing python code to make GPU code to do what you want.
There is only one iteration of get_updates ever. That one iteration constructs a graph of computations that is run on the GPU many times.
Keras creates a list of parameters. `get_updates` returns a mapping of changes to those parameters. That data is all built into a program that is run on the GPU.
The `loss` in your code is not a number. It is a computation graph that calculates the loss based on some inputs. If you try to calculate the gradient wrt some other inputs, then what do you expect it to do? You're basically saying "y = 3x, so why is dy/dz None?"
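As a minimal sketch of this point in current TensorFlow (using the eager `GradientTape` API rather than the graph-mode code discussed in this thread; the variable names here are illustrative):

```python
import tensorflow as tf

x = tf.Variable(5.0)
z = tf.Variable(2.0)  # a variable the loss never touches

with tf.GradientTape(persistent=True) as tape:
    y = 3.0 * x  # y is computed from x only

print(tape.gradient(y, x))  # a real gradient tensor (3.0)
print(tape.gradient(y, z))  # None -- y does not depend on z
```

The `None` here is exactly the `None` the gradients call in the optimizer was returning: the freshly created variables were simply not part of the loss graph.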
Good luck. I'd suggest reading the theano docs as some background.
Cheers
@slavakung I see what you're getting at in the last paragraph but it was the wrong approach.
`loss` is a function graph based on `param1`. You can't calculate `dloss/dparam2`. However, you can create `loss2` by replacing every instance of `param1` in the `loss1` graph with `param2`. That would result in a new loss value because the variables have been replaced. You could then calculate `dloss2/dparam2`.
This is the kind of thing you have to do for some types of momentum or whatever you're doing. If you want to take the gradient at several points, you need to calculate the loss at several points. The fastest way to do so is to create the loss graph once, and use variable replacement to calculate the loss given certain substitutions.
Check the Theano docs for `clone` using the `replace` parameter.
http://deeplearning.net/software/theano/library/
Here is an example of unrolled optimization using this kind of replacement technique.
https://github.com/bstriner/keras-adversarial/blob/master/keras_adversarial/unrolled_optimizer.py https://github.com/bstriner/keras-adversarial/blob/master/keras_adversarial/backend/theano_backend.py
Cheers
Thank you for the response. Note I am using the tensorflow backend. I get the idea that everything is a graph that is constructed and then everything is computed once a session is run. What is preventing me from constructing get_updates in such a way that, for instance, it computes gradients of loss with respect to params, then changes params with some computation, then computes the gradients again, then restores params to what it was originally? Or how would you recommend doing two evaluations just working inside the optimizer class function get_updates? Or is there no way to do so using keras, and I have to create two graphs with tensorflow and code the training and everything "manually" (directly with tensorflow instead of through the keras wrapping)? It does seem rather inefficient to create two entire graphs just to compute two gradients in each iteration of training, no?
This works well in keras but you need to use TF or theano specific code and be a little creative.
tf.contrib.graph_editor.graph_replace(f, replace) is roughly what you are looking for. I haven't tested tf as much as theano so can't really speak to specifics.
Keras layers have their weights as attributes. The weights are put into the graph as you build up your computation. By the time you are in your optimizer, the weights are already stuffed into your loss. You can't just take the gradient of that loss wrt some other parameters.
There are two approaches, depending on exactly what you are doing: substitute the new parameters into the loss and then differentiate, or differentiate first and then substitute into the gradient graph. To understand the difference:
import numpy as np
import theano
import theano.tensor as T

x = theano.shared(np.float32(7))
y1 = T.sum(x ** 3, axis=None)
g1 = T.grad(y1, x)                      # gradient at x
x2 = x * 2
y2 = theano.clone(y1, replace={x: x2})  # substitute 2x into the loss, then differentiate
g2 = T.grad(y2, x)
g3 = theano.clone(g1, replace={x: x2})  # differentiate first, then substitute 2x
_g1 = theano.function([], g1)
_g2 = theano.function([], g2)
_g3 = theano.function([], g3)
print("G1: {}, G2: {}, G3: {}".format(_g1(), _g2(), _g3()))

G1: 147.0, G2: 1176.0, G3: 588.0
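The three numbers can be checked by hand by evaluating the analytic derivatives at x = 7:

```python
# analytic derivatives of the three graphs above, evaluated at x = 7
x = 7.0
g1 = 3 * x ** 2        # d(x^3)/dx                    -> 147.0
g2 = 24 * x ** 2       # d((2x)^3)/dx = 24 x^2        -> 1176.0
g3 = 3 * (2 * x) ** 2  # the g1 formula evaluated at 2x -> 588.0
print(g1, g2, g3)
```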
Replacing `x` with `x*2` and taking the gradient wrt `x` gives 1176. This is the gradient at x=14 (namely 588) times 2, the extra factor being d(2x)/dx from the chain rule. So in your optimizer, maybe you want to calculate the gradients at some other values:
def get_updates(self, params, constraints, loss):
    newparams = [p + 1 for p in params]  # new parameters are old parameters + 1
    replace = {p: pn for p, pn in zip(params, newparams)}  # replacement dict
    grads = self.get_gradients(loss, params)  # old gradients
    newgrads = [theano.clone(g, replace=replace) for g in grads]  # gradients at the new parameters
    newupdates = [K.update(p, p + g) for p, g in zip(params, newgrads)]  # calculate updates
    return newupdates
As I mentioned, you need `graph_replace` in tensorflow, but I don't have a tensorflow install handy, so you're on your own for the tensorflow specifics.
You would have the same issue if you were using raw theano or raw tensorflow. If you want to calculate the loss at two different sets of parameters, you either have to build the loss graph twice or build it once and use variable replacement.
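For readers on a current TensorFlow version, the replace-then-differentiate idea can be sketched without `graph_replace` by recomputing the loss as a function of shifted parameters under a `GradientTape`. This is a minimal illustration, not Keras optimizer code: `loss_fn` and `w` are stand-ins, and the cubic loss mirrors the Theano example above.

```python
import tensorflow as tf

w = tf.Variable(7.0)

def loss_fn(p):
    return p ** 3  # stand-in for the model's loss as a function of one parameter

# gradient at the current value of w
with tf.GradientTape() as tape:
    loss = loss_fn(w)
g = tape.gradient(loss, w)  # 3 * w^2 = 147.0

# gradient of the loss evaluated at 2*w, taken wrt w (like g2 above)
with tf.GradientTape() as tape:
    shifted = loss_fn(2.0 * w)
g_shifted = tape.gradient(shifted, w)  # 24 * w^2 = 1176.0
```

Because the tape records the computation of `loss_fn(2.0 * w)` directly, there is no need to edit an already-built graph: the substitution happens by construction.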
Cheers
Thank you very much. Good that this replace feature exists, as otherwise it seems one would have to reconstruct the whole graph in the process of training, which would be rather excessive.
I attempted to test this as,
def get_updates(self, params, constraints, loss):
    grads = self.get_gradients(loss, params)
    tempparams = [tf.add(a, a) for a in params]
    replace = {p: npm for p, npm in zip(params, tempparams)}
    gradsn = [tf.contrib.graph_editor.graph_replace(g, replace) for g in grads]
Unfortunately I got an error:

gradsn = [tf.contrib.graph_editor.graph_replace(g, replace) for g in grads]
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/contrib/graph_editor/transform.py", line 701, in graph_replace
    control_ios=control_ios)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/contrib/graph_editor/select.py", line 557, in get_walks_intersection_ops
    control_outputs=control_outputs)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/contrib/graph_editor/select.py", line 415, in get_forward_walk_ops
    seed_ops = util.make_list_of_op(seed_ops, allow_graph=False)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/contrib/graph_editor/util.py", line 233, in make_list_of_op
    get_unique_graph(ops, check_types=check_types)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/contrib/graph_editor/util.py", line 195, in get_unique_graph
    t) for t in check_types]), type(op)))
TypeError: Expected a type in (<class 'tensorflow.python.framework.ops.Operation'>), got: <class 'tensorflow.python.ops.variables.Variable'>
It seems I am using the function as intended (https://www.tensorflow.org/api_docs/python/tf/contrib/graph_editor/graph_replace), and from the pull requests here it appears this was made to be a replacement for Theano's replace in TF.
Tensorflow has this weird distinction between ops and variables. Every variable has an `op` attribute. Try using `g.op` or something similar. You're close but probably need something small like that. Just keep debugging and it will work eventually.
Try the tensorflow github or forums for more details on getting replace to work.
More complicated example: https://github.com/bstriner/keras-adversarial/blob/master/keras_adversarial/backend/tensorflow_backend.py
If you want to start a feature request for pulling replace into the Keras backend it wouldn't be a terrible idea.
Cheers, Ben
TF variables are not part of the computation graph. Every variable is associated with a read operation that is part of the graph. So you aren't replacing variables, you're replacing the read operations associated with variables. Try `print model.trainable_weights[0]` vs `print model.trainable_weights[0].op`.
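A quick way to see this distinction is to build a variable in graph mode and inspect it. This is a minimal sketch using TF's v1-style graph mode (the variable name `w` is illustrative):

```python
import tensorflow as tf

tf.compat.v1.disable_eager_execution()  # build a graph, as in TF 1.x

v = tf.Variable(1.0, name="w")
print(type(v).__name__)     # a Variable wrapper, not a graph node itself
print(type(v.op).__name__)  # Operation: the node that actually lives in the graph
print(v.value())            # a Tensor: the read of the variable that the graph consumes
```

Graph-editing utilities operate on `Operation` and `Tensor` objects, which is why passing the `Variable` wrapper itself triggers the `TypeError` above.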
Hope that helps. TF syntax is a lot worse than theano for manipulating the graph.
Cheers
@bstriner Hi, I found you said "Use substitution to get the loss given new parameters. Then calculate the gradient of the new loss wrt the parameters." But none of the `_g1`, `_g2`, `_g3` examples shows how to substitute into the loss function and calculate the loss value at given new parameters. Could you show a toy example? Thanks.
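As a toy illustration of "substitute, then differentiate" outside Theano/TF entirely, here is the same computation in SymPy (assuming SymPy is installed; this mirrors the `y2`/`g2` example above, with x = 7):

```python
import sympy as sp

x = sp.Symbol('x')
loss = x ** 3                # the "loss" graph
loss2 = loss.subs(x, 2 * x)  # substitute new parameters (2x) into the loss
g2 = sp.diff(loss2, x)       # differentiate the substituted loss wrt x

print(loss2.subs(x, 7))  # loss at the new parameters: (2*7)**3 = 2744
print(g2.subs(x, 7))     # gradient of the new loss:   24*7**2 = 1176
```

The `subs` step plays the role of `theano.clone(..., replace=...)`: it produces a new expression with the parameter swapped, and only then is the derivative taken.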
Was anyone able to solve this problem?
I even copied the function 'clone_replace' from @bstriner 's code and utilized it like this:
grads = self.get_gradients(loss, params)
tempparams = [tf.add(a, a) for a in params]
replace = {p: npm for p, npm in zip(params, tempparams)}
gradsn = [clone_replace(g, replace) for g in grads]
This give the error:
TypeError: Expected a type in (<class 'tensorflow.python.framework.ops.Operation'>), got: <class 'tensorflow.python.ops.variables.Variable'>
If I use the `op` attribute, like
gradsn = [clone_replace(g.op, replace) for g in grads]
I get the error:
TypeError: Expected a type in (<class 'tensorflow.python.framework.ops.Tensor'>), got: <class 'tensorflow.python.framework.ops.Operation'>
Utterly confusing...
Hello,
I am a researcher in optimization and I am trying to write a custom optimizer. I have come across a problem.
Take any optimizer code, say just copy SGD. In the beginning of get_updates, you see

grads = self.get_gradients(loss, params)

Now add the following line right after this one:

gradsb = self.get_gradients(loss, [tf.Variable(a) for a in params])

This should compute the gradients at a new tensor, with all the values the same as before. Now try to see what you get:

for a in gradsb:
    print(a)

You get a list of Nones (but if you print the list grads you see that they are still Tensors).
Why? And how to circumvent this problem? This is important as I'd like to compute the gradients at another point for my algorithm.
On a perhaps related level, when is batch_size used? get_gradients just goes to the gradients backend. So in some sense does the keras wrapper modify the loss function to just include the batch at the current iteration? Where is this done? I would be interested in adaptively modifying the batch_size in the optimizer, eventually.