tf.scatter_add seems to be what you're looking for.
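For reference, a rough sketch of how tf.scatter_add could be applied to an embedding-style variable (TF 1.x-era graph API; the variable names and shapes here are made up for illustration):
import tensorflow as tf
W = tf.Variable(tf.zeros([1000, 50]))   # e.g. an embedding matrix
indices = tf.constant([3, 7, 7])        # duplicate indices are accumulated
updates = tf.ones([3, 50])
apply_update = tf.scatter_add(W, indices, updates)  # running this op increments rows 3 and 7 (row 7 twice)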
Great, thanks. I think the TF way of doing it actually makes more sense (but why do you need both scatter_add and scatter_sub...).
I'll write a Keras backend function to handle subtensor updates then - the API will probably be more similar to the TF version than the Theano version. Does anyone have any objections?
I've implemented this now in my own version of Keras. Is there a way of submitting a pull request on only a single commit? I've not really kept my fork in proper order, and I don't want to submit a pull request on the entire thing.
You can create a new fork/branch and copy&paste your changes there.
https://github.com/around1991/keras/commit/8c7a3c2b54821fe22e5137b401f18f83658c47fc
This is the commit I want to submit a pull request for. I can't seem to create a new fork, as I've already forked Keras once, and creating a new branch just bases it off my existing master, not the upstream master. I don't really want to go through and change everything manually, as the diff is quite long. Is there a way to just submit a PR from this commit?
Branch from upstream, get it up-to-date and copy your stuff over. PRs can only be submitted for branches.
Have you benchmarked the benefits of such a solution? Mainly in three aspects, I think: (1) speed-up over updating the whole embedding, especially for large embedding matrices; (2) better accuracy when using regularization on the embedding layer; and (3) better accuracy with optimizers that depend on "gradient history", like momentum and RMSprop.
A time and performance benchmark would be great.
To submit a PR: you can just check out master, create a new branch, reintroduce your changes there, then PR from the new branch.
The main issue with doing this is that set_subtensor (and scatter_update) doesn't behave nicely if there are multiple updates for the same index in the batch: instead of combining parameter updates, it just non-deterministically picks one and applies that update.
This makes training more stochastic and (anecdotally for me) slower in terms of number of epochs needed for convergence. Each epoch is quicker time-wise though.
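A small Theano sketch of that behaviour (made-up shapes, just to show the duplicate-index difference between set_subtensor and inc_subtensor):
import numpy as np
import theano
import theano.tensor as T
W = theano.shared(np.zeros((4, 2), dtype='float32'))
idx = T.ivector('idx')
upd = T.fmatrix('upd')
f_set = theano.function([idx, upd], T.set_subtensor(W[idx], upd))  # one of the duplicate updates "wins"
f_inc = theano.function([idx, upd], T.inc_subtensor(W[idx], upd))  # duplicate updates are summed
ones = np.ones((2, 2), dtype='float32')
print(f_set(np.array([1, 1], dtype='int32'), ones))  # row 1 ends up as 1.0
print(f_inc(np.array([1, 1], dtype='int32'), ones))  # row 1 ends up as 2.0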
Can you use inc_subtensor for your case? It adds all the values for the same index, as you want.
Hi nouiz
The issue is that sometimes you do just have to use set_subtensor: for instance, the update rule for momentum in SGD+momentum reads
new_momentum = momentum * old_momentum - lr * g
and there's no way of rewriting this so that one can use inc_subtensor.
I'm pretty sure you can do something. I don't have time to look into the details and do it, but at worst you can do the increment before doing the set_subtensor.
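For example, a hypothetical sketch of that rewrite (names like velocity, indices and g are made up): scale the whole momentum buffer densely first, then add the sparse gradient term with inc_subtensor so that duplicate indices accumulate instead of overwriting each other. Note this still touches every row of the buffer for the scaling step.
import numpy as np
import theano
import theano.tensor as T
velocity = theano.shared(np.zeros((1000, 50), dtype='float32'))  # full momentum buffer
indices = T.ivector('indices')
g = T.fmatrix('g')            # gradient for the rows selected by indices
lr, mu = 0.1, 0.9
scaled = mu * velocity                                    # dense elementwise scale
new_velocity = T.inc_subtensor(scaled[indices], -lr * g)  # sparse part; duplicates are summed
step = theano.function([indices, g], [], updates=[(velocity, new_velocity)])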
@around1991: you can first do momentum * old_momentum, then use inc_subtensor.
Just confirming one thing with @nouiz: if I use u = W[0], v = W[0], loss = f(u, v), and take T.grad(loss, [u, v]), will du and dv be different if u and v are used differently in the computation graph? Or will they be the same no matter what, since they both index the same location in W?
du and dv will be different if they do different computations inside "loss". Mostly, Theano's grad will see that they are different variable instances (it won't see that they are equivalent) and compute a grad for each of them.
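A quick sketch of what that means in practice (with a made-up loss, just to show the two gradients coming out different):
import numpy as np
import theano
import theano.tensor as T
W = theano.shared(np.ones((3, 2), dtype='float32'))
u = W[0]      # one Subtensor instance
v = W[0]      # a separate Subtensor instance reading the same row
loss = T.sum(u ** 2) + T.sum(3 * v)
du, dv = T.grad(loss, [u, v])    # each instance gets its own gradient
f = theano.function([], [du, dv])
print(f())                       # du = 2 * W[0] = [2, 2], dv = [3, 3]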
@nouiz:
u = W[indeces]
loss = f(u)
g = T.grad(loss, W[indeces])
If it works, what's the value of g?
@NasenSpray: it's the subtensor consisting of the gradients of f wrt the subindexed elements of W. g.shape = indeces.shape, most importantly.
You should use:
u = W[indeces]
loss = f(u)
g = T.grad(loss, u)
The small difference is important. The grad method only checks variable instances in the graph, not equivalent computations. Your version will crash, telling you that W[indeces] is not used to compute loss.
See this doc page: http://deeplearning.net/software/theano/tutorial/faq_tutorial.html?highlight=inc_subtensor#how-to-update-a-subset-of-weights
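Putting that together with the update step, a minimal sketch of the pattern from that FAQ entry (placeholder loss and shapes, and plain SGD rather than a real optimizer):
import numpy as np
import theano
import theano.tensor as T
W = theano.shared(np.random.randn(1000, 50).astype('float32'))
indeces = T.ivector('indeces')
lr = 0.1
u = W[indeces]          # only the rows that fire in this batch
loss = T.sum(u ** 2)    # stand-in for f(u)
g = T.grad(loss, u)     # gradient w.r.t. the subtensor instance u, not a fresh W[indeces]
train = theano.function([indeces], loss, updates=[(W, T.inc_subtensor(u, -lr * g))])
train(np.array([3, 7, 7], dtype='int32'))  # duplicate indices contribute twice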
@nouiz: Okay, that's what I thought.
u = W[indeces]
loss = f(u)
g_u = T.grad(loss, u)
g_W = T.inc_subtensor(W.zeros_like()[indeces], g_u)
assert g_W.shape == W.shape # yes?
deduplicated_indeces = T.inc_subtensor(W.zeros_like()[indeces], 1).nonzero() # is there a better way?
g_W_reduced = g_W[deduplicated_indeces]
W_reduced = W[deduplicated_indeces]
next_W_reduced = do_optimizer_stuff(W_reduced, g_W_reduced)
next_W = T.set_subtensor(W[deduplicated_indeces], next_W_reduced)
Would this work?
[Edit] Or would this be better/faster?
u = W[indeces]
loss = f(u)
g_W = T.grad(loss, W)
dedup_indeces = T.inc_subtensor(W.zeros_like()[indeces], 1).nonzero()
next_W = T.set_subtensor(W[dedup_indeces], do_optimizer_stuff(W[dedup_indeces], g_W[dedup_indeces]))
I have a simple implementation of partial updates for embedding layers. My benchmarks on rating-prediction problems using DNNs show it can improve speed under some conditions (i.e. when only part of the embedding is used): e.g. on the Netflix 100M ratings it can be ~10% faster, with no difference in accuracy. But I haven't tested other tasks, like LSTMs.
Hello, I ran into this issue as well. Any updates?
As #1021 says, it would be good to only update the components of an Embedding layer that fire during a forward pass. In Theano, this can be implemented using the inc_subtensor method, but that would horrendously break TensorFlow compatibility.
Is there a similar function available in TensorFlow, so I can try to make a solution that works across both backends? If I ever get it done, I can try to submit a pull request for my solution - the main promise is a massive speedup in models with big embeddings!
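For what it's worth, a hypothetical sketch of the kind of backend-agnostic helper being proposed here (sparse_update is a made-up name, not an existing Keras backend function):
from keras import backend as K
def sparse_update(variable, indices, increments):
    # Return an expression/op that adds `increments` to `variable[indices]`,
    # summing contributions for duplicate indices on both backends.
    if K.backend() == 'theano':
        import theano.tensor as T
        return T.inc_subtensor(variable[indices], increments)
    else:  # 'tensorflow'
        import tensorflow as tf
        return tf.scatter_add(variable, indices, increments)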