Tensorflow equivalent of inc_subtensor #2040

Closed. around1991 closed this issue 7 years ago.

around1991 commented 8 years ago

As #1021 says, it would be good to only update the components of an Embedding layer that fire during a forward pass. In Theano, this can be implemented using the inc_subtensor method, but this would horrendously break Tensorflow compatibility.

Is there a similar function available in TensorFlow so I can try to make a solution that works across both backends? If I ever get it done, I'll try to submit a pull request for my solution - the main promise is a massive speedup in models with big embeddings!

NasenSpray commented 8 years ago

tf.scatter_add seems to be what you're looking for
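For reference, a minimal sketch of how that could look with the TF 1.x-style graph API (the variable, shapes and values here are made up for illustration):

import tensorflow as tf

# Hypothetical 5 x 3 embedding matrix; the rows listed in `indices` get the
# corresponding rows of `updates` added to them. Duplicate indices are
# accumulated, like Theano's inc_subtensor.
embeddings = tf.Variable(tf.zeros([5, 3]))
indices = tf.constant([0, 0, 2])
updates = tf.ones([3, 3])
sparse_add = tf.scatter_add(embeddings, indices, updates)

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    print(sess.run(sparse_add))  # row 0 ends up at 2.0, row 2 at 1.0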

around1991 commented 8 years ago

Great, thanks. I think the TF way of doing it actually makes more sense (but why do you need both scatter_add and scatter_sub...).

I'll write a Keras backend function to handle subtensor updates then - the API will probably be more similar to the TF version than the Theano version. Does anyone have any objections?

around1991 commented 8 years ago

I've implemented this now in my own version of Keras. Is there a way of submitting a pull request on only a single commit? I've not really kept my fork in proper order, and I don't want to submit a pull request on the entire thing.

NasenSpray commented 8 years ago

You can create a new fork/branch and copy&paste your changes there.

around1991 commented 8 years ago

https://github.com/around1991/keras/commit/8c7a3c2b54821fe22e5137b401f18f83658c47fc

This is the commit I want to submit a pull request for. I can't seem to create a new fork, as I've already forked Keras once, and creating a new branch just bases it off my existing master, not the upstream master. I don't really want to go through and change everything manually, as the diff is quite long. Is there a way to just submit a PR from this commit?

NasenSpray commented 8 years ago

Branch from upstream, get it up-to-date and copy your stuff over. PRs can only be submitted for branches.

chentingpc commented 8 years ago

Have you benchmarked the benefits of such a solution? Mainly in three aspects, I think: (1) speed-up over updating the whole embedding, especially in the large-embedding-matrix case, (2) better accuracy when using regularization on the embedding layer, and (3) better accuracy with optimizers that depend on "gradient history", like momentum and rmsprop.

fchollet commented 8 years ago

A time and performance benchmark would be great.

To submit a PR: you can just checkout to master, create a new branch, and reintroduce your changes there, then PR from the new branch.

around1991 commented 8 years ago

The main issue with doing this is that set_subtensor (and scatter_update) doesn't behave nicely if there are multiple updates for the same index in the batch: instead of combining parameter updates, it just non-deterministically picks one and applies that update.

This makes training more stochastic and (anecdotally for me) slower in terms of number of epochs needed for convergence. Each epoch is quicker time-wise though.
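To make that concrete, a small sketch of the difference (assuming a Theano backend; the array, indices and values are made up):

import numpy as np
import theano
import theano.tensor as T

W = theano.shared(np.zeros(3, dtype='float32'))
idx = [0, 0]  # the same row appears twice in the batch
upd = T.constant(np.array([1., 2.], dtype='float32'))

set_fn = theano.function([], T.set_subtensor(W[idx], upd))
inc_fn = theano.function([], T.inc_subtensor(W[idx], upd))

print(set_fn())  # one of the two updates wins, the other is silently dropped
print(inc_fn())  # [3., 0., 0.] -- duplicate updates are summed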

nouiz commented 8 years ago

Can the GPU use inc_subtensor for your case? It adds all the values for the same index, as you want.

around1991 commented 8 years ago

Hi nouiz

The issue is that sometimes you do just have to use set_subtensor: for instance, the update rule for momentum in SGD+momentum reads

new_momentum = momentum * old_momentum - lr * g

and there's no way of rewriting this so that one can use inc_subtensor.

nouiz commented 8 years ago

I'm pretty sure you can do something. I don't have time to go look at the details and do it, but at worst you can do the increment before doing the set_subtensor.
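For what it's worth, one way to write that rule so that only the gradient term needs accumulation (a sketch of the idea, not necessarily what nouiz has in mind; m, idx, g, mu and lr are hypothetical names for the momentum buffer, the touched rows, their per-occurrence gradients, and the scalars):

import numpy as np
import theano
import theano.tensor as T

m = theano.shared(np.ones(4, dtype='float32'))  # momentum buffer
idx = [1, 1, 2]                                 # touched rows, with a duplicate
g = T.constant(np.array([0.5, 0.5, 1.0], dtype='float32'))
mu, lr = 0.9, 0.1

# Scale the touched rows first: duplicates all receive the same value, so the
# set_subtensor ambiguity does not matter here.
scaled = T.set_subtensor(m[idx], mu * m[idx])
# Then add the gradient term: duplicate rows are accumulated.
new_m = T.inc_subtensor(scaled[idx], -lr * g)
print(theano.function([], new_m)())  # row 1: 0.9 - 0.05 - 0.05, row 2: 0.9 - 0.1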

chentingpc commented 8 years ago

@around1991 you can first do momentum * old_momentum, then use inc_subtensor.

Just confirming one thing with @nouiz: if I use u = W[0], v = W[0], loss = f(u, v), and take T.grad(loss, [u, v]), will du and dv be different if u and v are used differently in the computation graph? Or will they be the same no matter what, since they both index the same real location in W?

nouiz commented 8 years ago

du and dv will be different if they do different computations in the computation of "loss". Mostly, Theano's grad will see that they are different variable instances (and won't see that they are equivalent) and compute a grad for each of them.
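A small sketch of that behaviour, assuming Theano (the loss is made up just so that u and v play different roles):

import numpy as np
import theano
import theano.tensor as T

W = theano.shared(np.ones((3, 2), dtype='float32'))
u = W[0]  # two distinct graph variables...
v = W[0]  # ...that happen to index the same row of W
loss = (u ** 2).sum() + (3 * v).sum()
du, dv = T.grad(loss, [u, v])
print(theano.function([], [du, dv])())  # du == [2., 2.], dv == [3., 3.]
# Neither gradient contains the other's contribution; to update W you still
# have to accumulate them back into the full matrix yourself.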

NasenSpray commented 8 years ago

@nouiz:

u = W[indeces]
loss = f(u)
g = T.grad(loss, W[indeces])

If it works, what's the value of g?

around1991 commented 8 years ago

@NasenSpray: it's the subtensor consisting of the gradients of f wrt the subindexed elements of W. g.shape = indeces.shape, most importantly.

nouiz commented 8 years ago

You should use:

u = W[indeces]
loss = f(u)
g = T.grad(loss, u)

The small difference is important. The grad method only checks the instance of each variable in the graph, not the equivalent computation. Your version will crash, telling you that W[indeces] is not used to compute loss.

See this doc page: http://deeplearning.net/software/theano/tutorial/faq_tutorial.html?highlight=inc_subtensor#how-to-update-a-subset-of-weights
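For completeness, a minimal sketch of the pattern that FAQ entry describes (assuming Theano; the loss, sizes and learning rate are placeholders):

import numpy as np
import theano
import theano.tensor as T

W = theano.shared(np.random.randn(10, 4).astype('float32'))
indeces = T.ivector('indeces')
u = W[indeces]         # only the rows used in this batch
loss = (u ** 2).sum()  # stand-in for the real loss f(u)
g = T.grad(loss, u)    # gradient w.r.t. the subtensor, not all of W
lr = 0.1
train = theano.function([indeces], loss,
                        updates=[(W, T.inc_subtensor(u, -lr * g))])
train(np.array([0, 3, 3], dtype='int32'))  # duplicate rows accumulate their updates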

NasenSpray commented 8 years ago

The small difference is important. The grad method only checks the instance of each variable in the graph, not the equivalent computation. Your version will crash, telling you that W[indeces] is not used to compute loss.

@nouiz: Okay, that's what I thought.

u = W[indeces]
loss = f(u)
g_u = T.grad(loss, u)
g_W = T.inc_subtensor(W.zeros_like()[indeces], g_u)
assert g_W.shape == W.shape # yes?
deduplicated_indeces = T.inc_subtensor(W.zeros_like()[indeces], 1).nonzero() # is there a better way?
g_W_reduced = g_W[deduplicated_indeces]
W_reduced = W[deduplicated_indeces]
next_W_reduced = do_optimizer_stuff(W_reduced, g_W_reduced)
next_W = T.set_subtensor(W[deduplicated_indeces], next_W_reduced)

Would this work?


[Edit] Or would this be better/faster?

u = W[indeces]
loss = f(u)
g_W = T.grad(loss, W)
dedup_indeces = T.inc_subtensor(W.zeros_like()[indeces], 1).nonzero()
next_W = T.set_subtensor(W[dedup_indeces],
                         do_optimizer_stuff(W[dedup_indeces], g_W[dedup_indeces]))

chentingpc commented 8 years ago

I have a simple implementation of partial updates for embedding layers. My benchmarks on rating prediction problems using DNNs show it can improve speed under some conditions (i.e. when only part of the embedding matrix is used per batch): e.g. on the Netflix 100M-ratings data it is ~10% faster, and the accuracy is no different. But I haven't tested it with other tasks, like LSTMs.

wddabc commented 7 years ago

Hello, I ran into this issue as well. Any updates?