keras-team / keras

Deep Learning for humans
http://keras.io/
Apache License 2.0

Partially update parameters for the Embedding layer? #1021

Closed · chenych11 closed this issue 7 years ago

chenych11 commented 8 years ago

A very important feature for word embeddings is that the embeddings be updated partially, depending on the input indices, especially when the vocabulary is very large. The current code appears to update the embeddings of all words in the vocabulary for each mini-batch. We could compute the gradients w.r.t. a subtensor instead of the whole tensor to get this feature, as the official documentation shows: http://deeplearning.net/software/theano/tutorial/faq_tutorial.html. Other material also shows that updating a subtensor speeds up the weight update, for example: https://github.com/Theano/Theano/issues/3342

However, it seems there is no easy way to fulfill this. The problem is that layer.params must be a list of shared variables (unfortunately, subtensors are not shared variables), otherwise the optimizer raises an exception. Some tricks could be used to implement this feature, but they introduce dependencies between Layers and Optimizers. What a pity! Any ideas?
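For reference, this is roughly the pattern the linked Theano FAQ describes: take the gradient with respect to the subtensor of rows that actually occur in the batch, then scatter it back with `inc_subtensor`. A minimal sketch, not Keras code; the sizes, learning rate, and loss here are placeholders:

```python
import numpy as np
import theano
import theano.tensor as T

# Full embedding matrix as a single shared variable
W = theano.shared(np.random.randn(10000, 50).astype('float32'), name='W')

idx = T.ivector('idx')        # word indices appearing in the mini-batch
W_sub = W[idx]                # subtensor actually used in the forward pass
cost = T.sum(W_sub ** 2)      # stand-in for the real loss

# Gradient w.r.t. the subtensor only, scattered back into W with inc_subtensor
g_sub = T.grad(cost, wrt=W_sub)
updates = [(W, T.inc_subtensor(W_sub, -0.01 * g_sub))]

train = theano.function([idx], cost, updates=updates)
train(np.array([1, 5, 7], dtype='int32'))
```

The catch, as noted above, is that Keras optimizers expect the layer to expose the shared variable W itself in layer.params, so there is no clean place to express an update in terms of W_sub.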

dbonadiman commented 8 years ago

You are indeed raising a big problem that needs to be solved, since it causes the code to be incorrect when using optimisation functions with accumulators, as pointed out in the FAQ.

dbonadiman commented 8 years ago

I hate doing this, but I think this issue should be brought to @fchollet's attention to discuss possible changes. To be honest, this is wrong in all the libraries out there (Blocks, Lasagne), but I think it needs to be fixed.

lemuriandezapada commented 8 years ago

Since the embeddings are "shared", any change for one word is a change for all words. Or am I misunderstanding the problem?

dbonadiman commented 8 years ago

@lemuriandezapada The problem is that at each step the updates are calculated for all the embeddings in the dictionary. Assume you have 1M words in your embedding matrix: each time you backpropagate, you compute updates for all 1M embeddings, when most of the time only a few of them appear in the batch examples. That is always an overhead. Sometimes it is even worse: the embedding updates are mostly zero updates (words that do not "fire" in the forward pass should be treated as not present in the network at that moment), and using an optimizer that relies on a history of updates with accumulators leads to bad updates. In fact, it may cause problems when a previously unseen word finally appears: since it has a history of zero updates, the optimiser assumes the weight is "good", so it is updated only a little or not at all.

I hope I'm clear enough.
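To make the overhead part concrete, a minimal NumPy sketch (the sizes and gradients are made up): a dense update materialises a gradient the shape of the full matrix even though only a few rows are nonzero, while a sparse update only touches the rows seen in the batch.

```python
import numpy as np

vocab_size, dim, lr = 1_000_000, 300, 0.01
W = np.random.randn(vocab_size, dim).astype('float32')    # embedding matrix
batch_idx = np.array([3, 17, 42])                          # words seen in this batch
g_rows = np.ones((len(batch_idx), dim), dtype='float32')   # their gradients

# Dense update: a full vocab_size x dim gradient, almost entirely zeros
dense_grad = np.zeros_like(W)
dense_grad[batch_idx] = g_rows
W -= lr * dense_grad          # touches all 1M rows

# Sparse update: only the three rows that actually occurred in the batch
W[batch_idx] -= lr * g_rows
```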

dbonadiman commented 8 years ago

I did a simple evaluation on my dataset just to highlight the differences. I preloaded the embedding matrix with the Mikolov embeddings, then trained the network twice, once updating the embeddings and once keeping them fixed. I did this with two different optimisation methods: Adadelta, which uses the history of updates, and Adam, which does not. These are the results:

Adadelta:
Fixed Emb:   Acc: 0.6416502057
Dynamic Emb: Acc: 0.6421389015

Adam:
Fixed Emb:   Acc: 0.6570553647
Dynamic Emb: Acc: 0.6859934147

Although Adam performs slightly better even without updating the embeddings, I think the difference when the embeddings are refined is quite significant. @fchollet do you have any ideas on how to fix this issue in this library?

elbamos commented 8 years ago

I'm just seeing this and it may explain some things...

Am I correct in thinking that this means that the current Embedding class is broken if there's regularization applied to the embedding layer or any dependent layer?

around1991 commented 8 years ago

Does anyone have an update on this?

chentingpc commented 8 years ago

@elbamos Yes, regularization on embeddings would likely lead to suboptimal results. Many related oddities also depend on batch size, causing some "unexplainable" phenomena.
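One way to see why regularization interacts badly with the intended sparse behaviour: an L2 penalty on the embedding matrix contributes a gradient term proportional to W itself, which is dense over all rows, so even words that never appear in a batch are shrunk at every step. A minimal NumPy illustration with hypothetical sizes:

```python
import numpy as np

W = np.random.randn(1000, 50).astype('float32')  # embedding matrix
lam = 0.01

# Gradient of the L2 penalty lam * ||W||^2 is 2 * lam * W: dense over every row,
# so words absent from the batch still receive a nonzero update.
reg_grad = 2 * lam * W
print(np.count_nonzero(reg_grad.any(axis=1)))     # 1000: all rows affected
```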

chentingpc commented 7 years ago

UPDATE: this issue can be resolved by using the TensorFlow backend with a TensorFlow optimizer, e.g. tf.train.GradientDescentOptimizer, since the gradient of the tf.gather function is sparse. In the case of a large embedding weight W, this can produce a very significant speedup (e.g. 5x or more).
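A rough sketch of what is meant (TF 1.x style; the sizes and loss are toy placeholders): the gradient of tf.gather with respect to the full matrix comes back as tf.IndexedSlices, covering only the gathered rows, and the TensorFlow optimizer applies it as a sparse update.

```python
import numpy as np
import tensorflow as tf  # TensorFlow 1.x API, as in this era of Keras

W = tf.Variable(np.random.randn(100000, 50).astype('float32'), name='W')
idx = tf.placeholder(tf.int32, [None])

emb = tf.gather(W, idx)               # embedding lookup
loss = tf.reduce_sum(tf.square(emb))  # stand-in loss

opt = tf.train.GradientDescentOptimizer(0.01)
grads_and_vars = opt.compute_gradients(loss, var_list=[W])
# The gradient for W is a tf.IndexedSlices over the gathered rows only,
# so apply_gradients performs a sparse scatter update instead of a dense one.
train_op = opt.apply_gradients(grads_and_vars)

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    sess.run(train_op, feed_dict={idx: np.array([1, 5, 7], dtype=np.int32)})
```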

stale[bot] commented 7 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs, but feel free to re-open it if needed.