Thanks for adding this, I learnt something today :)
What's the performance of this like, given you perform the weight decay on the CPU?
On one hand doing this on the GPU is probably faster, but adding 50% GPU memory usage (double the weights, but no extra gradients) is also not always great. One option could be having a consume_less parameter, the same way Recurrent layers do, to decide whether to run this on the CPU or the GPU.
This technique is called Polyak averaging. We could call it PolyakAveragingCheckpointer or something like that. The proper way to do it is to maintain a moving average of the weights that is updated after every batch. It should be implemented in a way similar to the moving averages in the BN layer. It isn't entirely clear whether it can be achieved (efficiently) with a simple callback; the update should be made part of the main graph run.
It would definitely be a valuable addition to Keras. Have you thought about allowing saving after a number of batches, instead of after every epoch? That can be useful when each epoch is very long. The same is true for the regular model checkpointer as well.
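For illustration, a minimal sketch of what the per-batch moving-average ops could look like with the Keras backend (the function name and decay value are made up for this example, and K.moving_average_update is assumed to be available, as it is in recent Keras versions):

```python
from keras import backend as K

def make_ema_ops(model, decay=0.999):
    # One shadow variable per trainable weight, initialized to the current value.
    ema_weights = [K.variable(K.get_value(w)) for w in model.trainable_weights]
    # Same primitive the BatchNormalization layer uses for its moving statistics;
    # these ops would then have to run as part of the training function, once per batch.
    ema_updates = [K.moving_average_update(ema_w, w, decay)
                   for ema_w, w in zip(ema_weights, model.trainable_weights)]
    return ema_weights, ema_updates
```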
@kuza55 Sure :) I tested this on Theano only, and the transfer should happen on the GPU if the GPU is selected as the device. Model parameters in the Keras Theano backend are created as shared variables and reside on the GPU, so the copy is done on the GPU. I am noticing very little overhead when running on the GPU, as you guessed, but I have just tried it on LSTMs and this needs further testing. Moreover, the consume_less parameter for recurrent layers does not decide whether to run operations on the CPU or GPU; that is decided by the device flag in Theano. It only provides implementations that are optimized for different devices (e.g. you can set consume_less to cpu and still run the layer on the GPU).
@fchollet Thanks for the comment. If you look at my gist, I am indeed updating the averaged parameters after every batch, as you mentioned. I only save the moving-averaged copy to disk after every epoch, which could be changed to save after a number of batches as you suggested. I will check the BN layer later today to see how I can make the update part of the main graph run. Any other thoughts?
I'm not sure whether I don't understand Theano or you misunderstood my comment, but my reading of your code is that you use K.batch_get_value to read all the weights into Python, compute the weighted average on the CPU, and then transfer the weights back to the GPU, whereas you could use K.update etc. to run the updates on the GPU without ever bringing the weights onto the CPU.
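Something along these lines, sketched and untested; alpha and the variable names are illustrative, and model is assumed to be the compiled Keras model:

```python
from keras import backend as K

alpha = 0.999  # EMA decay, illustrative
ema_weights = [K.variable(K.get_value(w)) for w in model.trainable_weights]
ema_updates = [K.update(ema_w, alpha * ema_w + (1. - alpha) * w)
               for ema_w, w in zip(ema_weights, model.trainable_weights)]
# Compile the updates once; calling ema_step([]) from on_batch_end then runs the
# averaging on whatever device the weights live on, with no round trip through NumPy.
ema_step = K.function([], [], updates=ema_updates)
```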
Oh I see your point. You are right, I was mainly thinking about the set_value() function.
I am not sure we can add the moving average update operation to the main graph when using the callback approach. The training function for Keras models is compiled before any callbacks are invoked (see _make_train_function in engine/training.py).
One option to bring the update into the main graph would be to modify the compile function of the Model class and append the new update operations to self.updates. We should also take care of transferring the moving-averaged weights to the original model weights at the end of training. I'm not sure what the API should look like then.
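A rough, version-dependent sketch of that idea, untested: here the ops are registered through a layer's add_update so that they end up in self.updates and get picked up by the training function, the same path the BN moving averages take; attach_ema_updates is a made-up name.

```python
from keras import backend as K

def attach_ema_updates(model, decay=0.999):
    # Register the EMA ops on an existing layer so they are collected into
    # model.updates and run by the training function on every batch.
    ema_weights = [K.variable(K.get_value(w)) for w in model.trainable_weights]
    host = model.layers[-1]
    for ema_w, w in zip(ema_weights, model.trainable_weights):
        host.add_update(K.moving_average_update(ema_w, w, decay))
    # Caveat: within a single graph run the EMA op is not guaranteed to see the
    # post-optimizer-step weight values; the ordering question comes up below.
    return ema_weights
```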
Thanks for this!
In section 7.2 of the Adam paper they suggest using the beta_2 parameter (default 0.999) as the 'momentum parameter' in an EWMA of the weights, and reference Polyak. They also give the initialization bias correction (like the one used in Adam itself for momentum). I didn't understand it when I first read it, and it was thanks to the comments here that I worked it out. I had great success yesterday applying Polyak+Adam to image synthesis from convnets (neural style): the results had far less noise than with any unaveraged optimizer I tried. I tried to incorporate it into Keras' Adam optimizer and failed, since I couldn't figure out how to take gradients with respect to something other than the averaged weights.
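For anyone else puzzling over that section, here is the update as I understand it, written out with stand-in NumPy arrays (beta_2 plays the role of the decay, and the division is the initialization bias correction):

```python
import numpy as np

beta_2 = 0.999                       # Adam's beta_2, reused as the averaging coefficient
w = np.zeros(3)                      # stand-in for the model weights
avg = np.zeros_like(w)               # shadow average, initialized to zero
for t in range(1, 1001):             # t = training step, starting at 1
    w += 0.01 * np.random.randn(3)   # stand-in for an optimizer update
    avg = beta_2 * avg + (1. - beta_2) * w
    avg_hat = avg / (1. - beta_2 ** t)  # bias-corrected average, as in Adam's moment estimates
```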
So, I'm thinking the best way to implement this in the main graph would be to create a layer that replaces your output layer. The PolyakAveragingLayer would be a simple identity pass-through in terms of data, but would examine the _keras_history of its input, find all the layers that feed into it, extract their trainable_weights property, allocate the shadow weights, and register an update op.
The question of where to allocate the weights remains; in TF there is tf.Operation.device which would let us allocate it wherever the original is, not sure about Theano.
I'm not sure how this should integrate with validation (should these weights be used for validation?) or inference (should they be used for inference by default?); or should there just be a special function on the layer to save the model to a file, similar to _make_mv_model?
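A very rough sketch of what that layer could look like (the graph walk is simplified to the immediately preceding layer, the class and attribute names are placeholders, and the Layer import path varies by Keras version):

```python
from keras import backend as K
from keras.engine.topology import Layer  # keras.layers.Layer in newer versions

class PolyakAveragingLayer(Layer):
    """Identity pass-through that registers EMA update ops for upstream weights."""
    def __init__(self, decay=0.999, **kwargs):
        super(PolyakAveragingLayer, self).__init__(**kwargs)
        self.decay = decay

    def call(self, inputs):
        # A complete version would walk _keras_history recursively to reach
        # every upstream layer; here we only look one layer back.
        upstream_layer, _, _ = inputs._keras_history
        self.shadow_weights = [K.variable(K.get_value(w))
                               for w in upstream_layer.trainable_weights]
        self.add_update([K.moving_average_update(s, w, self.decay)
                         for s, w in zip(self.shadow_weights,
                                         upstream_layer.trainable_weights)],
                        inputs)
        return inputs  # identity in terms of data
```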
The proper way to do this is most likely to add a polyak_averaging option in compile, which would create EMA ops to be run as part of the main graph call.
When calling a test-time function, we would extract the model's weights, replace them with the EMA weights, run the predictions, then set back the initial weights.
Note that Polyak averaging in general is useful and widespread enough that it does deserve a compile option.
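The swap-in/swap-out step could look roughly like this (assuming ema_weights holds one backend variable per entry of model.get_weights(), in the same order; none of this is an existing Keras API):

```python
from keras import backend as K

def predict_with_averaged_weights(model, ema_weights, x):
    backup = model.get_weights()                          # stash the live weights
    model.set_weights([K.get_value(w) for w in ema_weights])
    preds = model.predict(x)
    model.set_weights(backup)                             # restore them for further training
    return preds
```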
Polyak averaging proper takes an equal-weighted average of every past iterate - it's not an EMA. (See Polyak's paper.) I came across the EMA idea in the Adam paper. When I implemented it for image synthesis I just picked Polyak's formulation because it was easier for me to implement and with only a few hundred iterations the difference was negligible.
Likewise it seems there are two different things people do with the averaged weights: either they use them for validation and inference or they use them as part of an ensemble with the unaveraged weights and therefore want to validate and save both.
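For reference, the equal-weighted version is just a running mean of the iterates, which is also cheap to compute incrementally:

```python
import numpy as np

w = np.zeros(3)                      # stand-in for the weights
avg = np.copy(w)
for t in range(1, 501):              # t = iterate index
    w += 0.01 * np.random.randn(3)   # stand-in for an optimizer step
    avg += (w - avg) / t             # running mean of w_1 ... w_t
```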
@crowsonkb Thanks for pointing to the relevant section of the Adam paper.
@fchollet I like the polyak_averaging option in compile, maybe accepting three values: 0 (default, no averaging), 1 (EMA with a decay parameter), 2 (simple average). Should we compile a separate function from train_function to make sure the EMA ops are performed only after the weights have been updated by the optimizer? Any thoughts regarding @kuza55's comment on device placement for these operations? In the TF implementation, EMA operations are performed on the same device as the original variables to lower communication bandwidth.
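On the device question, the TF-side pattern (as used by tf.train.ExponentialMovingAverage) is simply to create each shadow variable under the device of the variable it tracks; a TF 1.x-style sketch:

```python
import tensorflow as tf

def make_shadow_variables(variables):
    shadows = []
    for v in variables:
        with tf.device(v.device):  # co-locate the shadow with the original variable
            shadows.append(tf.Variable(v.initialized_value(), trainable=False))
    return shadows
```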
So, I sketched out an implementation of Polyak averaging as part of model.compile (https://github.com/fchollet/keras/compare/master...kuza55:polyak) and I don't particularly like it, since the actual need for it to be in the core seems pretty minimal. The main reason to put it in the core would be so that it can easily get all the weights and register its update op, and so that it can switch out the weights for validation/inference.
Here is an alternate proposal: a Model wrapper that takes a model, makes a copy with unshared weights (how? maybe via the model de/serialization code?), and returns a model that wraps the two. It would set its own update ops and use the K.learning_phase() property to decide whether to run the learning model and then update the Polyak model, to run only the Polyak model, or to run both and sum the outputs. You could then easily use it on sub-graphs too.
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs, but feel free to re-open it if needed.
@fchollet Is this in the roadmap for the future?
I also find this feature very interesting. In the meantime, I found an intermediate solution for my needs. Since the averaged model is only useful at the end of training, I only start averaging the weights after the first N steps, which reduces the overhead.
In my case I use a train_on_batch loop because my data does not fit in memory, so it is easy for me to start the averaging after those N steps and keep it running for an additional number of steps.
Another advantage of my approach (for me) is that I don't have much GPU memory, so averaging the parameters on the CPU is a plus.
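A sketch of that loop, with made-up names (batch_generator, N) and a plain running mean kept as NumPy arrays so it never touches GPU memory:

```python
import numpy as np

N = 10000                      # warm-up steps before averaging starts (illustrative)
avg_weights, n_avg = None, 0
for step, (x_batch, y_batch) in enumerate(batch_generator):
    model.train_on_batch(x_batch, y_batch)
    if step >= N:
        w = model.get_weights()            # NumPy copies, so the average lives on the CPU
        n_avg += 1
        if avg_weights is None:
            avg_weights = [np.copy(a) for a in w]
        else:
            avg_weights = [a + (b - a) / n_avg for a, b in zip(avg_weights, w)]
# model.set_weights(avg_weights) would load the averaged weights at the end of training
```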
This issue has been automatically marked as stale because it has not had recent activity. It will be closed after 30 days if no further activity occurs, but feel free to re-open a closed issue if needed.
Stumbled upon this issue when searching online. This is a really useful feature. There is already a user-implemented callback here: https://github.com/alno/kaggle-allstate-claims-severity/blob/master/keras_util.py. I hope this can be officially supported.
Any updates on this one? Would be really helpful :)
I think this is one of the most important features that Keras is missing!
Yes this is important indeed.
This, I think, is almost required for a lot of GAN architectures like BigGAN and StyleGAN, so it would practically be a requirement for tf.keras for GAN research.
It's been implemented in TensorFlow Addons: tfa.optimizers.MovingAverage. Compatible with tf.keras and TF 2.0.
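Usage looks roughly like this (untested here; model, x_train and y_train are assumed to be defined, and the exact arguments may differ between TF Addons versions, so check the docs):

```python
import tensorflow as tf
import tensorflow_addons as tfa

opt = tfa.optimizers.MovingAverage(tf.keras.optimizers.Adam(1e-3), average_decay=0.999)
model.compile(optimizer=opt, loss='sparse_categorical_crossentropy')
model.fit(x_train, y_train, epochs=10)
# Copy the averaged values into the model's variables before saving/evaluating.
opt.assign_average_vars(model.variables)
model.save_weights('averaged_weights.h5')
```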
Great stuff! Can an EMA model exist alongside the model being trained somehow? (In certain GANs such as StyleGAN you train the discriminator on the EMA of the generator and vice versa.)
It is often useful to have a second copy of the model being trained that maintains an exponential moving average of the model weights. This can result in improved performance on the validation/test set. It is already available in TensorFlow (https://www.tensorflow.org/versions/r0.10/api_docs/python/train.html#ExponentialMovingAverage) and would be a useful feature for training with Keras.
See the "Model Ensembles" section here for an intuitive description: http://cs231n.github.io/neural-networks-3/
I implemented this for my Keras experiments using a callback function: https://gist.github.com/soheilb/c5bf0ba7197caa095acfcb69744df756
Feedback is very welcome. I can later submit a pull request for this if it turns out that implementing it as a callback is a good enough solution.
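For readers who just want the shape of the callback approach, here is a much-simplified sketch (not the gist itself; the class name, decay, and filepath are illustrative):

```python
import numpy as np
from keras.callbacks import Callback

class MovingAverageCheckpoint(Callback):
    """Keep an EMA of the weights in NumPy and save the averaged copy each epoch."""
    def __init__(self, decay=0.999, filepath='ema_weights.h5'):
        super(MovingAverageCheckpoint, self).__init__()
        self.decay = decay
        self.filepath = filepath

    def on_train_begin(self, logs=None):
        self.ema_weights = [np.copy(w) for w in self.model.get_weights()]

    def on_batch_end(self, batch, logs=None):
        for ema_w, w in zip(self.ema_weights, self.model.get_weights()):
            ema_w *= self.decay                 # in-place, so the stored arrays are updated
            ema_w += (1. - self.decay) * w

    def on_epoch_end(self, epoch, logs=None):
        backup = self.model.get_weights()       # swap in the averaged weights to save them
        self.model.set_weights(self.ema_weights)
        self.model.save_weights(self.filepath)
        self.model.set_weights(backup)          # and swap the live weights back
```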