Open kiudee opened 6 years ago
This makes a lot of sense to me. Nice! Let's do it!
Let's do it once our training window contains solely V2 chunks. Then I can train the same net in parallel.
The paper says you may need to run the resulting net over the training set in order to recompute the batch norm statistics as well. In LZ it seems that this is not strictly necessary, though.
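For illustration, here is a minimal numpy sketch of that statistics pass, assuming a hypothetical net interface (`forward_collect_bn` and `set_bn_statistics` are invented for this sketch and are not part of lc0 or any particular framework):

```python
import numpy as np

def recompute_bn_statistics(net, training_batches):
    """Re-estimate batch-norm running mean/variance for the averaged net.

    net.forward_collect_bn(batch) is assumed to run a forward pass and
    return {layer_name: (N, channels) array} of pre-normalization
    activations; net.set_bn_statistics(...) stores the new statistics.
    Both are hypothetical stand-ins for the real training code.
    """
    sums, sq_sums, counts = {}, {}, {}
    for batch in training_batches:
        for name, acts in net.forward_collect_bn(batch).items():
            sums[name] = sums.get(name, 0.0) + acts.sum(axis=0)
            sq_sums[name] = sq_sums.get(name, 0.0) + (acts ** 2).sum(axis=0)
            counts[name] = counts.get(name, 0) + acts.shape[0]
    for name in sums:
        mean = sums[name] / counts[name]
        var = sq_sums[name] / counts[name] - mean ** 2
        net.set_bn_statistics(name, mean, var)
```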
Since SWA was successful for Leela Zero in producing stronger network weights (see https://github.com/gcp/leela-zero/issues/814, https://github.com/gcp/leela-zero/issues/1030), I want to record this as a possible improvement here.
What is Stochastic Weight Averaging?
Izmailov et al. (2018) discovered that SGD explores regions of the weight space where networks with good performance lie, but does not converge to the central point of such a region. By tracking a running average of the weights along the SGD trajectory, they were able to find better solutions than those found by SGD alone. They also demonstrate that SWA leads to solutions in wider optima, which is conjectured to be important for generalization.
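Concretely, the running average is just an incremental mean over weight snapshots. A minimal numpy sketch (`w_swa` and `w` are flattened weight vectors, `n_models` counts how many snapshots have been averaged so far; the names are my own, not from the paper or lc0):

```python
import numpy as np

def swa_update(w_swa: np.ndarray, w: np.ndarray, n_models: int) -> np.ndarray:
    """Fold the current SGD weights into the running SWA average."""
    return (w_swa * n_models + w) / (n_models + 1)
```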
For a comparison of SWA and SGD with a ResNet-110 on CIFAR-100, see the corresponding figure in Izmailov et al. (2018).
Implementation
The implementation is straightforward, because the only thing we need to do is maintain a running average of the weights in addition to the current weight vector. Since we use batch normalization, we also need to recompute the running means and variances for the resulting network.
The full algorithm can be seen in the paper (Algorithm 1 of Izmailov et al., 2018); a rough sketch is given below.
The authors recommend starting from a pretrained model before beginning to average the weights. We get this for free, since we always initialize with the last best network.
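Putting the pieces together, here is a rough end-to-end sketch under the same assumptions as above (`load_last_best_network`, `train_one_cycle`, `get_weights`/`set_weights` are hypothetical stand-ins, not the actual lczero-training pipeline; `swa_update` and `recompute_bn_statistics` are the sketches from earlier in this thread):

```python
def train_with_swa(num_cycles: int, swa_start: int, training_batches):
    """Sketch of SGD training with a parallel SWA weight average."""
    net = load_last_best_network()    # pretrained start; we get this for free
    w_swa = net.get_weights()         # flattened numpy weight vector (hypothetical)
    n_models = 0

    for cycle in range(num_cycles):
        train_one_cycle(net)          # ordinary SGD updates for one cycle/epoch
        if cycle >= swa_start:
            # fold the current weights into the running SWA average
            w_swa = swa_update(w_swa, net.get_weights(), n_models)
            n_models += 1

    net.set_weights(w_swa)            # load the averaged weights into the net
    # averaging invalidates the stored batch-norm statistics, so re-estimate
    # them with one pass over the training data (see the earlier sketch)
    recompute_bn_statistics(net, training_batches)
    return net
```

The paper recommends a constant or cyclical learning rate while the averaging is active, but that choice is orthogonal to this sketch.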