Open kiudee opened 6 years ago
This makes a lot of sense to me. Nice! Let's do it!
Let's do it once our training window contains solely V2 chunks. Then I can train the same net in parallel.
The paper says you may need to run the resulting net over the training set in order to recompute the batch norm statistics as well. In LZ it seems that this is not strictly necessary, though.
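For illustration, here is a minimal numpy sketch of that statistics pass, assuming a hypothetical net interface (`forward_collect_bn` and `set_bn_statistics` are invented for this sketch and are not part of lc0 or any particular framework):

```python
import numpy as np

def recompute_bn_statistics(net, training_batches):
    """Re-estimate batch-norm running mean/variance for the averaged net.

    net.forward_collect_bn(batch) is assumed to run a forward pass and
    return {layer_name: (N, channels) array} of pre-normalization
    activations; net.set_bn_statistics(...) stores the new statistics.
    Both are hypothetical stand-ins for the real training code.
    """
    sums, sq_sums, counts = {}, {}, {}
    for batch in training_batches:
        for name, acts in net.forward_collect_bn(batch).items():
            sums[name] = sums.get(name, 0.0) + acts.sum(axis=0)
            sq_sums[name] = sq_sums.get(name, 0.0) + (acts ** 2).sum(axis=0)
            counts[name] = counts.get(name, 0) + acts.shape[0]
    for name in sums:
        mean = sums[name] / counts[name]
        var = sq_sums[name] / counts[name] - mean ** 2
        net.set_bn_statistics(name, mean, var)
```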
Since SWA was successful for Leela Zero in producing stronger network weights (see https://github.com/gcp/leela-zero/issues/814, https://github.com/gcp/leela-zero/issues/1030), I want to record this as a possible improvement here.
What is Stochastic Weight Averaging?
Izmailov et al. (2018) discovered that SGD explores regions of the weight space where networks with good performance lie, but does not converge to the central point of such a region. By tracking a running average of the weights along the SGD trajectory, they were able to find better solutions than those found by SGD alone. They also demonstrate that SWA leads to solutions in wider optima, which is conjectured to be important for generalization.
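Concretely, the running average is just an incremental mean over weight snapshots. A minimal numpy sketch (`w_swa` and `w` are flattened weight vectors, `n_models` counts how many snapshots have been averaged so far; the names are my own, not from the paper or lc0):

```python
import numpy as np

def swa_update(w_swa: np.ndarray, w: np.ndarray, n_models: int) -> np.ndarray:
    """Fold the current SGD weights into the running SWA average."""
    return (w_swa * n_models + w) / (n_models + 1)
```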
For a comparison of SWA and SGD with a ResNet-110 on CIFAR-100, see the corresponding figure in Izmailov et al. (2018).
Implementation
The implementation is straightforward, because the only thing we need to do is maintain a running average of the weights in addition to the current weight vector. Since we use batch normalization, we also need to recompute the running means and variances for the resulting network.
The full algorithm can be seen in the paper (Algorithm 1 of Izmailov et al., 2018); a rough sketch is given below.
The authors recommend starting from a pretrained model before beginning to average the weights. We get this for free, since we always initialize with the last best network.
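Putting the pieces together, here is a rough end-to-end sketch under the same assumptions as above (`load_last_best_network`, `train_one_cycle`, `get_weights`/`set_weights` are hypothetical stand-ins, not the actual lczero-training pipeline; `swa_update` and `recompute_bn_statistics` are the sketches from earlier in this thread):

```python
def train_with_swa(num_cycles: int, swa_start: int, training_batches):
    """Sketch of SGD training with a parallel SWA weight average."""
    net = load_last_best_network()    # pretrained start; we get this for free
    w_swa = net.get_weights()         # flattened numpy weight vector (hypothetical)
    n_models = 0

    for cycle in range(num_cycles):
        train_one_cycle(net)          # ordinary SGD updates for one cycle/epoch
        if cycle >= swa_start:
            # fold the current weights into the running SWA average
            w_swa = swa_update(w_swa, net.get_weights(), n_models)
            n_models += 1

    net.set_weights(w_swa)            # load the averaged weights into the net
    # averaging invalidates the stored batch-norm statistics, so re-estimate
    # them with one pass over the training data (see the earlier sketch)
    recompute_bn_statistics(net, training_batches)
    return net
```

The paper recommends a constant or cyclical learning rate while the averaging is active, but that choice is orthogonal to this sketch.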