shindavid / AlphaZeroArcade


Stochastic weight averaging #28

Open shindavid opened 1 year ago

shindavid commented 1 year ago

The KataGo paper has the following:

> Every roughly 250,000 training samples, a snapshot of the weights is saved, and every four snapshots, a new candidate neural net is produced by taking an exponential moving average of snapshots with decay = 0.75 (averaging four snapshots of lookback)

Implement this and run experiments that demonstrate its value. If the experiments are inconclusive, ask David Wu.
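
For reference, a minimal sketch of what that snapshot EMA could look like, operating on saved `state_dict`s. The function name and the decision to copy (rather than average) integer buffers are my own choices; the paper only specifies the decay and the number of snapshots.

```python
import copy

import torch


def ema_of_snapshots(snapshots, decay=0.75):
    """Combine consecutive weight snapshots into one candidate net.

    `snapshots` is a list of state_dicts ordered oldest-to-newest; decay=0.75
    with four snapshots mirrors the scheme quoted above.
    """
    avg = copy.deepcopy(snapshots[0])
    for snap in snapshots[1:]:
        for key in avg:
            if torch.is_floating_point(avg[key]):
                # Standard EMA step: new_avg = d * old_avg + (1 - d) * newest.
                avg[key] = decay * avg[key] + (1.0 - decay) * snap[key]
            else:
                # Integer buffers (e.g. BatchNorm's num_batches_tracked) are
                # taken from the newest snapshot rather than averaged.
                avg[key] = snap[key].clone()
    return avg
```

The candidate net would then be produced by loading the result, e.g. `candidate.load_state_dict(ema_of_snapshots(last_four_snapshots))`.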

shindavid commented 10 months ago

I experimented a bit with SWA, using an EMA with a constant learning rate rather than the exact KataGo methodology. My work is in the `swa` branch.

On a per-generation basis, this did not make learning worse, but it wasn't clearly better either. Measured by overall runtime, however, it was clearly worse: each generation took significantly longer because I had to make an extra pass through the dataset to recalibrate the network's batch-normalization layers (via `torch.optim.swa_utils.update_bn()`). Without the `update_bn()` call, network quality was quite clearly worse.
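
For concreteness, here is a self-contained sketch of the kind of loop I mean, with a toy network standing in for the real one. The network, data, decay value, and number of generations are all illustrative.

```python
import torch
import torch.nn as nn
from torch.optim.swa_utils import AveragedModel, update_bn
from torch.utils.data import DataLoader, TensorDataset

# Toy network with a batch-norm layer, standing in for the real net.
model = nn.Sequential(nn.Linear(8, 16), nn.BatchNorm1d(16), nn.ReLU(), nn.Linear(16, 1))
opt = torch.optim.SGD(model.parameters(), lr=1e-2)
loss_fn = nn.MSELoss()

data = TensorDataset(torch.randn(256, 8), torch.randn(256, 1))
loader = DataLoader(data, batch_size=32)

decay = 0.75  # illustrative; the swa branch used a constant-learning-rate EMA
ema_model = AveragedModel(
    model,
    avg_fn=lambda ema_p, p, num_averaged: decay * ema_p + (1.0 - decay) * p,
)

for generation in range(4):
    for x, y in loader:
        opt.zero_grad()
        loss_fn(model(x), y).backward()
        opt.step()
    # Fold the current weights into the running EMA once per generation.
    ema_model.update_parameters(model)

# The extra pass described above: recompute the batch-norm running statistics
# so they match the averaged weights instead of the stale copies taken when
# ema_model was constructed.  Skipping this clearly hurt network quality.
update_bn(loader, ema_model)
```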

I think I can avoid the separate `update_bn()` call by maintaining the batch-normalization layer stats "online" during the main forward pass. However, the lack of clear improvement even on a per-generation basis discouraged me from going further down this path.
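
One possible shortcut, if we stay within `torch.optim.swa_utils`: recent PyTorch versions accept `use_buffers=True` on `AveragedModel`, which folds the batch-norm buffers into the same averaging step as the weights and, per the PyTorch docs, removes the need for a separate `update_bn()` pass. It averages the buffers rather than re-estimating them during the main forward pass, so it is not exactly the "online" scheme above, and I haven't checked how a custom `avg_fn` interacts with integer buffers like `num_batches_tracked`, so treat this as a starting point only.

```python
import torch.nn as nn
from torch.optim.swa_utils import AveragedModel

# Any network with batch-norm layers; same toy model as above.
model = nn.Sequential(nn.Linear(8, 16), nn.BatchNorm1d(16), nn.ReLU(), nn.Linear(16, 1))

decay = 0.75  # illustrative value

# use_buffers=True averages the BN running statistics (buffers) alongside the
# weights on every update_parameters() call, avoiding the extra dataset pass.
ema_model = AveragedModel(
    model,
    avg_fn=lambda ema_p, p, num_averaged: decay * ema_p + (1.0 - decay) * p,
    use_buffers=True,
)
```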

It's quite possible that one of the following is true:

  1. I have a conceptual misunderstanding.
  2. I have a bug.
  3. Experimenting further with parameters or with non-constant learning rate schedules would work (see the sketch below).
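
On point 3: `torch.optim.swa_utils` also ships `SWALR`, a scheduler that anneals the learning rate toward a fixed SWA learning rate and then holds it there, intended to be stepped once averaging has begun. A minimal sketch, with arbitrary `swa_lr` and `anneal_epochs` values:

```python
import torch
import torch.nn as nn
from torch.optim.swa_utils import SWALR

model = nn.Linear(8, 1)  # stand-in for the real network
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

# Anneal the LR toward swa_lr over anneal_epochs, then hold it constant.
swa_scheduler = SWALR(optimizer, swa_lr=0.05, anneal_epochs=5, anneal_strategy="cos")

for epoch in range(10):
    # ... usual training steps on `model`, then fold its weights into the EMA ...
    swa_scheduler.step()
```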