The default network architecture (ResNet) seems to have an odd bias towards strong activations close to the edge of the board, as can be observed in the diagram below:
This behavior makes sense. Consider a set of weights W that are normally distributed, and a positive input x:

y = W · [x₁ x₂ x₃ x₄ x₅ x₆ x₇ x₈ x₉]ᵀ

If W ∈ N(μ, σ) and μ < 0, then along the edge, where the zero-padding forces x₇, x₈, and x₉ to zero:

W · [x₁ x₂ x₃ x₄ x₅ x₆ x₇ x₈ x₉]ᵀ ≤ W · [x₁ x₂ x₃ x₄ x₅ x₆ 0 0 0]ᵀ
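A minimal numpy/scipy sketch of the effect (purely illustrative; the kernel statistics are made up and the negative mean is exaggerated for clarity, not taken from the actual network):

```python
import numpy as np
from scipy.signal import correlate2d

rng = np.random.default_rng(0)

# 3x3 kernel drawn from a normal distribution with a (deliberately exaggerated)
# negative mean, mimicking the convolution weight statistics reported below.
kernel = rng.normal(loc=-0.1, scale=0.05, size=(3, 3))

# A positive input plane (e.g. post-ReLU activations); 9x9 purely for illustration.
x = np.ones((9, 9))

# Zero padding is what makes the edge special: the missing neighbours contribute
# 0 instead of a negatively weighted positive value.
y = correlate2d(x, kernel, mode='same', boundary='fill', fillvalue=0)

interior = y[1:-1, 1:-1]
border = np.concatenate([y[0], y[-1], y[1:-1, 0], y[1:-1, -1]])

print('interior mean:', interior.mean())  # ~ kernel.sum(), i.e. more negative
print('border mean:  ', border.mean())    # larger, so the activation is biased upwards at the edge
```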
This hypothesis can be confirmed by inspecting the final weights and checking the mean of each convolution layer:
{
"01_upsample/conv_1": {
"mean": -0.0003626140533015132,
"std": 0.00519569544121623
},
"02_residual/conv_1": {
"mean": -0.00011233328405069187,
"std": 0.002601742511615157
},
"02_residual/conv_2": {
"mean": -0.00011419910879340023,
"std": 0.0026016614865511656
},
"03_residual/conv_1": {
"mean": -0.00010304508032277226,
"std": 0.0026021271478384733
},
"03_residual/conv_2": {
"mean": -6.648160342592746e-05,
"std": 0.002603317843750119
},
"04_residual/conv_1": {
"mean": -0.0003849344211630523,
"std": 0.0025755600072443485
},
"04_residual/conv_2": {
"mean": 0.00016120942018460482,
"std": 0.002599172294139862
},
"05_residual/conv_1": {
"mean": 0.00021969537192489952,
"std": 0.002594882855191827
},
"05_residual/conv_2": {
"mean": 0.00045546123874373734,
"std": 0.002564028138294816
},
"06_residual/conv_1": {
"mean": -0.00022857918520458043,
"std": 0.0025941154453903437
},
"06_residual/conv_2": {
"mean": 0.00020613202650565654,
"std": 0.0025959957856684923
},
"07_residual/conv_1": {
"mean": -0.0020295707508921623,
"std": 0.001631724531762302
},
"07_residual/conv_2": {
"mean": -2.099659468512982e-05,
"std": 0.002604082226753235
},
"08_residual/conv_1": {
"mean": -0.0021902318112552166,
"std": 0.001408747280947864
},
"08_residual/conv_2": {
"mean": 7.066841476444097e-07,
"std": 0.002604166977107525
},
"09_residual/conv_1": {
"mean": -0.0021506529301404953,
"std": 0.0014684603083878756
},
"09_residual/conv_2": {
"mean": -2.3711704670859035e-06,
"std": 0.002604165580123663
},
"10_residual/conv_1": {
"mean": -0.0015733279287815094,
"std": 0.002075168304145336
},
"10_residual/conv_2": {
"mean": -0.00012044789764331654,
"std": 0.0026013797614723444
},
"11p_policy/conv_1": {
"mean": -0.007533475290983915,
"std": 0.06204431504011154
},
"11p_policy/linear_1": {
"mean": -0.01642797514796257,
"std": 0.038641564548015594
},
"11v_value/conv_1": {
"mean": -0.0014275580178946257,
"std": 0.08837682008743286
},
"11v_value/linear_1": {
"mean": 0.00015922913735266775,
"std": 0.0526527501642704
},
"11v_value/linear_2": {
"mean": -0.010312804952263832,
"std": 0.06787863373756409
}
}
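For reference, a hedged sketch of how such a per-layer summary could be produced from a TensorFlow checkpoint (the checkpoint path and the name filter are placeholders, not the project's actual paths):

```python
import json
import numpy as np
import tensorflow as tf

CHECKPOINT = 'models/dream-go.ckpt'  # placeholder path

reader = tf.train.load_checkpoint(CHECKPOINT)
stats = {}

for name, _shape in tf.train.list_variables(CHECKPOINT):
    if 'conv' in name or 'linear' in name:   # only the weight tensors of interest
        weights = reader.get_tensor(name)
        stats[name] = {
            'mean': float(np.mean(weights)),
            'std': float(np.std(weights)),
        }

print(json.dumps(stats, indent=4, sort_keys=True))
```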
As can be seen, most of the convolution layers have a negative mean.
The current neural network architecture has a systematic bias towards the edge of the board. This problem is presumably exacerbated by the fact that we clip activations to the range [0, 6].
Some approaches to solving the problem above immediately spring to mind:
Neither of these approaches is supported by cuDNN when running with the NCHW_VECT_C layout and the INT8x4 data type.
Neural style transfer research has encountered similar edge artifacts, with a similar cause, but reached no conclusion beyond using super-resolution.
Even if it might turn out to be tricky to implement said algorithms in cuDNN, we can still train the architectures in TensorFlow and verify whether they result in a lower loss:
As one can observe, they are effectively equivalent, with the difference being well within the margin of error.
An alternative attack vector is to try to prevent the weights from developing a non-zero mean in the first place, at which point the zero-padding no longer matters.
The most likely culprit for the negative mean of the weights is the residual connection, which in the AlphaZero architecture looks like this, where Rₖ is the output of a residual block and Lₖ₋₁ is the output of the previous residual connector:

Lₖ = Lₖ₋₁ + Rₖ,  where Rₖ ∈ N(0, 1)
Note that if Lₖ₋₁ has a non-zero variance then the variance of Lₖ will increase with k. This would normally not be a huge problem; however, since we clip our activations to six, the optimizer must continuously decrease the mean of Lₖ to maintain the expressiveness of the network.
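A quick numerical illustration of this growth (the number of blocks and the sample size are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
samples = 100_000

L = np.zeros(samples)                    # L_0
for k in range(1, 11):
    R = rng.normal(0.0, 1.0, samples)    # R_k ~ N(0, 1)
    L = L + R                            # L_k = L_{k-1} + R_k
    print(f'k={k:2d}  var(L_k)={L.var():.2f}')  # grows roughly linearly with k
```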
A solution to this is to instead interpolate between Lₖ₋₁ and Rₖ when computing the residual connector, i.e. Lₖ = α Lₖ₋₁ + (1 - α) Rₖ (a minimal sketch of such a block follows the list below). A few interpretations for different values of α:

- α = 0.5 - This is the current architecture without the exploding activations.
- α ≠ 0.5 - This is what is called a highway network; a common value is α = 0.9.
- We could also make α a trainable parameter.
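A minimal TensorFlow sketch of such an interpolated residual block (hypothetical; the layer sizes and the use of tf.keras are illustrative, not the project's actual implementation):

```python
import tensorflow as tf

def interpolated_residual_block(x, filters=128, alpha=0.5):
    """L_k = alpha * L_{k-1} + (1 - alpha) * R_k, with activations clipped to [0, 6]."""
    r = tf.keras.layers.Conv2D(filters, 3, padding='same', use_bias=False)(x)
    r = tf.keras.layers.BatchNormalization()(r)
    r = tf.keras.layers.ReLU(max_value=6.0)(r)
    r = tf.keras.layers.Conv2D(filters, 3, padding='same', use_bias=False)(r)
    r = tf.keras.layers.BatchNormalization()(r)

    # Interpolated residual connector instead of a plain sum; with alpha = 0.5
    # the variance of L_k stays bounded instead of growing with k. For a
    # trainable alpha this constant would be replaced by a tf.Variable,
    # e.g. inside a custom layer.
    y = alpha * x + (1.0 - alpha) * r
    return tf.keras.layers.ReLU(max_value=6.0)(y)
```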
This change seems to have a positive effect on the final loss of the training:
Network | Value Loss | Policy Loss |
---|---|---|
Baseline | 0.8162 | 1.936 |
α = 0.5 | 0.7942 | 1.887 |
It also seems to improve the actual playing strength of the engine:
dg-v060.per-channel-offset v dg-v060.highway (38/500 games)
unknown results: 9 23.68%
board size: 19   komi: 7.5
                              wins               black           white          avg cpu
dg-v060.per-channel-offset    10 26.32%           4 21.05%        6 31.58%       173.60
dg-v060.highway               19 50.00%           9 47.37%       10 52.63%       139.36
                                                 13 34.21%       16 42.11%
When looking at the activations after introducing the α interpolation, the border problem seems to have been largely resolved. But an unintended side effect of making sure that Lₖ ∈ N(0, 1) is that the quantization resolution became worse, which has negative effects on the playing strength.
If we want to preserve 99.9% of all values, assuming the values follow a normal distribution with a mean of 0 and a variance of 1, then we should clip the activations at 3.09023 (the one-sided 99.9% quantile). This is a far cry from 6.0, and shows that we are effectively wasting half of the quantized range.
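The 3.09023 figure is just the 99.9% quantile of the standard normal distribution, which can be checked with a one-liner (scipy used here purely for the check):

```python
from scipy.stats import norm

clip = norm.ppf(0.999)      # ~3.0902, the value below which 99.9% of N(0, 1) falls
print(clip)
print(clip / 6.0)           # ~0.515, i.e. only about half of the [0, 6] range is used
```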
This inefficient use of the quantized range presumably also had negative effects before introducing α, since the early parts of the architecture were not using the entire range, resulting in low resolution during the initial parts of the inference.
We should investigate the internal features of the tower representation. Doing this should provide several benefits: