The default network architecture (ResNet) seems to have an odd bias towards strong activations close to the edge of the board, as can be observed in the diagram below:
This behavior makes sense. Consider a set of weights W that are normally distributed, and a positive input x:

y = W · [x₁ x₂ x₃ x₄ x₅ x₆ x₇ x₈ x₉]ᵀ

If W ∈ N(μ, σ) and μ < 0, then along the edge, where the zero-padding forces x₇, x₈, and x₉ to zero:

W · [x₁ x₂ x₃ x₄ x₅ x₆ x₇ x₈ x₉]ᵀ ≤ W · [x₁ x₂ x₃ x₄ x₅ x₆ 0 0 0]ᵀ
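A minimal numpy/scipy sketch of the effect (purely illustrative; the kernel statistics are made up and the negative mean is exaggerated for clarity, not taken from the actual network):

```python
import numpy as np
from scipy.signal import correlate2d

rng = np.random.default_rng(0)

# 3x3 kernel drawn from a normal distribution with a (deliberately exaggerated)
# negative mean, mimicking the convolution weight statistics reported below.
kernel = rng.normal(loc=-0.1, scale=0.05, size=(3, 3))

# A positive input plane (e.g. post-ReLU activations); 9x9 purely for illustration.
x = np.ones((9, 9))

# Zero padding is what makes the edge special: the missing neighbours contribute
# 0 instead of a negatively weighted positive value.
y = correlate2d(x, kernel, mode='same', boundary='fill', fillvalue=0)

interior = y[1:-1, 1:-1]
border = np.concatenate([y[0], y[-1], y[1:-1, 0], y[1:-1, -1]])

print('interior mean:', interior.mean())  # ~ kernel.sum(), i.e. more negative
print('border mean:  ', border.mean())    # larger, so the activation is biased upwards at the edge
```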
This hypothesis can be confirmed by inspecting the final weights and checking the mean of each convolution layer:
{
"01_upsample/conv_1": {
"mean": -0.0003626140533015132,
"std": 0.00519569544121623
},
"02_residual/conv_1": {
"mean": -0.00011233328405069187,
"std": 0.002601742511615157
},
"02_residual/conv_2": {
"mean": -0.00011419910879340023,
"std": 0.0026016614865511656
},
"03_residual/conv_1": {
"mean": -0.00010304508032277226,
"std": 0.0026021271478384733
},
"03_residual/conv_2": {
"mean": -6.648160342592746e-05,
"std": 0.002603317843750119
},
"04_residual/conv_1": {
"mean": -0.0003849344211630523,
"std": 0.0025755600072443485
},
"04_residual/conv_2": {
"mean": 0.00016120942018460482,
"std": 0.002599172294139862
},
"05_residual/conv_1": {
"mean": 0.00021969537192489952,
"std": 0.002594882855191827
},
"05_residual/conv_2": {
"mean": 0.00045546123874373734,
"std": 0.002564028138294816
},
"06_residual/conv_1": {
"mean": -0.00022857918520458043,
"std": 0.0025941154453903437
},
"06_residual/conv_2": {
"mean": 0.00020613202650565654,
"std": 0.0025959957856684923
},
"07_residual/conv_1": {
"mean": -0.0020295707508921623,
"std": 0.001631724531762302
},
"07_residual/conv_2": {
"mean": -2.099659468512982e-05,
"std": 0.002604082226753235
},
"08_residual/conv_1": {
"mean": -0.0021902318112552166,
"std": 0.001408747280947864
},
"08_residual/conv_2": {
"mean": 7.066841476444097e-07,
"std": 0.002604166977107525
},
"09_residual/conv_1": {
"mean": -0.0021506529301404953,
"std": 0.0014684603083878756
},
"09_residual/conv_2": {
"mean": -2.3711704670859035e-06,
"std": 0.002604165580123663
},
"10_residual/conv_1": {
"mean": -0.0015733279287815094,
"std": 0.002075168304145336
},
"10_residual/conv_2": {
"mean": -0.00012044789764331654,
"std": 0.0026013797614723444
},
"11p_policy/conv_1": {
"mean": -0.007533475290983915,
"std": 0.06204431504011154
},
"11p_policy/linear_1": {
"mean": -0.01642797514796257,
"std": 0.038641564548015594
},
"11v_value/conv_1": {
"mean": -0.0014275580178946257,
"std": 0.08837682008743286
},
"11v_value/linear_1": {
"mean": 0.00015922913735266775,
"std": 0.0526527501642704
},
"11v_value/linear_2": {
"mean": -0.010312804952263832,
"std": 0.06787863373756409
}
}
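For reference, a hedged sketch of how such a per-layer summary could be produced from a TensorFlow checkpoint (the checkpoint path and the name filter are placeholders, not the project's actual paths):

```python
import json
import numpy as np
import tensorflow as tf

CHECKPOINT = 'models/dream-go.ckpt'  # placeholder path

reader = tf.train.load_checkpoint(CHECKPOINT)
stats = {}

for name, _shape in tf.train.list_variables(CHECKPOINT):
    if 'conv' in name or 'linear' in name:   # only the weight tensors of interest
        weights = reader.get_tensor(name)
        stats[name] = {
            'mean': float(np.mean(weights)),
            'std': float(np.std(weights)),
        }

print(json.dumps(stats, indent=4, sort_keys=True))
```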
As can be seen, most of the convolution layers have a negative mean.
The current neural network architecture has a systematic bias towards the edge of the board. This problem is presumably exacerbated by the fact that we clip activations to the range [0, 6].
Some approaches to solving the problem above immediately spring to mind:
Neither of these approaches is supported by cuDNN when running with the NCHW_VECT_C layout and the INT8x4 data type.
Neural style transfer research has encountered similar edge artifacts, with a similar cause, but reached no conclusion beyond using super-resolution.
Even if it might turn out to be tricky to implement said algorithms in cuDNN, we can still train the architectures in TensorFlow and verify whether they result in a lower loss:
As one can observe, they are effectively equivalent, with the difference being well within the margin of error.
An alternative attack vector is to try to prevent the weights from developing a non-zero mean in the first place, at which point the zero-padding no longer matters.
The most likely culprit for the negative mean of the weights is the residual connection, which in the AlphaZero architecture looks like this, where Rₖ is the output of a residual block and Lₖ₋₁ is the output of the previous residual connector:

Lₖ = Lₖ₋₁ + Rₖ,  where Rₖ ∈ N(0, 1)
Note that if Lₖ₋₁ has a non-zero variance then the variance of Lₖ will increase with k. This would normally not be a huge problem; however, since we clip our activations to six, the optimizer must continuously decrease the mean of Lₖ to maintain the expressiveness of the network.
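A quick numerical illustration of this growth (the number of blocks and the sample size are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
samples = 100_000

L = np.zeros(samples)                    # L_0
for k in range(1, 11):
    R = rng.normal(0.0, 1.0, samples)    # R_k ~ N(0, 1)
    L = L + R                            # L_k = L_{k-1} + R_k
    print(f'k={k:2d}  var(L_k)={L.var():.2f}')  # grows roughly linearly with k
```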
A solution to this is to instead interpolate between Lₖ₋₁ and Rₖ when computing the residual connector, i.e. Lₖ = α Lₖ₋₁ + (1 - α) Rₖ (a minimal sketch of such a block follows the list below). A few interpretations for different values of α:

- α = 0.5 - This is the current architecture without the exploding activations.
- α ≠ 0.5 - This is what is called a highway network; a common value is α = 0.9.
- We could also make α a trainable parameter.
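A minimal TensorFlow sketch of such an interpolated residual block (hypothetical; the layer sizes and the use of tf.keras are illustrative, not the project's actual implementation):

```python
import tensorflow as tf

def interpolated_residual_block(x, filters=128, alpha=0.5):
    """L_k = alpha * L_{k-1} + (1 - alpha) * R_k, with activations clipped to [0, 6]."""
    r = tf.keras.layers.Conv2D(filters, 3, padding='same', use_bias=False)(x)
    r = tf.keras.layers.BatchNormalization()(r)
    r = tf.keras.layers.ReLU(max_value=6.0)(r)
    r = tf.keras.layers.Conv2D(filters, 3, padding='same', use_bias=False)(r)
    r = tf.keras.layers.BatchNormalization()(r)

    # Interpolated residual connector instead of a plain sum; with alpha = 0.5
    # the variance of L_k stays bounded instead of growing with k. For a
    # trainable alpha this constant would be replaced by a tf.Variable,
    # e.g. inside a custom layer.
    y = alpha * x + (1.0 - alpha) * r
    return tf.keras.layers.ReLU(max_value=6.0)(y)
```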
This change seems to have a positive effect on the final loss of the training:
Network | Value Loss | Policy Loss |
---|---|---|
Baseline | 0.8162 | 1.936 |
α = 0.5 | 0.7942 | 1.887 |
It also seems to improve the actual playing strength of the engine:
dg-v060.per-channel-offset v dg-v060.highway (38/500 games)
unknown results: 9 23.68%
board size: 19   komi: 7.5
                              wins               black           white          avg cpu
dg-v060.per-channel-offset    10 26.32%           4 21.05%        6 31.58%       173.60
dg-v060.highway               19 50.00%           9 47.37%       10 52.63%       139.36
                                                 13 34.21%       16 42.11%
When looking at the activations after introducing the α interpolation, the border problem seems to have been largely resolved. But an unintended side effect of making sure that Lₖ ∈ N(0, 1) is that the quantization resolution became worse, which has negative effects on the playing strength.
If we want to preserve 99.9% of all values, assuming the values follow a normal distribution with a mean of 0 and a variance of 1, then we should clip the activations at 3.09023 (the one-sided 99.9% quantile). This is a far cry from 6.0, and shows that we are effectively wasting half of the quantized range.
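The 3.09023 figure is just the 99.9% quantile of the standard normal distribution, which can be checked with a one-liner (scipy used here purely for the check):

```python
from scipy.stats import norm

clip = norm.ppf(0.999)      # ~3.0902, the value below which 99.9% of N(0, 1) falls
print(clip)
print(clip / 6.0)           # ~0.515, i.e. only about half of the [0, 6] range is used
```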
This inefficient use of the quantized range presumably also had negative effects before introducing α, since the early parts of the architecture were not using the entire range, resulting in low resolution during the initial parts of the inference.
We should investigate the internal features of the tower representation. Doing this should provide several benefits: