Investigate @lightvector neural network enhancements

kblomdahl / dream-go

Artificial go player based on reinforcement and supervised learning

Apache License 2.0

47 stars 8 forks source link

Investigate @lightvector neural network enhancements #25

Closed kblomdahl closed 5 years ago

kblomdahl commented 6 years ago

Check out some of ideas mentioned here to enhance the neural network:

@lightvector https://github.com/lightvector/GoNN

Initial thoughts on the concepts without any further research to back it up:

Global Pooled Properties This should be easy to implement and makes a lot of sense. Unclear if we want to add them early, before each residual block, or only in the policy and value head. Need to experiment a bit with this to tell.
Parametric ReLUs Unfortunately there is no support for PReLU in cuDNN so we cannot do this. Otherwise I've had similar experiences a long time ago when I was playing around with neural network architectures for Go where PReLU is just better. I used to experiment with other activation functions, like selu, but they were not as good.
Chain Pooling This probably gives worse results since the pooling destroy the local shape information. Might be interesting to do a dense block approach where each residual blocks becomes a DenseNet [1]. This way each residual block would benefit from both the pooling and the local shape:
1. Given input x compute the the chain pooling of each channel and store the result in yₛ.
2. Concatenate x and yₛ channel-wise into yₓₛ (so yₓₛ has shape [256, 19, 19]).
3. y₁ <- relu(C(W₁, yₓₛ) + b₁) (y₁ has shape [128, 19, 19])
4. Concatenate y₁ and yₛ channel-wise into y₁ₛ (so y₁ₛ has shape [256, 19, 19]).
5. y₂ <- relu(C(W₂, y₁) + x + b₂)
6. return y₂
I suspect this is too expensive to do in practice (no good support for it in cuDNN), but it is a very interesting idea.

kblomdahl commented 6 years ago

All these enhancements are for the policy network, which is generally doing pretty well even without these tweaks. I would be more interesting in seeing if this improves the value head, in which you can observe the following issue consistently:

It has issues identifying whether large dragons are alive or dead.

Following the logic posted in lightvector's repository one could reasonably expect Parametric ReLUs and Global Pooling Properties to have little effect on this problem. However Chain Pooling might greatly help with this issue.

To help with the performance impact, one could limit the chain pooling to the final convolutional layer in the value head with input x:

y₀ ← max chain pooling of x
y₁ ← relu(bn(C(W₁, [x y₀])) + b₁)
y₂ ← relu(W₂ y₁ + b₂)
y₃ ← tanh(W₃ y₂ + b₃)

This is only one layer of chain pooling so we expect it to have issues identifying problems with loosely connected groups or certain types of false eyes. But it would probably help in most situations.

To implement this we would have to do a custom CUDA kernel, which could probably be done relatively easily using some flood filling technique:

int global_index = threadIdx.x;

do {
    if (global_index == 0) {
        *is_dirty = 0;
    }

    __syncthreads();  // barrier

    for (int i = 0; i < batch_size; ++i) {
        for (int c = 0; c < num_channels; ++c) {
            const float original = data[c, global_index];

            if (chain[global_index] == chain[N[global_index]])
                data[i, c, global_index] = max(data[i, c, global_index], data[i, c, N[global_index]]);
            if (chain[global_index] == chain[E[global_index]])
                data[i, c, global_index] = max(data[i, c, global_index], data[i, c, E[global_index]]);
            if (chain[global_index] == chain[S[global_index]])
                data[i, c, global_index] = max(data[i, c, global_index], data[i, c, S[global_index]]);
            if (chain[global_index] == chain[W[global_index]])
                data[i, c, global_index] = max(data[i, c, global_index], data[i, c, W[global_index]]);

            if (original < data[c, global_index])
                *is_dirty = 1;
        }
    }

    __syncthreads();  // barrier
} while (*is_dirty > 0);

kblomdahl commented 6 years ago

Lightvector updated his blog with some more result, of especial interest is the fact that adding dilation has a similar effect as adding chain pooling [1] but since you only apply dilation over some of the channel local shape information is not lost. This is very promising since dilation is built into the convolutional operator in most frameworks, including cuDNN.

The problem is that according to the cuDNN documentation, only the ImplicitPrecompGemm algorithm supports dilation > 1, and based on previous benchmarks Winograd provides a 4-5x performance improvement. If we want to run two parallel convolutions, one with a dilation of 1 (normal convolution), and one with a dilation of 2 (or larger) with no performance loss then the later filter must be 4-5 times smaller than the previous. So if we want a total of c input and output channels, then we can only reserve c / 5.5 channels for dilation:

For some common filter sizes this gives (rounded down to the closest multiple of 8 for SIMD purposes). We want some balance between local and global shape anyway, so this might provide a good mixture between the two:

Channels	Normal	Dilation
128	112	16
192	160	32
256	216	40

Lightvectors observations about history [2] is also very interesting, not because of the reason he mentions. But because his suggestion to zero out the history channels randomly can act as a training data augmentation, which should help with potential overfitting.

I am particularly interested in this because I've observed the same behaviour he cites, where the neural network learns sequences of moves instead of judging each individual board position separately. It is understandable why it does so (humans does this too), but it is not a desirable property and I've been considering getting ride of the history features completely to avoid this. Unfortunately the history features are very important, so this might provide a reasonable in-between.

One could even take this a step further, where if one were to use one-hot encodings of the history planes, then one could shuffle the history planes (with care, to avoid illegal board positions) in order to provide a sort of tewari-like effect.

[1] https://github.com/lightvector/GoNN#dilated-convolutions-mar-2018 [2] https://github.com/lightvector/GoNN#some-thoughts-about-history-as-an-input-mar-2018

kblomdahl commented 6 years ago

With a naive cuDNN implementation the performance hit is quite significant, even when computing the normal and dilated convolution in parallel the neural network with dilation is about 36% slower.

However even when using a network that was only trained for a few hours it succeeds one of the dead dragon tests (and in the other two the neural network is less certain about who is winning than it has historically been). No other networks has managed this to date, so even without any performance improvements I might be able to squeeze out dilation is probably worth it:

running 7 tests
test ladder_2 ... ok
test ladder_3 ... ok
test ladder_1 ... FAILED
test dead_dragon_1 ... ok
test dead_dragon_2 ... FAILED
test dead_dragon_3 ... FAILED
test end_1 ... ok

For reference, this network trained for 5,901 steps (using a batch size of 512) and achieved a 40.27% policy accuracy and a 59.96% value accuracy after about 1 hour and 17 minutes of training. Based on previous experience these numbers should improve significantly with more training.

Output from nvprof of my implementation. Some observations when comparing to a profile from before we started using dilations:

In the API call section cudaLaunch has increased from 6.94% to 12.01%, suggesting the overhead associated with launching kernels is a problem. Maybe it is time to write a fused kernel for each residual block.
We spent about the same amount of time computing the winograd calculations in the two profilings (22.9687s vs 22.1559s). This suggests that we did not properly adjust the number of channels to account for the tile size of winograd.
The concatenations adds about 2.78% overhead, which is acceptable even if we would prefer to avoid it if possible.

==3041== Profiling result:
Time(%)      Time     Calls       Avg       Min       Max  Name
 45.02%  22.1559s     83793  264.41us  26.880us  8.1167ms  maxwell_scudnn_winograd_128x128_ldg1_ldg4_relu_tile418n_nt
 28.89%  14.2193s     48654  292.25us  137.64us  4.0314ms  void cudnn::detail::implicit_convolve_sgemm<float, float, int=128, int=5, int=5, int=3, int=3, int=3, int=1, bool=1, bool=0, bool=0>(int, int, int, float const *, int, float*, cudnn::detail::implicit_convolve_sgemm<float, float, int=128, int=5, int=5, int=3, int=3, int=3, int=1, bool=1, bool=0, bool=0>*, kernel_conv_params, int, float, float, int, float, float, int, int)
  6.23%  3.06499s     18909  162.09us  131.03us  1.8853ms  void cudnn::detail::implicit_convolve_sgemm<float, float, int=1024, int=5, int=5, int=3, int=3, int=3, int=1, bool=1, bool=0, bool=0>(int, int, int, float const *, int, float*, cudnn::detail::implicit_convolve_sgemm<float, float, int=1024, int=5, int=5, int=3, int=3, int=3, int=1, bool=1, bool=0, bool=0>*, kernel_conv_params, int, float, float, int, float, float, int, int)
  4.34%  2.13685s     57038  37.463us  13.143us  1.7758ms  maxwell_scudnn_winograd_128x128_ldg1_ldg4_relu_tile148n_nt
  3.90%  1.92040s     72104  26.633us  4.9370us  1.7498ms  void op_tensor_kernel<int=2, float, float, float, int=32, int=1, int=4, int=2, cudnnOpTensorOp_t=2, cudnnNanPropagation_t=0>(cudnnTensorStruct, float*, cudnnTensorStruct, float const *, cudnnTensorStruct, float const *, float, float, float)
  2.90%  1.42483s    140831  10.117us  5.0360us  1.4300ms  void cudnn::winograd::generateWinogradTilesKernel<int=0, float, float>(cudnn::winograd::GenerateWinogradTilesParams<float, float>)
  2.78%  1.36813s     59454  23.011us  6.2130us  1.8411ms  void cudnn::detail::cubeTransposeDeviceGeneric_kernel<float, float, float, int=8, int=8, int=8, int=8, int=8, int=11, int=5>(int, int, int, int, int, int, int, int, int, float, float const *, float*)
  1.12%  549.89ms     64317  8.5490us     621ns  144.96us  [CUDA memcpy HtoD]
  1.05%  514.99ms    105078  4.9010us  3.0140us  469.83us  void op_tensor_kernel<int=2, float, float, float, int=128, int=1, int=1, int=4, cudnnOpTensorOp_t=2, cudnnNanPropagation_t=0>(cudnnTensorStruct, float*, cudnnTensorStruct, float const *, cudnnTensorStruct, float const *, float, float, float)
  0.87%  427.14ms     48028  8.8930us  4.6970us  566.26us  void op_tensor_kernel<int=2, float, float, float, int=64, int=1, int=2, int=4, cudnnOpTensorOp_t=2, cudnnNanPropagation_t=0>(cudnnTensorStruct, float*, cudnnTensorStruct, float const *, cudnnTensorStruct, float const *, float, float, float)
  0.75%  368.82ms     15014  24.565us  15.042us  500.34us  maxwell_scudnn_128x32_relu_interior_nn
  0.69%  338.68ms      8109  41.766us  38.356us  526.71us  void genericTranspose_kernel<float, float, float>(float, cudnnTensorStruct, float const *, float, cudnnTensorStruct, float*)
  0.30%  147.44ms      1802  81.817us  49.605us  115.07us  maxwell_scudnn_128x128_relu_small_nn
  0.18%  90.139ms      7806  11.547us  6.8670us  140.41us  void gemmSN_NN_kernel<float, float, float, int=128, int=2, int=4, int=8, int=4, int=4>(cublasGemmSmallNParams<float, float, float>, float const *, float const *, float, float, int)
  0.18%  90.130ms     22521  4.0020us  2.2430us  9.8300us  void add_tensor_kernel_v3<int=2, float, float, int=16, int=16, int=1, int=16, int=4>(cudnnTensorStruct, float*, cudnnTensorStruct, float const *, float, float)
  0.16%  77.102ms     71600  1.0760us     436ns  129.61us  [CUDA memcpy DtoH]
  0.13%  64.344ms      3903  16.485us  13.473us  411.39us  void gemmSN_NN_kernel<float, float, float, int=256, int=4, int=2, int=8, int=4, int=4>(cublasGemmSmallNParams<float, float, float>, float const *, float const *, float, float, int)
  0.11%  52.684ms      7507  7.0170us  4.2180us  1.3240ms  void cudnn::detail::softmax_fw_kernel_resident<int=2, float, float, int=256, int=1, int=0, int=0, int=32, int=0>(cudnnTensorStruct, float const *, cudnn::detail::softmax_fw_kernel_resident<int=2, float, float, int=256, int=1, int=0, int=0, int=32, int=0>, float*, int, float, float*, int, int)
  0.08%  41.624ms      4202  9.9050us  6.6090us  13.739us  void gemmSN_NN_kernel<float, float, float, int=128, int=2, int=4, int=8, int=2, int=4>(cublasGemmSmallNParams<float, float, float>, float const *, float const *, float, float, int)
  0.07%  35.603ms     16816  2.1170us  1.4370us  4.9380us  cudnn::maxwell::gemm::computeOffsetsKernel(cudnn::maxwell::gemm::ComputeOffsetsParams)
  0.07%  32.421ms      1806  17.951us  7.1290us  26.356us  sgemm_32x32x32_NN
  0.06%  29.867ms      2101  14.215us  13.186us  15.316us  void gemmSN_NN_kernel<float, float, float, int=256, int=4, int=2, int=8, int=2, int=4>(cublasGemmSmallNParams<float, float, float>, float const *, float const *, float, float, int)
  0.05%  23.351ms      7507  3.1100us  2.5180us  158.30us  void cudnn::detail::activation_fw_4d_kernel<float, float, int=128, int=1, int=4, cudnn::detail::tanh_func<float>>(cudnnTensorStruct, float const *, cudnn::detail::activation_fw_4d_kernel<float, float, int=128, int=1, int=4, cudnn::detail::tanh_func<float>>, cudnnTensorStruct*, float, cudnnTensorStruct*, int, cudnnTensorStruct*)
  0.04%  21.188ms      7507  2.8220us  2.3150us  9.3850us  void cudnn::detail::activation_fw_4d_kernel<float, float, int=128, int=1, int=4, cudnn::detail::relu_func<float, cudnnNanPropagation_t=0, bool=0>>(cudnnTensorStruct, float const *, cudnn::detail::activation_fw_4d_kernel<float, float, int=128, int=1, int=4, cudnn::detail::relu_func<float, cudnnNanPropagation_t=0, bool=0>>, cudnnTensorStruct*, float, cudnnTensorStruct*, int, cudnnTensorStruct*)
  0.03%  16.783ms      2703  6.2080us  2.4880us  12.723us  void gemv2N_kernel_val<float, float, float, int=128, int=32, int=4, int=4, int=1>(float, float, cublasGemv2Params_v2<float, float, float>)
  0.00%  4.5730us         7     653ns     513ns     943ns  [CUDA memset]

==3041== API calls:
Time(%)      Time     Calls       Avg       Min       Max  Name
 81.47%  35.8463s    135561  264.43us  5.7860us  1.37762s  cudaMemcpyAsync
 12.01%  5.28529s    743193  7.1110us  3.6370us  1.34247s  cudaLaunch
  1.60%  701.94ms        56  12.535ms  8.0670us  701.29ms  cudaStreamCreateWithFlags
  1.38%  605.11ms    435406  1.3890us     701ns  1.4727ms  cudaEventRecord
  1.02%  450.81ms   4647754      96ns      69ns  323.22us  cudaSetupArgument
  0.59%  259.18ms    285266     908ns     739ns  316.25us  cudaStreamWaitEvent
  0.45%  197.32ms       605  326.14us     743ns  188.05ms  cudaFree
  0.43%  188.54ms     67563  2.7900us  2.5210us  291.60us  cudaBindTexture
  0.42%  183.41ms       391  469.09us  3.3660us  174.74ms  cudaMalloc
  0.21%  91.363ms    743193     122ns      74ns  303.63us  cudaConfigureCall
  0.20%  85.856ms    760009     112ns      67ns  1.4508ms  cudaGetLastError
  0.16%  69.804ms     67563  1.0330us     943ns  287.59us  cudaUnbindTexture
  0.05%  21.321ms     15014  1.4200us     951ns  306.74us  cudaStreamSynchronize
  0.02%  7.0849ms       356  19.901us  3.4990us  359.56us  cudaMemcpy
  0.01%  2.5367ms       147  17.256us  7.9430us  188.10us  cudaStreamCreate
  0.00%  2.0855ms       210  9.9300us  1.5840us  247.39us  cudaStreamDestroy
  0.00%  943.45us       277  3.4050us     254ns  138.42us  cuDeviceGetAttribute
  0.00%  709.33us         7  101.33us  15.754us  603.54us  cudaHostAlloc
  0.00%  467.68us         7  66.811us  7.8000us  405.69us  cudaFreeHost
  0.00%  363.21us       483     751ns     583ns  3.0420us  cudaEventDestroy
  0.00%  269.90us       287     940ns     769ns  5.0480us  cudaEventCreateWithFlags
  0.00%  228.96us         3  76.321us  73.634us  80.725us  cuDeviceTotalMem
  0.00%  202.74us       196  1.0340us     824ns  2.2250us  cudaEventCreate
  0.00%  153.88us       263     585ns     508ns  1.7050us  cudaDeviceGetAttribute
  0.00%  107.14us         3  35.713us  31.368us  40.518us  cuDeviceGetName
  0.00%  66.623us         7  9.5170us  7.7900us  12.701us  cudaStreamCreateWithPriority
  0.00%  48.792us        28  1.7420us  1.6170us  2.1620us  cudaThreadSynchronize
  0.00%  43.996us         7  6.2850us  5.4910us  7.5090us  cudaMemsetAsync
  0.00%  24.781us        21  1.1800us     601ns  1.8960us  cudaGetDevice
  0.00%  15.818us         7  2.2590us  1.8510us  3.6860us  cudaDeviceSynchronize
  0.00%  10.316us         7  1.4730us  1.1910us  2.5770us  cudaHostGetDevicePointer
  0.00%  9.1960us         7  1.3130us  1.0590us  1.5700us  cudaDeviceGetStreamPriorityRange
  0.00%  2.4040us         5     480ns     273ns  1.0940us  cuDeviceGetCount
  0.00%  2.0860us         5     417ns     292ns     506ns  cuDeviceGet
  0.00%  1.8050us         2     902ns     789ns  1.0160us  cuInit
  0.00%     899ns         2     449ns     445ns     454ns  cuDriverGetVersion
  0.00%     123ns         1     123ns     123ns     123ns  cudaRuntimeGetVersion

lightvector commented 6 years ago

Out of curiosity, is it faster if instead of concatenating you add after the next convolution? Using the following identity or similar: conv3x3(concat(x,y), [x_channels+y_channels, output_channels]) = conv3x3(x, [x_channels, output_channels]) + conv3x3(y, [y_channels, output_channels])

It's probably worse to do it this way because the next convolution ends up split up so you lose benefits of greater 'batching', but if concat is particularly expensive for some reason then there's an off-chance it's better.

kblomdahl commented 6 years ago

I will try it, since my current implementation does not really have good memory access patterns due to the concatenation forcing me to temporarily re-write them as CNHW. Your re-formulation would allow us to keep NCHW the entire way, at the expense of more kernel launches. The sum of convolutions is also really fast to calculate since cuDNN fuses that into the convolution kernel (by allowing you to blend the input and output arrays).

I think I also screwed up my SIMD multiplier since if one looks at the runtime of a Winograd kernel over different number of output channels you can see some clear bumps on the graph where the number of outputs channels are a multiple of 32:

runtime_over_channels

If you are curious my dilation implementation at the moment is pretty much the following, notice how both c, d, and y needs to either read or write using a sub-optimal memory layout:

c <- transpose(conv3x3(x, [in_channels, c_channels]), [1, 0, 2, 3])  # as a fused op by specifying the strides of c to the transpose
d <- transpose(conv3x3(x, [in_channels, d_channels]), [1, 0, 2, 3])  # as a fused op by specifying the strides of d to the transpose
y <- transpose(c ++ d, [1, 0, 2, 3])  # list concatenation (fused since c and d is in continuous memory) followed by transpose
...  # continue as normal with y

For the sake of transparency these are the benchmark number before I added dilation:

test batch_size_01 ... bench:     810,021 ns/iter (+/- 55,133)
test batch_size_02 ... bench:   1,159,539 ns/iter (+/- 49,392)
test batch_size_04 ... bench:   1,675,931 ns/iter (+/- 60,894)
test batch_size_08 ... bench:   3,327,576 ns/iter (+/- 90,876)
test batch_size_16 ... bench:   6,670,638 ns/iter (+/- 364,274)
test batch_size_32 ... bench:  13,517,297 ns/iter (+/- 509,426)
test batch_size_64 ... bench:  27,505,117 ns/iter (+/- 929,449)

These are the current benchmark numbers using the algorithm described in the previous section:

test batch_size_01 ... bench:   3,174,957 ns/iter (+/- 281,523)
test batch_size_02 ... bench:   2,547,549 ns/iter (+/- 185,185)
test batch_size_04 ... bench:   2,984,062 ns/iter (+/- 280,443)
test batch_size_08 ... bench:   4,251,221 ns/iter (+/- 349,213)
test batch_size_16 ... bench:   8,162,312 ns/iter (+/- 750,470)
test batch_size_32 ... bench:  16,532,217 ns/iter (+/- 3,610,558)
test batch_size_64 ... bench:  33,221,823 ns/iter (+/- 3,071,336)

kblomdahl commented 6 years ago

I finished my mock-up implementation of the two ideas mentioned above and they are of mixed success. Changing the channel count was slightly better, while avoiding the concatenation does not seem to be worth it (probably due to the lack of a good fused kernel) and batching.

Channels as a multiple of 32

This gave a performance improvement of 6%, so nothing groundbreaking but a solid improvement:

test batch_size_01 ... bench:   2,582,608 ns/iter (+/- 85,706)
test batch_size_02 ... bench:   2,466,245 ns/iter (+/- 91,409)
test batch_size_04 ... bench:   2,885,751 ns/iter (+/- 156,826)
test batch_size_08 ... bench:   4,103,070 ns/iter (+/- 199,810)
test batch_size_16 ... bench:   7,735,140 ns/iter (+/- 267,049)
test batch_size_32 ... bench:  15,522,847 ns/iter (+/- 711,294)
test batch_size_64 ... bench:  31,670,363 ns/iter (+/- 571,301)

For the sake of completion this is a trace of the CUDA calls performed during a single residual block (for a batch size of 256). You can clearly see the issue being that the convolution and dilation both takes the same amount of time, so adding dilation effectively increased the amount of work during each residual block from 2 to 3. This match up with the observations elsewhere as it would predict a 33% performance loss:

   Start  Duration            Grid Size      Block Size     Regs*    SSMem*    DSMem*      Size  Throughput           Device   Context    Stream  Name
...
100.850s  1.1007ms            (722 1 1)         (8 8 1)        55  2.2500KB        0B         -           -  GeForce GTX 108         1       275  void cudnn::detail::implicit_convolve_sgemm<float, float, int=128, int=5, int=5, int=3, int=3, int=3, int=1, bool=1, bool=0, bool=0>(int, int, int, float const *, int, float*, cudnn::detail::implicit_convolve_sgemm<float, float, int=128, int=5, int=5, int=3, int=3, int=3, int=1, bool=1, bool=0, bool=0>*, kernel_conv_params, int, float, float, int, float, float, int, int) [7493139]
100.850s  44.693us             (3 32 1)        (32 4 1)        40  8.5000KB        0B         -           -  GeForce GTX 108         1       274  void cudnn::winograd::generateWinogradTilesKernel<int=0, float, float>(cudnn::winograd::GenerateWinogradTilesParams<float, float>) [7493102]
100.850s  1.2420ms            (48 10 2)       (256 1 1)       128  48.000KB        0B         -           -  GeForce GTX 108         1       274  maxwell_scudnn_winograd_128x128_ldg1_ldg4_relu_tile418n_nt [7493106]
100.852s  51.348us            (16 32 1)        (32 1 4)        32        0B        0B         -           -  GeForce GTX 108         1       275  void op_tensor_kernel<int=2, float, float, float, int=32, int=1, int=4, int=2, cudnnOpTensorOp_t=2, cudnnNanPropagation_t=0>(cudnnTensorStruct, float*, cudnnTensorStruct, float const *, cudnnTensorStruct, float const *, float, float, float) [7493152]
100.852s  97.768us            (16 96 1)        (32 1 4)        32        0B        0B         -           -  GeForce GTX 108         1       274  void op_tensor_kernel<int=2, float, float, float, int=32, int=1, int=4, int=2, cudnnOpTensorOp_t=2, cudnnNanPropagation_t=0>(cudnnTensorStruct, float*, cudnnTensorStruct, float const *, cudnnTensorStruct, float const *, float, float, float) [7493118]
100.852s  172.69us            (46 9 17)         (8 8 8)         8  2.7500KB        0B         -           -  GeForce GTX 108         1       263  void cudnn::detail::cubeTransposeDeviceGeneric_kernel<float, float, float, int=8, int=8, int=8, int=8, int=8, int=11, int=5>(int, int, int, int, int, int, int, int, int, float, float const *, float*) [7493172]
100.852s  15.996us             (4 32 1)        (32 4 1)        40  8.5000KB        0B         -           -  GeForce GTX 108         1       263  void cudnn::winograd::generateWinogradTilesKernel<int=0, float, float>(cudnn::winograd::GenerateWinogradTilesParams<float, float>) [7493176]
100.852s  1.3144ms            (64 10 2)       (256 1 1)       128  48.000KB        0B         -           -  GeForce GTX 108         1       263  maxwell_scudnn_winograd_128x128_ldg1_ldg4_relu_tile418n_nt [7493180]
100.853s  123.39us           (16 128 1)        (32 1 4)        32        0B        0B         -           -  GeForce GTX 108         1       263  void op_tensor_kernel<int=2, float, float, float, int=32, int=1, int=4, int=2, cudnnOpTensorOp_t=2, cudnnNanPropagation_t=0>(cudnnTensorStruct, float*, cudnnTensorStruct, float const *, cudnnTensorStruct, float const *, float, float, float) [7493192]

Alternative formulation of concatenation

This turned out to be a bit problematic to implement this as we needed to not do a rectified linear unit on the final result of the addition of two convolutions. This probably does not sound too hard but we are using the fused operator cudnnConvolutionBiasActivationForward which does the work of three kernels in one. In order to implement this we provided an alternative path which splits this into two separate calls to cudnnConvolutionForward, and cudnnAddTensor (followed by one cudnnActivationForward on the final sum), so we turned one kernel into five (could be made into four as the two bias weights can be merged).

There is also, as observed by lightvector, less batching with this approach which is typically bad for performance. Interestingly enough this approach has a systematic advantage for a batch size of one:

test batch_size_01 ... bench:   2,205,420 ns/iter (+/- 46,547)
test batch_size_02 ... bench:   2,548,588 ns/iter (+/- 83,218)
test batch_size_04 ... bench:   3,160,420 ns/iter (+/- 335,384)
test batch_size_08 ... bench:   4,500,350 ns/iter (+/- 78,394)
test batch_size_16 ... bench:   7,852,688 ns/iter (+/- 359,734)
test batch_size_32 ... bench:  16,937,519 ns/iter (+/- 600,127)
test batch_size_64 ... bench:  34,805,903 ns/iter (+/- 738,911)

The profiling output for this approach suggest the bottleneck are the two non-fused convolutional kernels (note the lack of a relu suffix in the first kernel):

Time(%)      Time     Calls       Avg       Min       Max  Name
 27.19%  8.35323s     32472  257.24us  22.703us  5.5128ms  maxwell_scudnn_winograd_128x128_ldg1_ldg4_tile418n_nt
 24.92%  7.65386s     29745  257.32us  114.96us  4.4123ms  void cudnn::detail::implicit_convolve_sgemm<float, float, int=128, int=5, int=5, int=3, int=3, int=3, int=1, bool=1, bool=0, bool=0>(int, int, int, float const *, int, float*, cudnn::detail::implicit_convolve_sgemm<float, float, int=128, int=5, int=5, int=3, int=3, int=3, int=1, bool=1, bool=0, bool=0>*, kernel_conv_params, int, float, float, int, float, float, int, int)
 19.35%  5.94467s     25248  235.45us  24.578us  3.6630ms  maxwell_scudnn_winograd_128x128_ldg1_ldg4_relu_tile418n_nt
  7.23%  2.22092s     86558  25.658us  4.2700us  2.2978ms  void op_tensor_kernel<int=2, float, float, float, int=32, int=1, int=4, int=2, cudnnOpTensorOp_t=2, cudnnNanPropagation_t=0>(cudnnTensorStruct, float*, cudnnTensorStruct, float const *, cudnnTensorStruct, float const *, float, float, float)
  4.54%  1.39320s     48690  28.613us  3.6250us  3.0429ms  void add_tensor_kernel_v3<int=2, float, float, int=32, int=1, int=4, int=2, int=2>(cudnnTensorStruct, float*, cudnnTensorStruct, float const *, float, float)
  3.94%  1.21004s     18020  67.149us  43.379us  2.0338ms  maxwell_scudnn_128x128_relu_small_nn
  2.37%  729.19ms      5418  134.59us  124.24us  488.78us  void cudnn::detail::implicit_convolve_sgemm<float, float, int=1024, int=5, int=5, int=3, int=3, int=3, int=1, bool=1, bool=0, bool=0>(int, int, int, float const *, int, float*, cudnn::detail::implicit_convolve_sgemm<float, float, int=1024, int=5, int=5, int=3, int=3, int=3, int=1, bool=1, bool=0, bool=0>*, kernel_conv_params, int, float, float, int, float, float, int, int)
  2.28%  701.67ms     91376  7.6780us  3.6250us  1.7249ms  void cudnn::winograd::generateWinogradTilesKernel<int=0, float, float>(cudnn::winograd::GenerateWinogradTilesParams<float, float>)
  2.06%  631.92ms     39070  16.173us  1.6280us  1.4267ms  void cudnn::detail::activation_fw_4d_kernel<float, float, int=128, int=1, int=4, cudnn::detail::relu_func<float, cudnnNanPropagation_t=0, bool=0>>(cudnnTensorStruct, float const *, cudnn::detail::activation_fw_4d_kernel<float, float, int=128, int=1, int=4, cudnn::detail::relu_func<float, cudnnNanPropagation_t=0, bool=0>>, cudnnTensorStruct*, float, cudnnTensorStruct*, int, cudnnTensorStruct*)
  1.37%  421.88ms     53598  7.8710us     609ns  125.44us  [CUDA memcpy HtoD]
  1.00%  308.42ms     21636  14.255us  9.2160us  332.77us  maxwell_scudnn_winograd_128x128_ldg1_ldg4_tile148n_nt
  0.87%  268.24ms     12020  22.315us  12.811us  756.82us  maxwell_scudnn_winograd_128x128_ldg1_ldg4_relu_tile148n_nt
  0.68%  208.19ms      7814  26.643us  13.517us  427.71us  maxwell_scudnn_128x32_relu_interior_nn
  0.60%  184.39ms     49278  3.7410us  2.6420us  199.54us  void op_tensor_kernel<int=2, float, float, float, int=128, int=1, int=1, int=4, cudnnOpTensorOp_t=2, cudnnNanPropagation_t=0>(cudnnTensorStruct, float*, cudnnTensorStruct, float const *, cudnnTensorStruct, float const *, float, float, float)
  0.29%  87.697ms     16537  5.3030us  3.8090us  11.305us  void op_tensor_kernel<int=2, float, float, float, int=64, int=1, int=2, int=4, cudnnOpTensorOp_t=2, cudnnNanPropagation_t=0>(cudnnTensorStruct, float*, cudnnTensorStruct, float const *, cudnnTensorStruct, float const *, float, float, float)
  0.19%  59.357ms     57218  1.0370us     430ns  113.24us  [CUDA memcpy DtoH]
  0.18%  55.777ms     25834  2.1590us  1.3210us  13.579us  cudnn::maxwell::gemm::computeOffsetsKernel(cudnn::maxwell::gemm::ComputeOffsetsParams)
  0.16%  49.578ms     11721  4.2290us  1.8130us  140.49us  void add_tensor_kernel_v3<int=2, float, float, int=16, int=16, int=1, int=16, int=4>(cudnnTensorStruct, float*, cudnnTensorStruct, float const *, float, float)
  0.15%  45.971ms      4206  10.929us  6.7900us  16.528us  void gemmSN_NN_kernel<float, float, float, int=128, int=2, int=4, int=8, int=4, int=4>(cublasGemmSmallNParams<float, float, float>, float const *, float const *, float, float, int)
  0.13%  41.387ms     16218  2.5510us  2.3650us  11.090us  void add_tensor_kernel_v3<int=2, float, float, int=128, int=1, int=1, int=4, int=2>(cudnnTensorStruct, float*, cudnnTensorStruct, float const *, float, float)
  0.11%  34.524ms      2103  16.416us  13.456us  20.675us  void gemmSN_NN_kernel<float, float, float, int=256, int=4, int=2, int=8, int=4, int=4>(cublasGemmSmallNParams<float, float, float>, float const *, float const *, float, float, int)
  0.11%  33.634ms      3907  8.6080us  4.0550us  1.5123ms  void cudnn::detail::softmax_fw_kernel_resident<int=2, float, float, int=256, int=1, int=0, int=0, int=32, int=0>(cudnnTensorStruct, float const *, cudnn::detail::softmax_fw_kernel_resident<int=2, float, float, int=256, int=1, int=0, int=0, int=32, int=0>, float*, int, float, float*, int, int)
  0.09%  28.488ms      1806  15.773us  6.4810us  24.448us  sgemm_32x32x32_NN
  0.05%  15.373ms      5418  2.8370us  2.5020us  10.538us  void add_tensor_kernel_v3<int=2, float, float, int=64, int=1, int=2, int=4, int=2>(cudnnTensorStruct, float*, cudnnTensorStruct, float const *, float, float)
  0.05%  15.072ms      2703  5.5750us  2.1200us  11.871us  void gemv2N_kernel_val<float, float, float, int=128, int=32, int=4, int=4, int=1>(float, float, cublasGemv2Params_v2<float, float, float>)
  0.04%  12.135ms      3907  3.1060us  2.3350us  4.8130us  void cudnn::detail::activation_fw_4d_kernel<float, float, int=128, int=1, int=4, cudnn::detail::tanh_func<float>>(cudnnTensorStruct, float const *, cudnn::detail::activation_fw_4d_kernel<float, float, int=128, int=1, int=4, cudnn::detail::tanh_func<float>>, cudnnTensorStruct*, float, cudnnTensorStruct*, int, cudnnTensorStruct*)
  0.02%  5.4156ms       602  8.9960us  6.2360us  12.256us  void gemmSN_NN_kernel<float, float, float, int=128, int=2, int=4, int=8, int=2, int=4>(cublasGemmSmallNParams<float, float, float>, float const *, float const *, float, float, int)
  0.01%  3.9482ms       301  13.117us  12.596us  13.700us  void gemmSN_NN_kernel<float, float, float, int=256, int=4, int=2, int=8, int=2, int=4>(cublasGemmSmallNParams<float, float, float>, float const *, float const *, float, float, int)
  0.00%  4.5140us         7     644ns     527ns     817ns  [CUDA memset]

==43157== API calls:
Time(%)      Time     Calls       Avg       Min       Max  Name
 86.08%  29.2245s    110361  264.81us  5.7750us  1.37722s  cudaMemcpyAsync
  8.62%  2.92516s    562608  5.1990us  3.7800us  5.4377ms  cudaLaunch
  2.15%  728.72ms        56  13.013ms  7.8670us  728.07ms  cudaStreamCreateWithFlags
  0.93%  316.50ms   3114505     101ns      70ns  1.8341ms  cudaSetupArgument
  0.63%  214.91ms       472  455.33us  3.2420us  201.55ms  cudaMalloc
  0.60%  204.37ms       695  294.06us     730ns  193.87ms  cudaFree
  0.31%  106.35ms     35163  3.0240us  2.5810us  1.4839ms  cudaBindTexture
  0.20%  66.370ms    562608     117ns      73ns  1.4978ms  cudaConfigureCall
  0.19%  65.607ms    588442     111ns      66ns  1.5977ms  cudaGetLastError
  0.11%  37.160ms     35163  1.0560us     950ns  33.397us  cudaUnbindTexture
  0.07%  24.154ms     15628  1.5450us     757ns  275.20us  cudaEventRecord
  0.03%  11.726ms      7814  1.5000us     945ns  26.942us  cudaStreamSynchronize
  0.03%  9.0670ms       455  19.927us  3.6530us  1.3312ms  cudaMemcpy
  0.02%  8.0676ms      7814  1.0320us     718ns  32.317us  cudaStreamWaitEvent
  0.01%  2.5800ms       147  17.551us  8.2250us  153.49us  cudaStreamCreate
  0.01%  2.2455ms       210  10.692us  1.6500us  342.99us  cudaStreamDestroy
  0.00%  918.15us       277  3.3140us     256ns  145.05us  cuDeviceGetAttribute
  0.00%  665.98us         7  95.139us  16.184us  561.53us  cudaHostAlloc
  0.00%  448.76us         7  64.108us  7.1340us  388.77us  cudaFreeHost
  0.00%  382.35us       483     791ns     597ns  6.2300us  cudaEventDestroy
  0.00%  276.10us       287     962ns     781ns  8.1550us  cudaEventCreateWithFlags
  0.00%  224.82us         3  74.941us  72.614us  78.276us  cuDeviceTotalMem
  0.00%  211.42us       196  1.0780us     818ns  4.7000us  cudaEventCreate
  0.00%  156.28us       263     594ns     509ns  1.5440us  cudaDeviceGetAttribute
  0.00%  104.41us         3  34.802us  33.847us  35.309us  cuDeviceGetName
  0.00%  71.409us         7  10.201us  8.5030us  13.710us  cudaStreamCreateWithPriority
  0.00%  50.358us        28  1.7980us  1.6430us  2.3720us  cudaThreadSynchronize
  0.00%  50.043us         7  7.1490us  5.9380us  8.6380us  cudaMemsetAsync
  0.00%  27.197us        21  1.2950us     645ns  2.0280us  cudaGetDevice
  0.00%  19.687us         7  2.8120us  2.0730us  5.1200us  cudaDeviceSynchronize
  0.00%  10.260us         7  1.4650us  1.2320us  2.5320us  cudaHostGetDevicePointer
  0.00%  8.8170us         7  1.2590us  1.1510us  1.5340us  cudaDeviceGetStreamPriorityRange
  0.00%  2.3580us         5     471ns     257ns  1.2140us  cuDeviceGetCount
  0.00%  2.0630us         5     412ns     261ns     643ns  cuDeviceGet
  0.00%  1.5910us         2     795ns     790ns     801ns  cuInit
  0.00%     768ns         2     384ns     293ns     475ns  cuDriverGetVersion
  0.00%     135ns         1     135ns     135ns     135ns  cudaRuntimeGetVersion

kblomdahl commented 6 years ago

Adding a single dilated convolution had some good effects on the global perspective of Dream Go. After training for about 2 days on human games it only recognized two of our test cases as valid. Our previous, non-dilated, version recognized none of the test cases so this is still an improvement:

test ladder_1 ... ok
test ladder_3 ... FAILED
test dead_dragon_3 ... FAILED (-0.9999401)
test dead_dragon_1 ... ok
test dead_dragon_4 ... ok
test dead_dragon_2 ... FAILED (-0.3170574)
test ladder_2 ... ok
test end_1 ... ok

More dilation?

Since a single dilated convolution did not give a large enough effect I figured I could try adding two dilated convolutions (with dilation 2 and 3) to increase the peripheral vision of each residual block even further. With this enhancement each residual block effectively sees a 7x7 block, allowing information to travel from one side to another in only 3 residual block (in theory).

With this change each residual blocks gets this architecture:

x
├───┬───╮
D₁  D₂  D₃
├───┴───╯
C
│
y

x is the input to the residual block
D₁ is a convolutional layer with dilation 1 with 128 output channels.
D₂ is a convolutional layer with dilation 2 with 32 output channels.
D₃ is a convolutional layer with dilation 3 with 32 output channels.
C is a normal convolutional layer.
y is the output of the residual block

As you can observe I also increased the number of channels from 128 to 192 since we were afraid of the local shape information getting lost if we reduced the number of output channels to 64/32/32. This introduce additional variables to take into account when evaluating this change, but historically increasing the number of features has not helped much with the global perspective.

This architecture does very well on our test cases, the neural network only fails one of the dead dragon tests. The test that is fails is a game that white should win by 7.5 points, because a black dragon has one, and a false eye, if the neural network misjudged the group as alive then black would win by 72.5 points:

test ladder_1 ... ok
test dead_dragon_2 ... ok
test dead_dragon_3 ... FAILED (-0.056128964)
test dead_dragon_1 ... ok
test dead_dragon_4 ... ok
test ladder_2 ... ok
test ladder_3 ... ok
test end_1 ... ok

As you can see the neural network judge the game as being pretty close, which suggests that it does not consider the dragon to be fully alive. But considering there is nothing else on the board that is undecided it is still a clear failure.

At the time of writing this the neural network has ran for 148.5k out of 245.7k steps so it has not been fully trained and may therefore be subject to change.

The performance of the neural network is as one would expect from the posts above, not great. It is 66% slower than the original neural network, which again correspond closely to the expected slowdown of

running 9 tests
test batch_size_01  ... bench:   4,376,127 ns/iter (+/- 54,739)
test batch_size_02  ... bench:   3,998,227 ns/iter (+/- 93,668)
test batch_size_04  ... bench:   4,852,502 ns/iter (+/- 79,824)
test batch_size_08  ... bench:   7,454,700 ns/iter (+/- 1,525,497)
test batch_size_16  ... bench:  11,036,062 ns/iter (+/- 579,864)
test batch_size_32  ... bench:  20,837,262 ns/iter (+/- 1,498,197)
test batch_size_64  ... bench:  40,088,689 ns/iter (+/- 4,096,839)
test batch_size_128 ... bench:  78,774,026 ns/iter (+/- 1,662,254)
test batch_size_256 ... bench: 161,106,596 ns/iter (+/- 9,528,660)

However if this is the price we have to pay for good predictions then that is an acceptable trade-off. But I still need to check so that this is not an artificial increase in strength (and the loss of quantity vs quality of rollouts is not worth it).

kblomdahl commented 6 years ago

I also trained a 128 channel version of the architecture described above with 32 channels in total devoted to dilations, so according to the previous diagram:

x is the input to the residual block
D₁ is a convolutional layer with dilation 1 with 96 output channels.
D₂ is a convolutional layer with dilation 2 with 16 output channels.
D₃ is a convolutional layer with dilation 3 with 16 output channels.
C is a normal convolutional layer.
y is the output of the residual block

This is, as expected, in-between the 1-dilation network and the 2-3 dilation network in terms of performance and precision. Unfortunately it completely misjudge the two dead dragons that are marked as FAILED, but at least it succeeds on some of them:

test ladder_1 ... ok
test ladder_2 ... ok
test ladder_3 ... FAILED
test dead_dragon_1 ... FAILED (-0.99884653)
test dead_dragon_2 ... ok
test dead_dragon_3 ... FAILED (-1)
test dead_dragon_4 ... ok
test end_1 ... ok

test batch_size_01  ... bench:   3,115,843 ns/iter (+/- 42,605)
test batch_size_02  ... bench:   2,646,697 ns/iter (+/- 75,162)
test batch_size_04  ... bench:   3,049,992 ns/iter (+/- 78,922)
test batch_size_08  ... bench:   4,227,070 ns/iter (+/- 73,766)
test batch_size_16  ... bench:   6,733,515 ns/iter (+/- 161,320)
test batch_size_32  ... bench:  12,286,137 ns/iter (+/- 249,813)
test batch_size_64  ... bench:  23,085,660 ns/iter (+/- 381,639)
test batch_size_128 ... bench:  44,886,785 ns/iter (+/- 765,558)
test batch_size_256 ... bench:  89,094,959 ns/iter (+/- 862,408)

Currently running a tournament between four different programs to determine which version of the programs is the best one. The settings are fast (but not blitz) games, with chinese scoring:

Each program is given 5 minutes of main time, and 3 seconds byo yomi.
No pondering, because both programs will share the GPU.
Chinese rules, with a komi of 7.5

The following programs are part of the test, all of them were trained using the same hyper-parameters but a random seed:

leela - Leela 0.11.0
dg-d-128-1 - No dilation
dg-d-128-2 - 96 convolution, and 32 2-dilation
dg-d-128-2-3 - 96 convolution, 16 2-dilation, and 16 3-dilation.
dg-d-192-2-3 - 128 convolution, 32 2-dilation, and 32 3-dilation.

I will update the following section with the results, but the expected results would be the following ranking based on the assumption that the network sanity tests have some correlation to reality. leela has been omitted from the list because it is just there to get some anchor to reality. The associated number of the number of steps per second during training (so higher is better):

dg-d-192-2-3 (0.95)
dg-d-128-2 (1.80)
dg-d-128-2-3 (1.34)
dg-d-128-1 (2.29)

This trial was cancelled after 37 games (for every match-up, so a total of 367 games) had been played since some match-ups could be eliminated due to a winner having already been determined. The most notable of which is all matches against leela, which performing very badly for some reason (pretty sure it should be stronger than this). The other candidate that could be eliminated is dg-d-192-2-3 which performed the worst of all candidates.

The remaining three candidates were put into another match-up that we can use to determine which of them were worth continuing with:

leela v dg-d-128-1 (37/50 games)
board size: 19   komi: 7.5
             wins              black         white        avg cpu
leela           1  2.70%       0   0.00%     1    5.56%    146.53
dg-d-128-1     36 97.30%       17 94.44%     19 100.00%    340.32
                               17 45.95%     20  54.05%

leela v dg-d-128-2 (37/50 games)
board size: 19   komi: 7.5
             wins              black         white        avg cpu
leela           2  5.41%       0   0.00%     2   11.11%    163.02
dg-d-128-2     35 94.59%       16 88.89%     19 100.00%    404.39
                               16 43.24%     21  56.76%

leela v dg-d-128-2-3 (37/50 games)
board size: 19   komi: 7.5
               wins              black          white       avg cpu
leela             1  2.70%       1    5.26%     0   0.00%    158.89
dg-d-128-2-3     36 97.30%       18 100.00%     18 94.74%    413.07
                                 19  51.35%     18 48.65%

leela v dg-d-192-2-3 (37/50 games)
board size: 19   komi: 7.5
               wins              black          white       avg cpu
leela             1  2.70%       1    5.26%     0   0.00%    171.99
dg-d-192-2-3     36 97.30%       18 100.00%     18 94.74%    377.70
                                 19  51.35%     18 48.65%

dg-d-128-1 v dg-d-128-2 (37/50 games)
board size: 19   komi: 7.5
             wins              black         white       avg cpu
dg-d-128-1     24 64.86%       13 68.42%     11 61.11%    630.82
dg-d-128-2     13 35.14%       7  38.89%     6  31.58%    675.98
                               20 54.05%     17 45.95%

dg-d-128-1 v dg-d-128-2-3 (37/50 games)
board size: 19   komi: 7.5
               wins              black         white       avg cpu
dg-d-128-1       15 40.54%       7  36.84%     8  44.44%    659.20
dg-d-128-2-3     22 59.46%       10 55.56%     12 63.16%    700.49
                                 17 45.95%     20 54.05%

dg-d-128-1 v dg-d-192-2-3 (36/50 games)
board size: 19   komi: 7.5
               wins              black         white       avg cpu
dg-d-128-1       22 61.11%       13 72.22%     9  50.00%    752.21
dg-d-192-2-3     14 38.89%       9  50.00%     5  27.78%    608.40
                                 22 61.11%     14 38.89%

dg-d-128-2 v dg-d-128-2-3 (36/50 games)
board size: 19   komi: 7.5
               wins              black         white       avg cpu
dg-d-128-2       18 50.00%       8  44.44%     10 55.56%    632.58
dg-d-128-2-3     18 50.00%       8  44.44%     10 55.56%    653.96
                                 16 44.44%     20 55.56%

dg-d-128-2 v dg-d-192-2-3 (36/50 games)
board size: 19   komi: 7.5
               wins              black         white       avg cpu
dg-d-128-2       23 63.89%       13 72.22%     10 55.56%    858.17
dg-d-192-2-3     13 36.11%       8  44.44%     5  27.78%    651.65
                                 21 58.33%     15 41.67%

dg-d-128-2-3 v dg-d-192-2-3 (36/50 games)
board size: 19   komi: 7.5
               wins              black         white       avg cpu
dg-d-128-2-3     20 55.56%       8  44.44%     12 66.67%    710.56
dg-d-192-2-3     16 44.44%       6  33.33%     10 55.56%    585.97
                                 14 38.89%     22 61.11%

gomill-sp.ctl.games.zip

After these match-ups the following ELO could be estimated, there are not enough games to determine an accurate rank and I consider the top 3 to be essentially equal. It is unknown why dg-d-192-2-3 performed as bad as it did, but the main theory would be that the larger network is slower, and the increased accuracy does not compensate for the decreased number of rollouts the engine can perform:

leela:0.11.0                    0.00
dg-d-192-2-3:0.5.0            533.04
dg-d-128-2:0.5.0              579.99
dg-d-128-2-3:0.5.0            612.88
dg-d-128-1:0.5.0              616.77

The same argument can be used to explain why dg-d-128-1 is ranked as number one, despite performing the worst of all networks on the sanity tests. But looking at the actual games listed above one can come to a different conclusion.

The dg-d-128-1 network plays very good local shape, but usually fails to account for global properties. However, most of the time good local shape leads to good global shape so it doesn't have too worry that often. Adding any dilation will reduce the number of channels the neural network can use to generate good local shape, in two ways:

Fewer channels devoted to local shape means it can recognize fewer patterns.
The regularization tries to force the network to consider all features equally, and because of how we concatenate the dilated and non-dilated channels it always has to think globally.

These two reasons interact, since there are more large patterns than there are local pattern, so it has to look at the global scope but has fewer channels to do so. This will result in it having to generalize global shape into local shape, which may not always work out. A few other observations to keep in mind:

dg-d-128-2-3 has 25% of its channels devoted to global thinking.
dg-d-192-2-3 has 33% of its channels devoted to global thinking.

So the larger the fraction of channels that are devoted to global thinking, the harder it will be for the network to be able to recognize local shape (because of the regularization factor mentioned above).

kblomdahl commented 6 years ago

The problem presented above have two issues:

Global properties that has been blended into local properties in previous residual blocks.
The second convolutional layer in each residual block having to consider both dilated and non-dilated features as equals due to the regularization.

The second problem is easy to solve, we could just decrease the regularization coefficient or drop the second residual blocks from the regularization completely. We could also do some gated architecture as below, using, for example, the batch normalization scale parameter as G₂ and G₃:

x
├───┬───╮
D₁  D₂  D₃
│   │   │
│   G₂  G₃
├───┴───╯
C
│
y

It is not obvious if we have to solve the first problem, or if solving the second is enough for the optimizer to reserve some channels for local properties on its own. The only solution to the first problem we can think of would be to run separate towers for the different dilation levels and then combine then at the final layer but this has several issues on its own. Some hybrid approaches where only some residual blocks use dilation might also work.

lightvector commented 6 years ago

What do you mean by "local" properties versus "global" properties? If either way a property of the Go position is computed accurately ("this stone belongs to a group that has only one eye within radius 6 of this location") it does not matter if the computation of that property involved convolutions with different dilation levels or not. Some properties may be easier or harder to compute using different mixes of different dilations of course, but I think there's no reason to try to avoid blending them, because there's no such thing in the first place as an intrinsically "dilation 1" feature or a "dilation 2" feature that can only usefully be used by further convolutions with the exact same dilation factor.

I'm possibly misunderstanding something?

Also, I'm curious- what regularization are you referring to? Keep in mind that if you're using an L2 penalty on your weights but you're also using relus and batchnorm, then my understanding is that the L2 penalty does not have a significant regularization effect to begin with, so it has no relevance to whether any features are on equal footing with others or not. But if you're using a different regularization method things might be different.

It's cool to see these updates. I'd be interested to hear if you have results from your blitz games yet - it's possible the reduced performance is a bigger cost than the gain from better large-scale understanding, but if not, that would be really neat. :)

On Wed, Apr 11, 2018 at 1:32 PM, Karl Sundequist Blomdahl < notifications@github.com> wrote:

The problem presented above have two issues:

Global properties that has been blended into local properties in previous residual blocks.

The second convolutional layer in each residual block having to consider both dilated and non-dilated features as equals due to the regularization.

The second problem is easy to solve, we could just decrease the regularization coefficient or drop the second residual blocks from the regularization completely. We could also do some gated architecture as below, using, for example, the batch normalization scale parameter as G₂ and G₃:

x ├───┬───╮ D₁ D₂ D₃ │ │ │ │ G₂ G₃ ├───┴───╯ C │ y

It is not obvious if we have to solve the first problem, or if solving the second is enough for the optimizer to reserve some channels for local properties on its own. The only solution to the first problem we can think of would be to run separate towers for the different dilation levels and then combine then at the final layer but this has several issues on its own. Some hybrid approaches where only some residual blocks use dilation might also work.

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/Chicoryn/dream-go/issues/25#issuecomment-380533853, or mute the thread https://github.com/notifications/unsubscribe-auth/ALY5-6E2_jHqesTngD0ILHBuA9GLHl-Bks5tnj4xgaJpZM4R7nmu .

kblomdahl commented 6 years ago

To answer your question in a random order:

1. Blitz games

I can add some blitz games, they should be fast to play. In fact I would not be surprised if this post has some in it since I started some just as I was typing this sentence and I'm planning on writing a fair bit more.

These blitz games are what we internally refer to as policy play games, i.e. they play greedily according to what the neural network suggests with no search. So these results should indicate the quality of the neural network predictions:

dg-d-128-1 v dg-d-128-2 (100/100 games)
board size: 19   komi: 7.5
             wins              black         white       avg cpu
dg-d-128-1     45 45.00%       24 48.00%     21 42.00%      3.21
dg-d-128-2     55 55.00%       29 58.00%     26 52.00%      4.05
                               53 53.00%     47 47.00%

dg-d-128-1 v dg-d-128-2-3 (100/100 games)
board size: 19   komi: 7.5
               wins              black         white       avg cpu
dg-d-128-1       39 39.00%       18 36.00%     21 42.00%      3.23
dg-d-128-2-3     61 61.00%       29 58.00%     32 64.00%      4.18
                                 47 47.00%     53 53.00%

dg-d-128-1 v dg-d-192-2-3 (100/100 games)
board size: 19   komi: 7.5
               wins              black         white       avg cpu
dg-d-128-1       46 46.00%       23 46.00%     23 46.00%      3.31
dg-d-192-2-3     54 54.00%       27 54.00%     27 54.00%      4.97
                                 50 50.00%     50 50.00%

dg-d-128-2 v dg-d-128-2-3 (100/100 games)
board size: 19   komi: 7.5
               wins              black         white       avg cpu
dg-d-128-2       42 42.00%       21 42.00%     21 42.00%      4.06
dg-d-128-2-3     58 58.00%       29 58.00%     29 58.00%      4.16
                                 50 50.00%     50 50.00%

dg-d-128-2 v dg-d-192-2-3 (100/100 games)
board size: 19   komi: 7.5
               wins              black         white       avg cpu
dg-d-128-2       52 52.00%       25 50.00%     27 54.00%      4.21
dg-d-192-2-3     48 48.00%       23 46.00%     25 50.00%      5.03
                                 48 48.00%     52 52.00%

dg-d-128-2-3 v dg-d-192-2-3 (100/100 games)
board size: 19   komi: 7.5
               wins              black         white       avg cpu
dg-d-128-2-3     55 55.00%       28 56.00%     27 54.00%      4.33
dg-d-192-2-3     45 45.00%       23 46.00%     22 44.00%      5.03
                                 51 51.00%     49 49.00%

gomill-blitz.ctl.zip

Results mirror what my sanity tests suggests, that dilations results in higher quality predictions. The exception is dg-d-192-2-3 which performs below average (again). The estimated elo of the networks based on these games is the following:

dg-d-128-1:0.5.0                0.00
dg-d-192-2-3:0.5.0             29.83
dg-d-128-2:0.5.0               33.33
dg-d-128-2-3:0.5.0             77.23

I claim the reason dg-d-192-2-3 performs worse than dg-d-128-2-3 (even though the later could fit inside of the first), is because the ratio of dilated and non-dilated channels is larger in the first. There are other factors that could be responsible, such as random initialization and just bad luck (since we only played 100 games).

2. Global and local features / properties

I've been using global properties and local properties somewhat fuzzily intentionally since I cannot claim to understand exactly what the neural network computes in the first place, and I am not a professional baduk player. A better word for what I mean might be near periphical features and far periphical features. Where near periphical features contains information about stones close to the centre of each convolution, and far periphical features are about stones far from the centre of each convolution.

The networks with dilated convolutions seems to favour strategies that involves far periphical features, so things like large scale captures, and influence. The networks with dilation also perform worse during Life & Death problems, that would involve mostly near periphical features. This is all from me skimming some of the games in the archive I linked above and it is possible I am wrong.

The reason for this behaviour, I think, is that the way we have been adding dilation force the network to reserve some channels to far periphical features, whereas before it had some choice on this†. My current concerns about dilation is related to the forcing part, which is explored further in section 3.

Since different features are easier to compute with input from different dilation levels. If we force the network to always consider the far periphical features then it will have a harder time computing some features, and vice versa for near periphical features.

For example it may be hard to recognize an eye, if you have no choice but to look at 5x5 patterns since the stones marked with ? are irrelevant to whether it is an eye or not, and having to store all combinations of these 16 question marks takes up considerable space in the weights:

? ? ? ? ?
? X X X ?
? X   X ?
? X X X ?
? ? ? ? ?

This is of course a simplified example, the optimizer is not so stupid it would store all combinations of the 16 question marks. But something similar to this is going on since you can observe the style differences between the networks.

† Recent research calls the claim that it could choose to do some into question.

3. Regularization

I am using L2 regularization, batch normalization, and gradient clipping during the training. The last one is unnecessary at this point, but was useful while I still had a bug during initialization where it would sometimes try to factorize singular matrices (resulting in huge weights that would collapse the entire network to zero without clipping).

You are correct in that batch normalization and L2 regularization is somewhat redundant, but my understanding says that L2 regularization achieves two things:

~~It avoid the output of each layer exploding.~~ This is redundant with batch normalization.
Due to its quadratic nature, it encourage all weights to be roughly the same size. This is to avoid overfitting to only some features. Batch normalization does not do this.

It is the second effect that I am worried about since it says D₁, D₂, and D₃ must all be equally important to whatever C is computing (using notation from the figure below), and since some features are easier to compute without D₂ or D₃ this constraint might make it hard for the network to learn certain features.

x
├───┬───╮
D₁  D₂  D₃
├───┴───╯
C
│
y

4. Alternative Interpretation

My main reason for worrying about this is the fact that the network with dilation seems to play worse than than the network without dilation. The blitz game results suggests that this is mainly due to the lack of search due to time constraints, which would also give rise to the same problem where it would fail to notice vital moves during non-trivial situations as they require reading to spot.

5. Conclusion

I will train another network with the following configuration, without L2 regularization to see if it has any significant effect. I do not believe we will have any problems with overfitting, and if the L2 regularization had no impact then results should be the same as before:

D₁ - 96 channels
D₂ - 16 channels
D₃ - 16 channels

This mirrors the dg-d-128-2-3 architecture, and it seems to be the currently best network. I expect this to take about 2 days, 2 hours, and 30 minutes.

PS: My idea about separate towers for different dilations is probably stupid, and does not really make sense upon further inspection.

lightvector commented 6 years ago

1. (Blitz games) -

Sorry for the confusion: I misread your earlier post, where you said you were running some fast (non-blitz) games. Did those pan out, or were the new neural nets worse once taking into account the shallower search due to the worse performance?

For these games though - nice results!

2. (Near vs Far features)

I think I understand you. Yes, if the dilated channels are there, then the neural net will use them, and therefore devote some proportion of the non-dilated channels to computing features that are useful for the dilated channels, so long as that improves the overall loss function more than not doing so. That will obviously make it worse at doing whatever the excess non-dilated channels were doing before.

I don't think this is a problem though, it simply is a tradeoff of network capacity. To give a different example - imagine you originally did not provide any history planes as an input feature, and now you add some, but don't increase the number of channels in the rest of the neural net. Then obviously the neural net will get worse at some kinds of tactics, because it will be now devoting some of its channels to processing the new history information, instead of devoting them to whatever local shapes it was doing before. But doing so improves the overall prediction quality, because the new history information is strongly predictive of other things.

I don't think there is any special about dilated/nondilated, it is exactly like adding any other new information or representation capacity. It will cause a tradeoff to use the new capacity, but in a way that (unless you're experiencing major underfitting or overfitting) should be overall better in predictive ability. Of course, better predictive ability does not always mean more strength, because predictive ability and playing strength are two different things.

3 (Regularization)

I'm putting this in a separate reply because this gets pretty technical.

4. (Alternative Interpretation)

Yes, of course. As in #2 above, I think there is no significant pathology or problem with "mixing" this kind of information, if there is a loss in strength there is a good chance it is due to something like:

Performance cost of the new architecture.
Increased predictive ability of the net actually does not correspond improved strength because predictive accuracy is not the same thing as playing strength. (But your blitz games would seem to suggest this is not the case).

lightvector commented 6 years ago

3. (Regularization)

You mentioned this:

Due to its quadratic nature, it encourage all weights to be roughly the same size. This is to avoid overfitting to only some features. Batch normalization does not do this.

But actually, to first order L2 loss does NOT encourage all weights to be the same relative size as each other and does NOT have a significant effect on effect on overfitting, in the presence of batch normalization, and if you are using gradient descent. I could be mistaken, but I'm reasonably sure about this. This is a surprising fact if you have not thought about it before!

Why?

To first order, taking a gradient step due to (data loss + L2 loss) is approximately the same as taking a gradient step due to data loss and then taking a gradient step due to L2 loss.
A gradient step due to L2 loss moves each weight towards 0 directly proportional to the magnitude of the weight.
Therefore, a gradient step due to L2 loss is equivalent to multiplying all weights by a fixed scaling factor C for some C slightly less than 1.
Due to batch normalization, this has no effect on the prediction or on the data loss! (unless somehow you are frequently in the variance ~= 0 case, where the tiny constant term in the denominator that prevents a divide by zero comes into play).

This means there is no regularization effect or avoidance of overfitting. For example, imagine we are at a local minimum in data loss where weight A and B serve extremely similar purposes but weight A is twice as large as B. Then after scaling due to the L2 loss step, weight A will still be twice as large as B, and there will still be no data loss gradient to change them, so even as both shrink, A will remain twice as large as B forever.

This is different than if there was no batchnorm. In that case, there after scaling there would be a data loss gradient to re-increase both A and B since both are now too small. If A and B serve the same purpose, then A and B would experience the same gradient upward, but since B is only decreased by the L2 loss half as much but it is re-increased just as much, the net effect will over time to make A = B, as expected.

So with batch norm, the only thing that is affected by L2 loss is the global scale of the weights in each layer. This still does have an effect on the gradient, but only on the scale. Multiplying all the weights in a layer by a constant factor C followed by a batchnorm causes all gradients in that layer to be multiplied by 1/C during the backward pass, which is an effective factor of 1/C^2 in the relative gradient (relative to the magnitude of the weight).

If you are using momentum, then the picture is changed a bit, but broadly I think the same analysis holds. If you are using an entirely different kind of optimizer, such as ADAM, then the above does not hold, but I think L2 still have a very strange effect that is very different than the regularization it has without batchnorm.

Conclusion:

L2 loss causes no regularization effect or avoidance of overfitting when using batch normalization because it has no effect on the predictions or on the relative directions of gradients. Instead, to first order it only affects the scale of weights, so it is approximately the same as training without any L2 but performing a slight increase in the learning rate on every iteration.

kblomdahl commented 6 years ago

Blitz games

The result of the fast games are in this reply. I can understand the confusion since I tend to heavily edit my posts as most of the time they end up just being an experimental log that no one (?) except me reads.

In summary the results of the fast games were mixed, adding dilated convolutions produced some better and some worse (!) engines. But there was no significant jump in strength, I suspect mostly because of the performance issues, resulting in fewer rollouts for the networks using dilation.

Near and Far features

I think we agree here, which features were added / removed / changed does not really matter, and some change in behaviour is to be expected. If this L2 regularization thingy is something (I'll get to your second post later), then it would be a problem for features outside of dilation too.

My worry originally comes from the fact that dg-d-192-2-3 performs worse (even during blitz) than any other network using dilation despite being larger. But this could be because it is larger, but was trained for the same number of steps as the other networks so it received "less training per weight" (I am not sure if this is a thing).

Regularization

I am using SGD with momentum but I do not believe this is important for this discussion as the analysis should turn out the same.

I think your analysis is correct, but the assumption that the weight_decay (L2 regularization) and gradient descent will be done in sequence does not hold in practice where they are done independently.

If batch normalization and weight decay is performed independently then I believe the L2 regularization still encourage the weights to be roughly the same magnitude. This is because the SGD update formula with weight_decay and a constant gradient norm puts a hard limit on the size of the weights, and it will be hard for the optimizer to maintain the effort it needs to counter the weight_decay considering how noisy the gradients are in SGD (though less so when using momentum):

next_weights = weights - weight_decay · weights - gradients

If we want the next_weights to increase then gradients must be larger than the weight_decay · weights so weights has an upper bound of -gradients / weight_decay:

next_weights > weights
⇒ weight_decay · weights > -gradients
⇒ weights > -gradients / weight_decay

I might look further into this tomorrow, but it is getting a bit late at the moment.

Conclusion

Regardless of the reason, we both seem to agree that removing the L2 regularization I am using at the moment is a good idea, since:

You claim it does nothing except waste GPU cycles during training.
I claim it might hurt neural network performance.

lightvector commented 6 years ago

Sounds good.

Regarding the sequential application of L2 and data loss gradient, I agree that doing them not in sequence is a small difference, but it is much smaller than the first-order effect that batch norm takes away (consider that except at the very start of training, a single gradient update usually changes each weight by a miniscule fraction of a percent of the root mean square of the weights, so the second order effects from sequence vs simultaneous are very small).

Keep in mind that all of the analysis I wrote above is about the relative magnitudes of the weights. It doesn't matter if the optimizer can counter the weight decay or not, because decayed weights behave the same as undecayed weights if the next layer is a batch norm, what matters is the relative magnitude. The weight A and weight B example I gave is a good example to go back to. L2 penalty causes both A and B to shrink proportionally, with A continuing to be twice as large. The regularizing behavior would be to make A/B ~= 1, and if there is no batch norm, that is what you get, because the data provides pressure to increase them again, such that the local optimum has A/B ~= 1. But when there is batch norm, there is no such pressure, the decayed weights are just as good. If A and B both shrink enough and then random walk due to noise enough so that randomly B could become larger sometimes, then of course you are free to call that "regularization", but it is exactly the same kind of "regularization" as if you removed L2 loss entirely and just turned up the learning rate enough that B could randomly sometimes overtake A. Sometimes the opposite would happen and it would become even smaller or negative compared to A. This is a very different kind of "regularization" than one that actually encourages and converges to precisely A/B ~= 1 as the unique local optimum.

Edit: Fixed some spacing. Also, looking forward to any further updates in the future, this thread has been great and following it has been very interesting! :)

On Wed, Apr 11, 2018 at 10:08 PM, Karl Sundequist Blomdahl < notifications@github.com> wrote:

Blitz games

The result of the fast games are in this reply https://github.com/Chicoryn/dream-go/issues/25#issuecomment-379918474. I can understand the confusion since I tend to heavily edit my posts as most of the time they end up just being an experimental log that no one (?) except me reads.

In summary the results of the fast games were mixed, adding dilated convolutions produced some better and some worse (!) engines. But there was no significant jump in strength, I suspect mostly because of the performance issues, resulting in fewer rollouts for the networks using dilation. Near and Far features

I think we agree here, which features were added / removed / changed does not really matter, and some change in behaviour is to be expected. If this L2 regularization thingy is something (I'll get to your second post later), then it would be a problem for features outside of dilation too.

My worry originally comes from the fact that dg-d-192-2-3 performs worse (even during blitz) than any other network using dilation despite being larger. But this could be because it is larger, but was trained for the same number of steps as the other networks so it received "less training per weight" (I am not sure if this is a thing). Regularization

I am using SGD with momentum https://github.com/Chicoryn/dream-go/blob/master/contrib/trainer/dream_tf/__main__.py#L658 but I do not believe this is important for this discussion as the analysis should turn out the same.

I think your analysis is correct, but the assumption that the weight_decay (L2 regularization) and gradient descent will be done in sequence does not hold in practice http://pytorch.org/docs/master/_modules/torch/optim/sgd.html#SGD where they are done independently.

If batch normalization and weight decay is performed independently then I believe the L2 regularization still encourage the weights to be roughly the same magnitude. This is because the SGD update formula with weight_decay and a constant gradient norm puts a hard limit on the size of the weights, and it will be hard for the optimizer to maintain the effort it needs to counter the weight_decay considering how noisy the gradients are in SGD (though less so when using momentum):

next_weights = weights - weight_decay · weights - gradients

If we want the next_weights to increase then gradients must be larger than the weight_decay · weights so weights has an upper bound of -gradients / weight_decay:

next_weights > weights ⇒ weight_decay · weights > -gradients ⇒ weights > -gradients / weight_decay

I might look further into this tomorrow, but it is getting a bit late at the moment. Conclusion

Regardless of the reason, we both seem to agree that removing the L2 regularization I am using at the moment is a good idea, since:

You claim it does nothing except waste GPU cycles during training.

I claim it might hurt neural network performance.

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/Chicoryn/dream-go/issues/25#issuecomment-380651843, or mute the thread https://github.com/notifications/unsubscribe-auth/ALY5-0CmjFqeBbbxZpdDw5U009oICt1Zks5tnrcygaJpZM4R7nmu .

killerducky commented 6 years ago

@lightvector I think I followed what you wrote, my understanding of how BN works is very weak. Are there any take aways from this that could apply to some of the other projects like LZGo, minigo, LZChess? Minigo and LZChess are having trouble recently.

lightvector commented 6 years ago

In the context of LZ, I think generally having some amount of L2 as is the case right now, is probably good. Not because of regularization, but to maintain the learning rate. Because for LZ, you train the net essentially forever, even new nets you often use net2net to bootstrap. If you don't have any L2 loss (i.e. weight decay), then over time the norm of the weights will drift larger (e.g. a high-dimensional brownian motion will very reliably move away from from the origin proportional to sqrt(time)). This means your effective learning rate will drop over time**, which is bad because because as you are always receiving new and better data, you want to maintain a fixed and high learning rate. But with some amount of L2, you will reach an equilibrium where the outward drift is balanced by the inward decay, and therefore maintain your effective learning rate.

For fixed-data sets (e.g. one-shot training a policy net to convergence on a fixed set of pro games), I don't think there is any particular value in L2 with batchnorm, since you want to anneal your learning rate anyways. Except maybe it makes your learning rate easier to think about if you tune it so that the equilibrium is actually roughly equal to the scale of your weights that you initialize with, so that now the only factor that affects your effective learning rate is your literal learning rate, rather than also this subtle weight-growth phenomenon.

I'm not sure if this amounts to any particular takeaway for LZ other than what it's already doing. If you are curious, you can check if you reached equilibrium by simply printing out the norm of the weights every few million training steps and seeing if the weight norm is no longer changing much. LZ almost certainly has.

* If you're having trouble following, it's pretty simple. Consider a Z = batchnorm(Y) where Y = W X. Double the weights W. Then, Y is doubled. So batchnorm is now dividing by an extra factor of 2 to undo the doubling. So dZ/dY is cut in half. Therefore dZ/dW is cut in half. Also because of batchnorm, we know only relative changes to weights matter. If we perform one update W := W - learningratedZ/dW, then since W was doubled and dZ/dW is half the size, the relative* step is one quarter as large now. So effectively the learning rate has been divided by 4.

kblomdahl commented 6 years ago

I think you are correct, I mentioned that I started training a new network a few posts age and the results (so far) match your predictions. The loss was virtually the same in the beginning but it fails to keep up with the other networks towards the end, your explanation about the effect of L2 regularization on the learning rate would explain this behaviour.

A takeaway from this is also that I am not training long enough (sigh), since it still benefits from an increasing learning rate. I included a screenshot of the accuracy and loss of the different networks below, the network without L2 regularization is the pink line that is noticeably trailing behind towards the end:

screenshot from 2018-04-13 20-08-29

The pink line is hard to see in some of the charts because they all converge within a percentage or two of each other anyway, but the network without L2 regularization is performing worse in all of the metrics.

This raises some interesting questions for me, at the moment I do not use batch normalization nor L2 regularization on the weights in the value head and policy head (I forget if there is a good reason for this). This might be a mistake if we are saying that the combination of batch normalization + L2 regularization is effectively just a dynamic learning rate boost, since then the weights in those heads are missing out on some training towards the end.

PS: You idea about looking at the norm of the weights as an indication on whether the learning rate is too small or too large is an interesting one since my norms would suggests my learning rate schedule is not too great (the norms are monotonically decreasing) and it should be relatively trivial to implement a dynamic learning rate based on the previous (maybe a moving average) and the current norm of the weights.

A screenshot of the per-variable norms are included below, you can ignore the norm of the offsets (aka biases or β is the batch normalization formula), which do not have L2 regularization:

screenshot from 2018-04-13 20-28-47

kblomdahl commented 6 years ago

On an unrelated note, did you do any experiments with what percentage of games to "drop" the history inputs during? You mention 5-10% in your repository, but is there any experimental data on what percentage yielded the best results?

I am wondering because looking at the monte-carlo trees one can clearly observe that the neural network has a strong tendency to memorize sequences of play, which is not a desirable property. The search can still overrule the suggestions of the neural networks of course but that is a waste of GPU cycles if we can just fix the problem in the network instead. Example a self-play game where you can see this behaviour is attached (see the fake ko fight in the corner).

The game includes all variations considered by the search, so it is pretty big (even if I only did 1,600 rollouts):

full_playouts.sgf.zip

lightvector commented 6 years ago

Thanks for asking! No, I didn't test this. I don't have any value net or MCTS component yet, so it's hard for me to do experiments regarding what would improve search. My first project goal needs only policy (neural-net-aided explorations of human biases by players of different ranks), so that's what I've started with.

I chose around 10% because it was a simple conservative number large enough to get the behavior I wanted but small enough to be very unlikely to cause much worse prediction when history was still present. If I were to experiment in the future, it would probably be:

Try making it larger and seeing the effect on policy quality.
In a few test positions try providing history at only 0.5 weight (halfway between dropped and undropped), or other values. Because the neural net has data only at 0 and 1 but no data in between, there is a reasonable chance it will approximately linearly interpolate (note: linearly in logit space, rather than probability space), because I know from literature that relu-based nets often behave near-linearly outside of their domain of training. If so, this achieves the effect of interpolating between the history and no-history outputs for a given position, but only costing one evaluation of the net instead of two.
In search try various parameters of how strongly to interpolate from history to no-history. Also see whether it would be better to lean closer to no-history in handicap games against weaker players, to make the neural net more distrustful of the weaker player's moves, which are less likely to be threatening.

(Edit: One minor detail - with my input representation, the neural net can always determine ko legality regardless of history or no-history. I would make sure that this remains the case, because I don't want no-history to also blind the neural net as to what moves are legal in the first place in ko fights, the neural net should of course always be given enough info to determine legality of current move).

kblomdahl commented 6 years ago

Do you have a source on the linear interpolation behaviour of relu? I would be interested in reading more about that since I was skimming my list of features to make sure it could still determine whether a move / ko is legal or not without the history planes, when I noticed that according to previous experiments that I have done the "Am I black or white?" feature only affects the accuracy of the value head by about -2.4% (but it did not affect the policy head accuracy at all). These experiments are not very recent so a lot could have changed, but this suggests said feature is mainly used for komi.

The number of testing games* that that are won because of komi are up to 13.2% (depending on how you score draws), which suggests the feature doesn't do its job too well† but if it did then you could implement a dynamic komi between -7.5 and 7.5 by setting the "Am I black or white?" feature plane to 0.5 + komi / 7.5 assuming some sort of interpolation. Not sure if it would also extrapolate to a larger or smaller komi.

Professional games from some recent edition of GoGoD. † This might be because not all training games has the same komi, so the network gets confused.

We are still trying to fix the learning rate after dropping the L2 regularization, which is very time consuming. We figured we'll try one of the fancy automatic learning schedules just for the sake of it, but said learning schedule is proving unstable as my training input is a mixture between three different data-sets which can cause some bad mini-batches and as a consequence a very noisy loss:

Tygem 9D players
KGS 7d+
KGS 4d

The original reason for mixing different datasets was to expose the neural network to moves that may only appear during reading in higher ranked games, and especially with lower ranked players to get it to distrust the opponents moves. It is not obvious if it is necessary to do this anymore if we start messing around with the history features as it would achieve the same thing, and the first point doesn't actually make sense.

So we could probably swap to a more monotonic dataset that would result in a less noisy loss, and therefore train faster.

lightvector commented 6 years ago

Mostly my intuitions about linearity come from a few papers like https://arxiv.org/pdf/1412.6572.pdf showing behavior in extrapolation that is linear-like (although later papers point to linearity of behavior as very much not being the sole factor in the existence of adversarial examples). Also posts like http://mlg.eng.cam.ac.uk/yarin/blog_3d801aa532c1ce.html that show that there it is possible to give a Bayesian interpretation to some of what neural nets do, and the fact that if along a particular dimension you only have two data points (the value at history = 0 and the value at history = 1), under many sensible priors your posterior will look like a linear interpolation.

I have not tested it, so this could be wrong. Maybe in reality it varies in a wiggly or jumpy fashion as you go from 0 to 1, in a way that unpredictably varies between differently-initialized nets trained on the same data. I expect that even if it is approximately linear, it is not exactly so, and will randomly wiggle and curve depending on the position.

I would NOT rely such an interpolation to handle a variety of komi when training a value net under an AlphaZero-like process. Instead I would just have my self-play games generated with a large variety of different komi, telling it what the komi was in each case. This is much more likely to be reliable, since rather than praying that the neural net interpolates correctly to 0.5 komi when only given -7.5 komi and 7.5 komi examples, you simply train it directly on 0.5 komi games, or 3.5 komi games, etc. Of course, you are unlikely to find such games from human data sets that don't suffer from bias (e.g. 0.5 komi games mean that White was a higher-ranked player), so this does require that you generate your data set yourself.

The reason I had such a crazy hack in mind to try testing for the history feature that I would definitely not want to use for komi was because I don't know a reasonable way where you can train with only "halfway providing" the correct history. You can train with only providing half the komi, because that actually changes the nature of your training data (affecting whether some positions are winning and losing), but I don't know how you do that with history. Either you provide history, or you don't.

You could try providing it with noise, for example making it 50% likely to be an incorrect history versus a correct one, but that seems very hard to do well because unless your noise distribution is highly plausible among histories that could have led to this position, the neural net will probably just learn to distinguish incorrect histories from correct ones, and then it will know mostly whether to pay attention to it or not.

kblomdahl commented 6 years ago

Re-reading Explaning and harnessing adversarial examples suggest to me that the interpolation is only locally linear around each known value, hence the ε. So for real valued features the interpolation will probably be fairly smooth but it is not obvious how it would behave for binary input features [1]. So using it for komi would almost certainly not work unless we provide it with actual examples as you suggest.

Never the less, my survey of articles about adversarial examples has further convinced me that perturbing the history features is actually very important for real-life performance. I will train the following networks, when I've finished re-tuning the learning rate, using the d-128-2-3 network as a base:

Keep the history features as is (baseline)
Set the history features to the identity state 10% of the time.
Set the history features to the identity state 50% of the time.
Set the history features to the identity state 100% of the time.
Shuffling the order of the history features 10% of the time.
Shuffling the order of the history features 50% of the time.
Shuffling the order of the history features 100% of the time.

You have to be a little bit careful when shuffling to avoid feeding the neural network rubbish, which would probably cause it to just ignore those features. So I suggest the following limitations:

Only shuffle black moves with black moves, and vice versa, so the players still alternate.
Only shuffle between captures, this should preserve ko fight logic and avoid illegal moves.

You could go further and provide similar random positions from other games, but I am not sure there is any point since that will not occur in practice.

[1] http://colinraffel.com/publications/iclr2018thermometer.pdf

Alternative solution to the history problem could be done using adversarial networks as suggested by Stijn Tonk [2]. Where we would, as part of the training, train an additional network that given some policy (and some additional data like the current board position?) tries to predict the previous move, we then include the loss of this adversarial networks during training.

Unclear how well this would fit to Go, but just noting it here in case I want to look into it later.

[2] Stijn Tonk, https://blog.godatadriven.com/fairness-in-ml

These are the experimental results of setting the occupied vertices in the history planes to different constants instead of 1.0. Note that we do not use one-hot encoding of the history planes, we use the AlphaGo representation which will probably skew these results, and without training the network using the methods described above it may just give essentially random results:

dg-h-000 sets the occupied vertices to 0.0. Example Game
dg-h-010 sets the occupied vertices to 0.1. Example Game
dg-h-025 sets the occupied vertices to 0.25. Example Game
dg-h-050 sets the occupied vertices to 0.5. Example Game
dg-h-075 sets the occupied vertices to 0.75. Example Game
dg-h-100 sets the occupied vertices to 1.0. Example Game

You can make some interesting observations from the example games (played with the --self-play option so the the entire games, especially the openings, are pretty random):

The networks with weak history features play super passive.
The networks with weak history features plays the initial move of some josekis, but does not follow up correctly (or at all).

I also played a blitz tournament between the top 3 engines dg-h-050, dg-h-075, and dg-h-100. The results are very weird to me, since dg-h-100 ends up losing to all of the other engines by a pretty wide margin:

dg-h-050 v dg-h-075 (21/100 games)
unknown results: 3 14.29%
board size: 19   komi: 7.5
           wins              black        white       avg cpu
dg-h-050     11 52.38%       4 36.36%     7  70.00%     11.03
dg-h-075      7 33.33%       2 20.00%     5  45.45%     11.11
                             6 28.57%     12 57.14%

dg-h-050 v dg-h-100 (20/100 games)
board size: 19   komi: 7.5
           wins               black          white        avg cpu
dg-h-050     20 100.00%       10 100.00%     10 100.00%      9.33
dg-h-100      0   0.00%       0    0.00%     0    0.00%      9.13
                              10  50.00%     10  50.00%

dg-h-075 v dg-h-100 (20/100 games)
unknown results: 2 10.00%
board size: 19   komi: 7.5
           wins              black         white      avg cpu
dg-h-075     16 80.00%       9  90.00%     7 70.00%      9.92
dg-h-100      2 10.00%       2  20.00%     0  0.00%     10.24
                             11 55.00%     7 35.00%

gomill-h-blitz.ctl.zip

I cancelled this blitz tournament after 20 games in each match-up since the results were pretty clear (and very weird), and instead re-started a tournament with 1,600 rollouts to check if this also affects the value head. I will post the results when they are done, said tournament should take a day or so to complete.

lightvector commented 6 years ago

Out of curiosity, I tested this now. I checked what happens if I feed in the history planes at 0.5 weight in my neural net. It uses one-hot indicators of the location of the previous several moves (each move in its own channel) instead of the AlphaGo representation, and as I mentioned before, was specifically trained to behave well at both 0 (history absent) and 1 (history present). Informally, I looked by hand at the colored prediction heatmaps of several dozen positions over several games, comparing between history plane at 0, 0.5, and 1.0 weight.

It definitely does interpolate. In every case I looked at by hand, the resulting 0.5 heatmap looked reasonable and was roughly "in between" the 0 and 1 heatmaps. I did not find any example where it did anything crazy or significantly non-interpolation-like.

However, it definitely was NOT a linear or consistent interpolation. While sometimes the 0.5 heatmap was pretty close to an average of the 0 and 1 heatmaps, sometimes also it was much closer to either the 0 or the 1 heatmaps alone, e.g. more like 0.1 [0 heatmap] + 0.9 [1 heatmap]. Also, it was not always uniform between moves on the board - sometimes move as you went 0 -> 0.5 -> 1, move A would light strongly from 0->0.5, and B only a little, and then from 0.5 -> 1, move A would only light up a bit more, while B would light strongly. This seemed more common when A and B were far apart on the board and/or differed in whether they were next to the last move or not.

Still, often it did give something clearly in-between. So, pretty interesting! :)

History

Half-history

No-history

kblomdahl commented 6 years ago

I completely forgot about the interpolation behavior as I was so surprised at the tournament results. No history in this case is dg-h-000, so the history features are set to zero, which is very different from what it was trained on. The identity version is with all of the history features set to the current board position:

History: dg-h-100-intp

Half-history: dg-h-050-intp

No-history: dg-h-000-intp

No-history (identity): dg-h-id-intp

Results from the tournament with playouts (1,600) looks as I expected the blitz games too look. The history features seems to contribute to an increased network strength. Since this is the opposite of the blitz results the obvious conclusion is that the value head collapse without the history features but the policy head can keep going.

It is unclear if the value head should benefit from the history features, but since we only compute a single shared representation for both the policy and value head there is not much choice for the value head other then using the history features if they allow for a lower net loss in the policy head:

dg-h-050 v dg-h-075 (100/100 games)
unknown results: 16 16.00%
board size: 19   komi: 7.5
           wins              black         white       avg cpu
dg-h-050      3  3.00%       3   6.00%     0   0.00%    489.50
dg-h-075     81 81.00%       42 84.00%     39 78.00%    467.57
                             45 45.00%     39 39.00%

dg-h-050 v dg-h-100 (100/100 games)
unknown results: 14 14.00%
board size: 19   komi: 7.5
           wins              black         white       avg cpu
dg-h-050      9  9.00%       3   6.00%     6  12.00%    527.77
dg-h-100     77 77.00%       37 74.00%     40 80.00%    489.52
                             40 40.00%     46 46.00%

dg-h-075 v dg-h-100 (100/100 games)
unknown results: 14 14.00%
board size: 19   komi: 7.5
           wins              black         white       avg cpu
dg-h-075     27 27.00%       13 26.00%     14 28.00%    456.90
dg-h-100     59 59.00%       31 62.00%     28 56.00%    457.94
                             44 44.00%     42 42.00%

gomill-h-1600.ctl.zip

alreadydone commented 6 years ago

Just letting you know that I implemented your idea in https://github.com/Chicoryn/dream-go/issues/25#issuecomment-381417656 for Leela Zero at https://github.com/gcp/leela-zero/issues/1599 together with dynamic komi. It was pointed out by @TFiFiE that the formula 0.5 + komi / 7.5 should be 0.5 + komi / 15.0.

kblomdahl commented 6 years ago

@alreadydone That is very cool. I've not read your entire thread but your implementation seems to be working much better than I expected when I coined the original concept.

As for the monotonicity of different networks, my intuition says this has to do with overfitting of the network to minor correlations in the training data, which could be due to too little training data or too low of a learning rate (or other reasons, this is not a solved problem). If you have a lot of spare GPU cycles you might want to consider training a robust network [1] using something like PGD, which should avoid many of the local maximums that breaks the monotonicity as they are effectively adversarial examples to your network.

[1] Towards Deep Learning Models Resistant to Adversarial Attack