Closed kblomdahl closed 5 years ago
All these enhancements are for the policy network, which is generally doing pretty well even without these tweaks. I would be more interested in seeing if this improves the value head, in which you can observe the following issue consistently:
Following the logic posted in lightvector's repository, one could reasonably expect Parametric ReLUs and Global Pooling Properties to have little effect on this problem. However, Chain Pooling might greatly help with this issue.
To help with the performance impact, one could limit the chain pooling to the final convolutional layer in the value head with input x:
y₀ ← max chain pooling of x
y₁ ← relu(bn(C(W₁, [x y₀])) + b₁)
y₂ ← relu(W₂ y₁ + b₂)
y₃ ← tanh(W₃ y₂ + b₃)
This is only one layer of chain pooling so we expect it to have issues identifying problems with loosely connected groups or certain types of false eyes. But it would probably help in most situations.
To implement this we would have to write a custom CUDA kernel, which could probably be done relatively easily using some flood filling technique:
// data[i, c, p] is shorthand for the element at batch i, channel c, point p
int global_index = threadIdx.x;
int dirty;
do {
    if (global_index == 0) {
        *is_dirty = 0;
    }
    __syncthreads(); // barrier: the cleared flag is visible to every thread
    for (int i = 0; i < batch_size; ++i) {
        for (int c = 0; c < num_channels; ++c) {
            const float original = data[i, c, global_index];
            if (chain[global_index] == chain[N[global_index]])
                data[i, c, global_index] = max(data[i, c, global_index], data[i, c, N[global_index]]);
            if (chain[global_index] == chain[E[global_index]])
                data[i, c, global_index] = max(data[i, c, global_index], data[i, c, E[global_index]]);
            if (chain[global_index] == chain[S[global_index]])
                data[i, c, global_index] = max(data[i, c, global_index], data[i, c, S[global_index]]);
            if (chain[global_index] == chain[W[global_index]])
                data[i, c, global_index] = max(data[i, c, global_index], data[i, c, W[global_index]]);
            if (original < data[i, c, global_index])
                *is_dirty = 1;
        }
    }
    __syncthreads(); // barrier: every thread has finished this sweep
    dirty = *is_dirty;
    __syncthreads(); // barrier: every thread has read the flag before thread 0 clears it
} while (dirty > 0);
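A CPU reference can be handy for checking a kernel like the one above. The following is only a sketch in Python, for a single channel of a single board: the function name, the flat row-major layout and the explicit width and height arguments are assumptions made for illustration, not part of the actual implementation. It flood fills each chain once and writes the chain maximum to every member, which is the fixed point the iterative kernel should converge to.

```python
from collections import deque


def max_chain_pool(data, chain, width, height):
    """Replace every value with the maximum over its chain.

    data  -- flat, row-major list of values for one channel (assumed layout)
    chain -- flat list of chain ids; adjacent points with equal ids are connected
    """
    out = list(data)
    seen = [False] * (width * height)
    for start in range(width * height):
        if seen[start]:
            continue
        # Flood fill the chain that contains `start`.
        seen[start] = True
        members, queue = [], deque([start])
        while queue:
            p = queue.popleft()
            members.append(p)
            x, y = p % width, p // width
            for nx, ny in ((x, y - 1), (x + 1, y), (x, y + 1), (x - 1, y)):
                if 0 <= nx < width and 0 <= ny < height:
                    q = ny * width + nx
                    if not seen[q] and chain[q] == chain[start]:
                        seen[q] = True
                        queue.append(q)
        # Every member of the chain receives the chain maximum.
        best = max(data[m] for m in members)
        for m in members:
            out[m] = best
    return out
```

Unlike the kernel, this visits each point a constant number of times, so it is only meant as a correctness oracle and says nothing about GPU performance.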
Lightvector updated his blog with some more results. Of special interest is the fact that adding dilation has a similar effect to adding chain pooling [1], but since you only apply dilation over some of the channels, local shape information is not lost. This is very promising since dilation is built into the convolutional operator in most frameworks, including cuDNN.
The problem is that according to the cuDNN documentation, only the ImplicitPrecompGemm algorithm supports dilation > 1, and based on previous benchmarks Winograd provides a 4-5x performance improvement. If we want to run two parallel convolutions, one with a dilation of 1 (normal convolution) and one with a dilation of 2 (or larger), with no performance loss, then the latter filter must be 4-5 times smaller than the former. So if we want a total of c input and output channels, then we can only reserve c / 5.5 channels for dilation.
For some common channel counts this gives the split below (rounded down to the closest multiple of 8 for SIMD purposes). We want some balance between local and global shape anyway, so this might provide a good mixture between the two:
Channels | Normal | Dilation |
---|---|---|
128 | 112 | 16 |
192 | 160 | 32 |
256 | 216 | 40 |
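The arithmetic behind the table can be reproduced with a few lines of Python. This is a sketch under one assumption: the slowdown of the dilated convolution is taken as 4.5x (the midpoint of the quoted 4-5x), which is what yields the c / 5.5 divisor. The function and parameter names are made up for illustration.

```python
def split_channels(total, slowdown=4.5, multiple=8):
    """Return (normal, dilated) channel counts for `total` channels.

    slowdown -- assumed cost ratio of the dilated to the Winograd convolution
    multiple -- the dilated count is rounded down to a multiple of this
    """
    dilated = int(total / (1.0 + slowdown)) // multiple * multiple
    return total - dilated, dilated


for total in (128, 192, 256):
    normal, dilated = split_channels(total)
    print(total, normal, dilated)
```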
Lightvector's observations about history [2] are also very interesting, though not for the reason he mentions. His suggestion to zero out the history channels randomly can act as a training data augmentation, which should help with potential overfitting.
I am particularly interested in this because I've observed the same behaviour he cites, where the neural network learns sequences of moves instead of judging each individual board position separately. It is understandable why it does so (humans do this too), but it is not a desirable property and I've been considering getting rid of the history features completely to avoid this. Unfortunately the history features are very important, so this might provide a reasonable in-between.
One could even take this a step further: if one were to use one-hot encodings of the history planes, then one could shuffle the history planes (with care, to avoid illegal board positions) in order to provide a sort of tewari-like effect.
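As a rough illustration of the zeroing idea, here is a minimal Python sketch. The feature layout (a list of flat planes), the choice to drop all history planes together, and the 50% default probability are all assumptions made for the example; neither this thread nor [2] prescribes them here.

```python
import random


def drop_history(planes, history_indices, drop_prob=0.5, rng=random):
    """Return a copy of `planes` where, with probability `drop_prob`,
    every plane listed in `history_indices` is replaced by zeros."""
    if rng.random() >= drop_prob:
        # Keep the example unchanged (but still return a copy).
        return [list(p) for p in planes]
    drop = set(history_indices)
    return [[0.0] * len(p) if i in drop else list(p) for i, p in enumerate(planes)]
```

Applied per training example, the network sees the same position both with and without history, which is what makes this behave like a data augmentation.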
[1] https://github.com/lightvector/GoNN#dilated-convolutions-mar-2018
[2] https://github.com/lightvector/GoNN#some-thoughts-about-history-as-an-input-mar-2018
With a naive cuDNN implementation the performance hit is quite significant. Even when computing the normal and dilated convolution in parallel, the neural network with dilation is about 36% slower.
However, even a network that was only trained for a few hours passes one of the dead dragon tests (and in the other two the neural network is less certain about who is winning than it has historically been). No other network has managed this to date, so even without any performance improvements I might be able to squeeze out, dilation is probably worth it:
running 7 tests
test ladder_2 ... ok
test ladder_3 ... ok
test ladder_1 ... FAILED
test dead_dragon_1 ... ok
test dead_dragon_2 ... FAILED
test dead_dragon_3 ... FAILED
test end_1 ... ok
For reference, this network trained for 5,901 steps (using a batch size of 512) and achieved a 40.27% policy accuracy and a 59.96% value accuracy after about 1 hour and 17 minutes of training. Based on previous experience these numbers should improve significantly with more training.
Output from nvprof of my implementation. Some observations when comparing to a profile from before we started using dilations:
In the API call section, cudaLaunch has increased from 6.94% to 12.01%, suggesting the overhead associated with launching kernels is a problem. Maybe it is time to write a fused kernel for each residual block.
We spent about the same amount of time on the Winograd calculations in the two profiles (22.9687s vs 22.1559s). This suggests that we did not properly adjust the number of channels to account for the tile size of Winograd.
The concatenation adds about 2.78% overhead, which is acceptable even if we would prefer to avoid it if possible.
==3041== Profiling result:
Time(%) Time Calls Avg Min Max Name
45.02% 22.1559s 83793 264.41us 26.880us 8.1167ms maxwell_scudnn_winograd_128x128_ldg1_ldg4_relu_tile418n_nt
28.89% 14.2193s 48654 292.25us 137.64us 4.0314ms void cudnn::detail::implicit_convolve_sgemm<float, float, int=128, int=5, int=5, int=3, int=3, int=3, int=1, bool=1, bool=0, bool=0>(int, int, int, float const *, int, float*, cudnn::detail::implicit_convolve_sgemm<float, float, int=128, int=5, int=5, int=3, int=3, int=3, int=1, bool=1, bool=0, bool=0>*, kernel_conv_params, int, float, float, int, float, float, int, int)
6.23% 3.06499s 18909 162.09us 131.03us 1.8853ms void cudnn::detail::implicit_convolve_sgemm<float, float, int=1024, int=5, int=5, int=3, int=3, int=3, int=1, bool=1, bool=0, bool=0>(int, int, int, float const *, int, float*, cudnn::detail::implicit_convolve_sgemm<float, float, int=1024, int=5, int=5, int=3, int=3, int=3, int=1, bool=1, bool=0, bool=0>*, kernel_conv_params, int, float, float, int, float, float, int, int)
4.34% 2.13685s 57038 37.463us 13.143us 1.7758ms maxwell_scudnn_winograd_128x128_ldg1_ldg4_relu_tile148n_nt
3.90% 1.92040s 72104 26.633us 4.9370us 1.7498ms void op_tensor_kernel<int=2, float, float, float, int=32, int=1, int=4, int=2, cudnnOpTensorOp_t=2, cudnnNanPropagation_t=0>(cudnnTensorStruct, float*, cudnnTensorStruct, float const *, cudnnTensorStruct, float const *, float, float, float)
2.90% 1.42483s 140831 10.117us 5.0360us 1.4300ms void cudnn::winograd::generateWinogradTilesKernel<int=0, float, float>(cudnn::winograd::GenerateWinogradTilesParams<float, float>)
2.78% 1.36813s 59454 23.011us 6.2130us 1.8411ms void cudnn::detail::cubeTransposeDeviceGeneric_kernel<float, float, float, int=8, int=8, int=8, int=8, int=8, int=11, int=5>(int, int, int, int, int, int, int, int, int, float, float const *, float*)
1.12% 549.89ms 64317 8.5490us 621ns 144.96us [CUDA memcpy HtoD]
1.05% 514.99ms 105078 4.9010us 3.0140us 469.83us void op_tensor_kernel<int=2, float, float, float, int=128, int=1, int=1, int=4, cudnnOpTensorOp_t=2, cudnnNanPropagation_t=0>(cudnnTensorStruct, float*, cudnnTensorStruct, float const *, cudnnTensorStruct, float const *, float, float, float)
0.87% 427.14ms 48028 8.8930us 4.6970us 566.26us void op_tensor_kernel<int=2, float, float, float, int=64, int=1, int=2, int=4, cudnnOpTensorOp_t=2, cudnnNanPropagation_t=0>(cudnnTensorStruct, float*, cudnnTensorStruct, float const *, cudnnTensorStruct, float const *, float, float, float)
0.75% 368.82ms 15014 24.565us 15.042us 500.34us maxwell_scudnn_128x32_relu_interior_nn
0.69% 338.68ms 8109 41.766us 38.356us 526.71us void genericTranspose_kernel<float, float, float>(float, cudnnTensorStruct, float const *, float, cudnnTensorStruct, float*)
0.30% 147.44ms 1802 81.817us 49.605us 115.07us maxwell_scudnn_128x128_relu_small_nn
0.18% 90.139ms 7806 11.547us 6.8670us 140.41us void gemmSN_NN_kernel<float, float, float, int=128, int=2, int=4, int=8, int=4, int=4>(cublasGemmSmallNParams<float, float, float>, float const *, float const *, float, float, int)
0.18% 90.130ms 22521 4.0020us 2.2430us 9.8300us void add_tensor_kernel_v3<int=2, float, float, int=16, int=16, int=1, int=16, int=4>(cudnnTensorStruct, float*, cudnnTensorStruct, float const *, float, float)
0.16% 77.102ms 71600 1.0760us 436ns 129.61us [CUDA memcpy DtoH]
0.13% 64.344ms 3903 16.485us 13.473us 411.39us void gemmSN_NN_kernel<float, float, float, int=256, int=4, int=2, int=8, int=4, int=4>(cublasGemmSmallNParams<float, float, float>, float const *, float const *, float, float, int)
0.11% 52.684ms 7507 7.0170us 4.2180us 1.3240ms void cudnn::detail::softmax_fw_kernel_resident<int=2, float, float, int=256, int=1, int=0, int=0, int=32, int=0>(cudnnTensorStruct, float const *, cudnn::detail::softmax_fw_kernel_resident<int=2, float, float, int=256, int=1, int=0, int=0, int=32, int=0>, float*, int, float, float*, int, int)
0.08% 41.624ms 4202 9.9050us 6.6090us 13.739us void gemmSN_NN_kernel<float, float, float, int=128, int=2, int=4, int=8, int=2, int=4>(cublasGemmSmallNParams<float, float, float>, float const *, float const *, float, float, int)
0.07% 35.603ms 16816 2.1170us 1.4370us 4.9380us cudnn::maxwell::gemm::computeOffsetsKernel(cudnn::maxwell::gemm::ComputeOffsetsParams)
0.07% 32.421ms 1806 17.951us 7.1290us 26.356us sgemm_32x32x32_NN
0.06% 29.867ms 2101 14.215us 13.186us 15.316us void gemmSN_NN_kernel<float, float, float, int=256, int=4, int=2, int=8, int=2, int=4>(cublasGemmSmallNParams<float, float, float>, float const *, float const *, float, float, int)
0.05% 23.351ms 7507 3.1100us 2.5180us 158.30us void cudnn::detail::activation_fw_4d_kernel<float, float, int=128, int=1, int=4, cudnn::detail::tanh_func<float>>(cudnnTensorStruct, float const *, cudnn::detail::activation_fw_4d_kernel<float, float, int=128, int=1, int=4, cudnn::detail::tanh_func<float>>, cudnnTensorStruct*, float, cudnnTensorStruct*, int, cudnnTensorStruct*)
0.04% 21.188ms 7507 2.8220us 2.3150us 9.3850us void cudnn::detail::activation_fw_4d_kernel<float, float, int=128, int=1, int=4, cudnn::detail::relu_func<float, cudnnNanPropagation_t=0, bool=0>>(cudnnTensorStruct, float const *, cudnn::detail::activation_fw_4d_kernel<float, float, int=128, int=1, int=4, cudnn::detail::relu_func<float, cudnnNanPropagation_t=0, bool=0>>, cudnnTensorStruct*, float, cudnnTensorStruct*, int, cudnnTensorStruct*)
0.03% 16.783ms 2703 6.2080us 2.4880us 12.723us void gemv2N_kernel_val<float, float, float, int=128, int=32, int=4, int=4, int=1>(float, float, cublasGemv2Params_v2<float, float, float>)
0.00% 4.5730us 7 653ns 513ns 943ns [CUDA memset]
==3041== API calls:
Time(%) Time Calls Avg Min Max Name
81.47% 35.8463s 135561 264.43us 5.7860us 1.37762s cudaMemcpyAsync
12.01% 5.28529s 743193 7.1110us 3.6370us 1.34247s cudaLaunch
1.60% 701.94ms 56 12.535ms 8.0670us 701.29ms cudaStreamCreateWithFlags
1.38% 605.11ms 435406 1.3890us 701ns 1.4727ms cudaEventRecord
1.02% 450.81ms 4647754 96ns 69ns 323.22us cudaSetupArgument
0.59% 259.18ms 285266 908ns 739ns 316.25us cudaStreamWaitEvent
0.45% 197.32ms 605 326.14us 743ns 188.05ms cudaFree
0.43% 188.54ms 67563 2.7900us 2.5210us 291.60us cudaBindTexture
0.42% 183.41ms 391 469.09us 3.3660us 174.74ms cudaMalloc
0.21% 91.363ms 743193 122ns 74ns 303.63us cudaConfigureCall
0.20% 85.856ms 760009 112ns 67ns 1.4508ms cudaGetLastError
0.16% 69.804ms 67563 1.0330us 943ns 287.59us cudaUnbindTexture
0.05% 21.321ms 15014 1.4200us 951ns 306.74us cudaStreamSynchronize
0.02% 7.0849ms 356 19.901us 3.4990us 359.56us cudaMemcpy
0.01% 2.5367ms 147 17.256us 7.9430us 188.10us cudaStreamCreate
0.00% 2.0855ms 210 9.9300us 1.5840us 247.39us cudaStreamDestroy
0.00% 943.45us 277 3.4050us 254ns 138.42us cuDeviceGetAttribute
0.00% 709.33us 7 101.33us 15.754us 603.54us cudaHostAlloc
0.00% 467.68us 7 66.811us 7.8000us 405.69us cudaFreeHost
0.00% 363.21us 483 751ns 583ns 3.0420us cudaEventDestroy
0.00% 269.90us 287 940ns 769ns 5.0480us cudaEventCreateWithFlags
0.00% 228.96us 3 76.321us 73.634us 80.725us cuDeviceTotalMem
0.00% 202.74us 196 1.0340us 824ns 2.2250us cudaEventCreate
0.00% 153.88us 263 585ns 508ns 1.7050us cudaDeviceGetAttribute
0.00% 107.14us 3 35.713us 31.368us 40.518us cuDeviceGetName
0.00% 66.623us 7 9.5170us 7.7900us 12.701us cudaStreamCreateWithPriority
0.00% 48.792us 28 1.7420us 1.6170us 2.1620us cudaThreadSynchronize
0.00% 43.996us 7 6.2850us 5.4910us 7.5090us cudaMemsetAsync
0.00% 24.781us 21 1.1800us 601ns 1.8960us cudaGetDevice
0.00% 15.818us 7 2.2590us 1.8510us 3.6860us cudaDeviceSynchronize
0.00% 10.316us 7 1.4730us 1.1910us 2.5770us cudaHostGetDevicePointer
0.00% 9.1960us 7 1.3130us 1.0590us 1.5700us cudaDeviceGetStreamPriorityRange
0.00% 2.4040us 5 480ns 273ns 1.0940us cuDeviceGetCount
0.00% 2.0860us 5 417ns 292ns 506ns cuDeviceGet
0.00% 1.8050us 2 902ns 789ns 1.0160us cuInit
0.00% 899ns 2 449ns 445ns 454ns cuDriverGetVersion
0.00% 123ns 1 123ns 123ns 123ns cudaRuntimeGetVersion
Out of curiosity, is it faster if instead of concatenating you add after the next convolution? Using the following identity or similar: conv3x3(concat(x,y), [x_channels+y_channels, output_channels]) = conv3x3(x, [x_channels, output_channels]) + conv3x3(y, [y_channels, output_channels])
It's probably worse to do it this way because the next convolution ends up split up so you lose benefits of greater 'batching', but if concat is particularly expensive for some reason then there's an off-chance it's better.
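The identity above follows from the convolution being linear over input channels. A small pure-Python check (a sketch, not how cuDNN computes anything; the helper names and the zero-padding convention are assumptions for illustration) makes this concrete:

```python
def conv3x3(x, w):
    """Zero-padded 3x3 convolution.

    x -- input indexed as x[channel][row][col]
    w -- weights indexed as w[out_channel][in_channel][3][3]
    """
    height, width = len(x[0]), len(x[0][0])
    out = [[[0] * width for _ in range(height)] for _ in w]
    for o, w_o in enumerate(w):
        for c, w_oc in enumerate(w_o):
            for r in range(height):
                for q in range(width):
                    acc = 0
                    for dr in (-1, 0, 1):
                        for dq in (-1, 0, 1):
                            rr, qq = r + dr, q + dq
                            if 0 <= rr < height and 0 <= qq < width:
                                acc += w_oc[dr + 1][dq + 1] * x[c][rr][qq]
                    out[o][r][q] += acc
    return out


def split_weights(w, x_channels):
    """Split weights along the input-channel axis into the x part and the y part."""
    return [w_o[:x_channels] for w_o in w], [w_o[x_channels:] for w_o in w]


def add(a, b):
    """Element-wise sum of two tensors with identical shapes."""
    return [[[p + q for p, q in zip(ra, rb)] for ra, rb in zip(ca, cb)]
            for ca, cb in zip(a, b)]
```

With `w_x, w_y = split_weights(w, len(x))`, the result of `conv3x3(x + y, w)` (where `x + y` is list concatenation along the channel axis) equals `add(conv3x3(x, w_x), conv3x3(y, w_y))`. Note that any non-linearity has to be applied after the sum for the identity to hold.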
I will try it, since my current implementation does not really have good memory access patterns due to the concatenation forcing me to temporarily re-write them as CNHW. Your re-formulation would allow us to keep NCHW the entire way, at the expense of more kernel launches. The sum of convolutions is also really fast to calculate since cuDNN fuses that into the convolution kernel (by allowing you to blend the input and output arrays).
I think I also screwed up my SIMD multiplier, since if one looks at the runtime of a Winograd kernel over different numbers of output channels you can see some clear bumps on the graph where the number of output channels is a multiple of 32:
If you are curious, my dilation implementation at the moment is pretty much the following. Notice how c, d, and y all need to either read or write using a sub-optimal memory layout:
c <- transpose(conv3x3(x, [in_channels, c_channels]), [1, 0, 2, 3]) # as a fused op by specifying the strides of c to the transpose
d <- transpose(conv3x3(x, [in_channels, d_channels]), [1, 0, 2, 3]) # as a fused op by specifying the strides of d to the transpose
y <- transpose(c ++ d, [1, 0, 2, 3]) # list concatenation (fused since c and d is in continuous memory) followed by transpose
... # continue as normal with y
For the sake of transparency, these are the benchmark numbers before I added dilation:
test batch_size_01 ... bench: 810,021 ns/iter (+/- 55,133)
test batch_size_02 ... bench: 1,159,539 ns/iter (+/- 49,392)
test batch_size_04 ... bench: 1,675,931 ns/iter (+/- 60,894)
test batch_size_08 ... bench: 3,327,576 ns/iter (+/- 90,876)
test batch_size_16 ... bench: 6,670,638 ns/iter (+/- 364,274)
test batch_size_32 ... bench: 13,517,297 ns/iter (+/- 509,426)
test batch_size_64 ... bench: 27,505,117 ns/iter (+/- 929,449)
These are the current benchmark numbers using the algorithm described in the previous section:
test batch_size_01 ... bench: 3,174,957 ns/iter (+/- 281,523)
test batch_size_02 ... bench: 2,547,549 ns/iter (+/- 185,185)
test batch_size_04 ... bench: 2,984,062 ns/iter (+/- 280,443)
test batch_size_08 ... bench: 4,251,221 ns/iter (+/- 349,213)
test batch_size_16 ... bench: 8,162,312 ns/iter (+/- 750,470)
test batch_size_32 ... bench: 16,532,217 ns/iter (+/- 3,610,558)
test batch_size_64 ... bench: 33,221,823 ns/iter (+/- 3,071,336)
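To put the two listings above side by side, the relative slowdown per batch size can be computed directly from the reported ns/iter values. This is just a worked-arithmetic sketch; the dictionary and function names are made up for the example, and the numbers are copied from the benchmark output above.

```python
# ns/iter copied from the two benchmark listings, keyed by batch size.
before = {1: 810_021, 2: 1_159_539, 4: 1_675_931, 8: 3_327_576,
          16: 6_670_638, 32: 13_517_297, 64: 27_505_117}
after = {1: 3_174_957, 2: 2_547_549, 4: 2_984_062, 8: 4_251_221,
         16: 8_162_312, 32: 16_532_217, 64: 33_221_823}


def slowdown(before, after):
    """Relative increase in time per iteration, e.g. 0.25 means 25% slower."""
    return {n: after[n] / before[n] - 1.0 for n in before}


for n, s in slowdown(before, after).items():
    print(f"batch {n:2d}: {s:+.1%}")
```

The small batch sizes are hit much harder than the large ones, which fits the observation that kernel launch overhead is a significant part of the cost.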
I finished my mock-up implementation of the two ideas mentioned above and they are of mixed success. Changing the channel count was slightly better, while avoiding the concatenation does not seem to be worth it (probably due to the lack of a good fused kernel and the loss of batching).
Changing the channel count gave a performance improvement of about 6%, so nothing groundbreaking but a solid improvement:
test batch_size_01 ... bench: 2,582,608 ns/iter (+/- 85,706)
test batch_size_02 ... bench: 2,466,245 ns/iter (+/- 91,409)
test batch_size_04 ... bench: 2,885,751 ns/iter (+/- 156,826)
test batch_size_08 ... bench: 4,103,070 ns/iter (+/- 199,810)
test batch_size_16 ... bench: 7,735,140 ns/iter (+/- 267,049)
test batch_size_32 ... bench: 15,522,847 ns/iter (+/- 711,294)
test batch_size_64 ... bench: 31,670,363 ns/iter (+/- 571,301)
For the sake of completeness, this is a trace of the CUDA calls performed during a single residual block (for a batch size of 256). You can clearly see the issue: the normal and the dilated convolution both take the same amount of time, so adding dilation effectively increased the amount of work during each residual block from 2 to 3 convolutions. This matches the observations elsewhere, as it would predict a 33% performance loss:
Start Duration Grid Size Block Size Regs* SSMem* DSMem* Size Throughput Device Context Stream Name
...
100.850s 1.1007ms (722 1 1) (8 8 1) 55 2.2500KB 0B - - GeForce GTX 108 1 275 void cudnn::detail::implicit_convolve_sgemm<float, float, int=128, int=5, int=5, int=3, int=3, int=3, int=1, bool=1, bool=0, bool=0>(int, int, int, float const *, int, float*, cudnn::detail::implicit_convolve_sgemm<float, float, int=128, int=5, int=5, int=3, int=3, int=3, int=1, bool=1, bool=0, bool=0>*, kernel_conv_params, int, float, float, int, float, float, int, int) [7493139]
100.850s 44.693us (3 32 1) (32 4 1) 40 8.5000KB 0B - - GeForce GTX 108 1 274 void cudnn::winograd::generateWinogradTilesKernel<int=0, float, float>(cudnn::winograd::GenerateWinogradTilesParams<float, float>) [7493102]
100.850s 1.2420ms (48 10 2) (256 1 1) 128 48.000KB 0B - - GeForce GTX 108 1 274 maxwell_scudnn_winograd_128x128_ldg1_ldg4_relu_tile418n_nt [7493106]
100.852s 51.348us (16 32 1) (32 1 4) 32 0B 0B - - GeForce GTX 108 1 275 void op_tensor_kernel<int=2, float, float, float, int=32, int=1, int=4, int=2, cudnnOpTensorOp_t=2, cudnnNanPropagation_t=0>(cudnnTensorStruct, float*, cudnnTensorStruct, float const *, cudnnTensorStruct, float const *, float, float, float) [7493152]
100.852s 97.768us (16 96 1) (32 1 4) 32 0B 0B - - GeForce GTX 108 1 274 void op_tensor_kernel<int=2, float, float, float, int=32, int=1, int=4, int=2, cudnnOpTensorOp_t=2, cudnnNanPropagation_t=0>(cudnnTensorStruct, float*, cudnnTensorStruct, float const *, cudnnTensorStruct, float const *, float, float, float) [7493118]
100.852s 172.69us (46 9 17) (8 8 8) 8 2.7500KB 0B - - GeForce GTX 108 1 263 void cudnn::detail::cubeTransposeDeviceGeneric_kernel<float, float, float, int=8, int=8, int=8, int=8, int=8, int=11, int=5>(int, int, int, int, int, int, int, int, int, float, float const *, float*) [7493172]
100.852s 15.996us (4 32 1) (32 4 1) 40 8.5000KB 0B - - GeForce GTX 108 1 263 void cudnn::winograd::generateWinogradTilesKernel<int=0, float, float>(cudnn::winograd::GenerateWinogradTilesParams<float, float>) [7493176]
100.852s 1.3144ms (64 10 2) (256 1 1) 128 48.000KB 0B - - GeForce GTX 108 1 263 maxwell_scudnn_winograd_128x128_ldg1_ldg4_relu_tile418n_nt [7493180]
100.853s 123.39us (16 128 1) (32 1 4) 32 0B 0B - - GeForce GTX 108 1 263 void op_tensor_kernel<int=2, float, float, float, int=32, int=1, int=4, int=2, cudnnOpTensorOp_t=2, cudnnNanPropagation_t=0>(cudnnTensorStruct, float*, cudnnTensorStruct, float const *, cudnnTensorStruct, float const *, float, float, float) [7493192]
Avoiding the concatenation turned out to be a bit problematic to implement, as we must not apply a rectified linear unit until after the two convolutions have been added together. This probably does not sound too hard, but we are using the fused operator cudnnConvolutionBiasActivationForward, which does the work of three kernels in one. To implement this we provided an alternative path which splits this into two separate calls to cudnnConvolutionForward and cudnnAddTensor (followed by one cudnnActivationForward on the final sum), so we turned one kernel into five (could be made into four as the two bias weights can be merged).
There is also, as observed by lightvector, less batching with this approach which is typically bad for performance. Interestingly enough this approach has a systematic advantage for a batch size of one:
test batch_size_01 ... bench: 2,205,420 ns/iter (+/- 46,547)
test batch_size_02 ... bench: 2,548,588 ns/iter (+/- 83,218)
test batch_size_04 ... bench: 3,160,420 ns/iter (+/- 335,384)
test batch_size_08 ... bench: 4,500,350 ns/iter (+/- 78,394)
test batch_size_16 ... bench: 7,852,688 ns/iter (+/- 359,734)
test batch_size_32 ... bench: 16,937,519 ns/iter (+/- 600,127)
test batch_size_64 ... bench: 34,805,903 ns/iter (+/- 738,911)
The profiling output for this approach suggests the bottleneck is the two non-fused convolutional kernels (note the lack of a relu suffix in the first kernel):
Time(%) Time Calls Avg Min Max Name
27.19% 8.35323s 32472 257.24us 22.703us 5.5128ms maxwell_scudnn_winograd_128x128_ldg1_ldg4_tile418n_nt
24.92% 7.65386s 29745 257.32us 114.96us 4.4123ms void cudnn::detail::implicit_convolve_sgemm<float, float, int=128, int=5, int=5, int=3, int=3, int=3, int=1, bool=1, bool=0, bool=0>(int, int, int, float const *, int, float*, cudnn::detail::implicit_convolve_sgemm<float, float, int=128, int=5, int=5, int=3, int=3, int=3, int=1, bool=1, bool=0, bool=0>*, kernel_conv_params, int, float, float, int, float, float, int, int)
19.35% 5.94467s 25248 235.45us 24.578us 3.6630ms maxwell_scudnn_winograd_128x128_ldg1_ldg4_relu_tile418n_nt
7.23% 2.22092s 86558 25.658us 4.2700us 2.2978ms void op_tensor_kernel<int=2, float, float, float, int=32, int=1, int=4, int=2, cudnnOpTensorOp_t=2, cudnnNanPropagation_t=0>(cudnnTensorStruct, float*, cudnnTensorStruct, float const *, cudnnTensorStruct, float const *, float, float, float)
4.54% 1.39320s 48690 28.613us 3.6250us 3.0429ms void add_tensor_kernel_v3<int=2, float, float, int=32, int=1, int=4, int=2, int=2>(cudnnTensorStruct, float*, cudnnTensorStruct, float const *, float, float)
3.94% 1.21004s 18020 67.149us 43.379us 2.0338ms maxwell_scudnn_128x128_relu_small_nn
2.37% 729.19ms 5418 134.59us 124.24us 488.78us void cudnn::detail::implicit_convolve_sgemm<float, float, int=1024, int=5, int=5, int=3, int=3, int=3, int=1, bool=1, bool=0, bool=0>(int, int, int, float const *, int, float*, cudnn::detail::implicit_convolve_sgemm<float, float, int=1024, int=5, int=5, int=3, int=3, int=3, int=1, bool=1, bool=0, bool=0>*, kernel_conv_params, int, float, float, int, float, float, int, int)
2.28% 701.67ms 91376 7.6780us 3.6250us 1.7249ms void cudnn::winograd::generateWinogradTilesKernel<int=0, float, float>(cudnn::winograd::GenerateWinogradTilesParams<float, float>)
2.06% 631.92ms 39070 16.173us 1.6280us 1.4267ms void cudnn::detail::activation_fw_4d_kernel<float, float, int=128, int=1, int=4, cudnn::detail::relu_func<float, cudnnNanPropagation_t=0, bool=0>>(cudnnTensorStruct, float const *, cudnn::detail::activation_fw_4d_kernel<float, float, int=128, int=1, int=4, cudnn::detail::relu_func<float, cudnnNanPropagation_t=0, bool=0>>, cudnnTensorStruct*, float, cudnnTensorStruct*, int, cudnnTensorStruct*)
1.37% 421.88ms 53598 7.8710us 609ns 125.44us [CUDA memcpy HtoD]
1.00% 308.42ms 21636 14.255us 9.2160us 332.77us maxwell_scudnn_winograd_128x128_ldg1_ldg4_tile148n_nt
0.87% 268.24ms 12020 22.315us 12.811us 756.82us maxwell_scudnn_winograd_128x128_ldg1_ldg4_relu_tile148n_nt
0.68% 208.19ms 7814 26.643us 13.517us 427.71us maxwell_scudnn_128x32_relu_interior_nn
0.60% 184.39ms 49278 3.7410us 2.6420us 199.54us void op_tensor_kernel<int=2, float, float, float, int=128, int=1, int=1, int=4, cudnnOpTensorOp_t=2, cudnnNanPropagation_t=0>(cudnnTensorStruct, float*, cudnnTensorStruct, float const *, cudnnTensorStruct, float const *, float, float, float)
0.29% 87.697ms 16537 5.3030us 3.8090us 11.305us void op_tensor_kernel<int=2, float, float, float, int=64, int=1, int=2, int=4, cudnnOpTensorOp_t=2, cudnnNanPropagation_t=0>(cudnnTensorStruct, float*, cudnnTensorStruct, float const *, cudnnTensorStruct, float const *, float, float, float)
0.19% 59.357ms 57218 1.0370us 430ns 113.24us [CUDA memcpy DtoH]
0.18% 55.777ms 25834 2.1590us 1.3210us 13.579us cudnn::maxwell::gemm::computeOffsetsKernel(cudnn::maxwell::gemm::ComputeOffsetsParams)
0.16% 49.578ms 11721 4.2290us 1.8130us 140.49us void add_tensor_kernel_v3<int=2, float, float, int=16, int=16, int=1, int=16, int=4>(cudnnTensorStruct, float*, cudnnTensorStruct, float const *, float, float)
0.15% 45.971ms 4206 10.929us 6.7900us 16.528us void gemmSN_NN_kernel<float, float, float, int=128, int=2, int=4, int=8, int=4, int=4>(cublasGemmSmallNParams<float, float, float>, float const *, float const *, float, float, int)
0.13% 41.387ms 16218 2.5510us 2.3650us 11.090us void add_tensor_kernel_v3<int=2, float, float, int=128, int=1, int=1, int=4, int=2>(cudnnTensorStruct, float*, cudnnTensorStruct, float const *, float, float)
0.11% 34.524ms 2103 16.416us 13.456us 20.675us void gemmSN_NN_kernel<float, float, float, int=256, int=4, int=2, int=8, int=4, int=4>(cublasGemmSmallNParams<float, float, float>, float const *, float const *, float, float, int)
0.11% 33.634ms 3907 8.6080us 4.0550us 1.5123ms void cudnn::detail::softmax_fw_kernel_resident<int=2, float, float, int=256, int=1, int=0, int=0, int=32, int=0>(cudnnTensorStruct, float const *, cudnn::detail::softmax_fw_kernel_resident<int=2, float, float, int=256, int=1, int=0, int=0, int=32, int=0>, float*, int, float, float*, int, int)
0.09% 28.488ms 1806 15.773us 6.4810us 24.448us sgemm_32x32x32_NN
0.05% 15.373ms 5418 2.8370us 2.5020us 10.538us void add_tensor_kernel_v3<int=2, float, float, int=64, int=1, int=2, int=4, int=2>(cudnnTensorStruct, float*, cudnnTensorStruct, float const *, float, float)
0.05% 15.072ms 2703 5.5750us 2.1200us 11.871us void gemv2N_kernel_val<float, float, float, int=128, int=32, int=4, int=4, int=1>(float, float, cublasGemv2Params_v2<float, float, float>)
0.04% 12.135ms 3907 3.1060us 2.3350us 4.8130us void cudnn::detail::activation_fw_4d_kernel<float, float, int=128, int=1, int=4, cudnn::detail::tanh_func<float>>(cudnnTensorStruct, float const *, cudnn::detail::activation_fw_4d_kernel<float, float, int=128, int=1, int=4, cudnn::detail::tanh_func<float>>, cudnnTensorStruct*, float, cudnnTensorStruct*, int, cudnnTensorStruct*)
0.02% 5.4156ms 602 8.9960us 6.2360us 12.256us void gemmSN_NN_kernel<float, float, float, int=128, int=2, int=4, int=8, int=2, int=4>(cublasGemmSmallNParams<float, float, float>, float const *, float const *, float, float, int)
0.01% 3.9482ms 301 13.117us 12.596us 13.700us void gemmSN_NN_kernel<float, float, float, int=256, int=4, int=2, int=8, int=2, int=4>(cublasGemmSmallNParams<float, float, float>, float const *, float const *, float, float, int)
0.00% 4.5140us 7 644ns 527ns 817ns [CUDA memset]
==43157== API calls:
Time(%) Time Calls Avg Min Max Name
86.08% 29.2245s 110361 264.81us 5.7750us 1.37722s cudaMemcpyAsync
8.62% 2.92516s 562608 5.1990us 3.7800us 5.4377ms cudaLaunch
2.15% 728.72ms 56 13.013ms 7.8670us 728.07ms cudaStreamCreateWithFlags
0.93% 316.50ms 3114505 101ns 70ns 1.8341ms cudaSetupArgument
0.63% 214.91ms 472 455.33us 3.2420us 201.55ms cudaMalloc
0.60% 204.37ms 695 294.06us 730ns 193.87ms cudaFree
0.31% 106.35ms 35163 3.0240us 2.5810us 1.4839ms cudaBindTexture
0.20% 66.370ms 562608 117ns 73ns 1.4978ms cudaConfigureCall
0.19% 65.607ms 588442 111ns 66ns 1.5977ms cudaGetLastError
0.11% 37.160ms 35163 1.0560us 950ns 33.397us cudaUnbindTexture
0.07% 24.154ms 15628 1.5450us 757ns 275.20us cudaEventRecord
0.03% 11.726ms 7814 1.5000us 945ns 26.942us cudaStreamSynchronize
0.03% 9.0670ms 455 19.927us 3.6530us 1.3312ms cudaMemcpy
0.02% 8.0676ms 7814 1.0320us 718ns 32.317us cudaStreamWaitEvent
0.01% 2.5800ms 147 17.551us 8.2250us 153.49us cudaStreamCreate
0.01% 2.2455ms 210 10.692us 1.6500us 342.99us cudaStreamDestroy
0.00% 918.15us 277 3.3140us 256ns 145.05us cuDeviceGetAttribute
0.00% 665.98us 7 95.139us 16.184us 561.53us cudaHostAlloc
0.00% 448.76us 7 64.108us 7.1340us 388.77us cudaFreeHost
0.00% 382.35us 483 791ns 597ns 6.2300us cudaEventDestroy
0.00% 276.10us 287 962ns 781ns 8.1550us cudaEventCreateWithFlags
0.00% 224.82us 3 74.941us 72.614us 78.276us cuDeviceTotalMem
0.00% 211.42us 196 1.0780us 818ns 4.7000us cudaEventCreate
0.00% 156.28us 263 594ns 509ns 1.5440us cudaDeviceGetAttribute
0.00% 104.41us 3 34.802us 33.847us 35.309us cuDeviceGetName
0.00% 71.409us 7 10.201us 8.5030us 13.710us cudaStreamCreateWithPriority
0.00% 50.358us 28 1.7980us 1.6430us 2.3720us cudaThreadSynchronize
0.00% 50.043us 7 7.1490us 5.9380us 8.6380us cudaMemsetAsync
0.00% 27.197us 21 1.2950us 645ns 2.0280us cudaGetDevice
0.00% 19.687us 7 2.8120us 2.0730us 5.1200us cudaDeviceSynchronize
0.00% 10.260us 7 1.4650us 1.2320us 2.5320us cudaHostGetDevicePointer
0.00% 8.8170us 7 1.2590us 1.1510us 1.5340us cudaDeviceGetStreamPriorityRange
0.00% 2.3580us 5 471ns 257ns 1.2140us cuDeviceGetCount
0.00% 2.0630us 5 412ns 261ns 643ns cuDeviceGet
0.00% 1.5910us 2 795ns 790ns 801ns cuInit
0.00% 768ns 2 384ns 293ns 475ns cuDriverGetVersion
0.00% 135ns 1 135ns 135ns 135ns cudaRuntimeGetVersion
Adding a single dilated convolution had some good effects on the global perspective of Dream Go. After training for about 2 days on human games it only recognized two of our test cases as valid. Our previous, non-dilated, version recognized none of the test cases so this is still an improvement:
test ladder_1 ... ok
test ladder_3 ... FAILED
test dead_dragon_3 ... FAILED (-0.9999401)
test dead_dragon_1 ... ok
test dead_dragon_4 ... ok
test dead_dragon_2 ... FAILED (-0.3170574)
test ladder_2 ... ok
test end_1 ... ok
Since a single dilated convolution did not give a large enough effect, I figured I could try adding two dilated convolutions (with dilation 2 and 3) to increase the peripheral vision of each residual block even further. With this enhancement each residual block effectively sees a 7x7 block, allowing information to travel from one side to another in only 3 residual blocks (in theory).
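The 7x7 figure comes from the standard relation between kernel size and dilation: a kernel of size k with dilation d spans d * (k - 1) + 1 points. A tiny Python sketch (the function name is made up for illustration):

```python
def effective_kernel_size(kernel_size, dilation):
    """Width of the window spanned by a dilated convolution kernel."""
    return dilation * (kernel_size - 1) + 1


for d in (1, 2, 3):
    print(f"dilation {d}: {effective_kernel_size(3, d)}x{effective_kernel_size(3, d)}")
```

Note that a dilated kernel still only samples 3x3 points inside that window, so the wider view comes at no extra arithmetic cost per output channel.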
With this change each residual block gets this architecture:
x
├───┬───╮
D₁ D₂ D₃
├───┴───╯
C
│
y
As you can observe, I also increased the number of channels from 128 to 192, since I was afraid of the local shape information getting lost if we reduced the number of output channels to 64/32/32. This introduces additional variables to take into account when evaluating this change, but historically increasing the number of features has not helped much with the global perspective.
This architecture does very well on our test cases: the neural network only fails one of the dead dragon tests. The test that it fails is a game that white should win by 7.5 points, because a black dragon has one eye and a false eye. If the neural network misjudged the group as alive then black would win by 72.5 points:
test ladder_1 ... ok
test dead_dragon_2 ... ok
test dead_dragon_3 ... FAILED (-0.056128964)
test dead_dragon_1 ... ok
test dead_dragon_4 ... ok
test ladder_2 ... ok
test ladder_3 ... ok
test end_1 ... ok
As you can see the neural network judges the game as being pretty close, which suggests that it does not consider the dragon to be fully alive. But considering there is nothing else on the board that is undecided it is still a clear failure.
At the time of writing the neural network has run for 148.5k out of 245.7k steps, so it has not been fully trained and the results may therefore be subject to change.
The performance of the neural network is, as one would expect from the posts above, not great. It is 66% slower than the original neural network, which again corresponds closely to the expected slowdown of
running 9 tests
test batch_size_01 ... bench: 4,376,127 ns/iter (+/- 54,739)
test batch_size_02 ... bench: 3,998,227 ns/iter (+/- 93,668)
test batch_size_04 ... bench: 4,852,502 ns/iter (+/- 79,824)
test batch_size_08 ... bench: 7,454,700 ns/iter (+/- 1,525,497)
test batch_size_16 ... bench: 11,036,062 ns/iter (+/- 579,864)
test batch_size_32 ... bench: 20,837,262 ns/iter (+/- 1,498,197)
test batch_size_64 ... bench: 40,088,689 ns/iter (+/- 4,096,839)
test batch_size_128 ... bench: 78,774,026 ns/iter (+/- 1,662,254)
test batch_size_256 ... bench: 161,106,596 ns/iter (+/- 9,528,660)
However, if this is the price we have to pay for good predictions then that is an acceptable trade-off. But I still need to check that this is not an artificial increase in strength, and that trading quantity for quality of rollouts is actually worth it.
I also trained a 128 channel version of the architecture described above with 32 channels in total devoted to dilations, so according to the previous diagram:
This is, as expected, in-between the 1-dilation network and the 2-3 dilation network in terms of performance and precision. Unfortunately it completely misjudges the two dead dragons that are marked as FAILED, but at least it succeeds on some of them:
test ladder_1 ... ok
test ladder_2 ... ok
test ladder_3 ... FAILED
test dead_dragon_1 ... FAILED (-0.99884653)
test dead_dragon_2 ... ok
test dead_dragon_3 ... FAILED (-1)
test dead_dragon_4 ... ok
test end_1 ... ok
test batch_size_01 ... bench: 3,115,843 ns/iter (+/- 42,605)
test batch_size_02 ... bench: 2,646,697 ns/iter (+/- 75,162)
test batch_size_04 ... bench: 3,049,992 ns/iter (+/- 78,922)
test batch_size_08 ... bench: 4,227,070 ns/iter (+/- 73,766)
test batch_size_16 ... bench: 6,733,515 ns/iter (+/- 161,320)
test batch_size_32 ... bench: 12,286,137 ns/iter (+/- 249,813)
test batch_size_64 ... bench: 23,085,660 ns/iter (+/- 381,639)
test batch_size_128 ... bench: 44,886,785 ns/iter (+/- 765,558)
test batch_size_256 ... bench: 89,094,959 ns/iter (+/- 862,408)
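To compare the two benchmark listings it can help to convert ns/iter into positions per second. A small sketch, assuming each iteration evaluates exactly one batch of the stated size (the benchmark harness is not shown in this thread, so that is an assumption). The 192/128 labels reflect my reading of which network each listing belongs to, and only three batch sizes are copied in:

```python
# ns/iter copied from the two benchmark listings above.
NS_PER_ITER_192 = {1: 4_376_127, 16: 11_036_062, 256: 161_106_596}
NS_PER_ITER_128 = {1: 3_115_843, 16: 6_733_515, 256: 89_094_959}


def positions_per_second(batch_size, ns_per_iter):
    """Throughput if one iteration evaluates one batch (assumption)."""
    return batch_size / (ns_per_iter * 1e-9)


def slowdown(batch_size):
    """How many times slower the 192-channel listing is at this batch size."""
    return NS_PER_ITER_192[batch_size] / NS_PER_ITER_128[batch_size]


for batch_size in sorted(NS_PER_ITER_128):
    fast = positions_per_second(batch_size, NS_PER_ITER_128[batch_size])
    slow = positions_per_second(batch_size, NS_PER_ITER_192[batch_size])
    print(batch_size, round(fast), round(slow), round(slowdown(batch_size), 2))
```

The gap widens with the batch size, which matters because the search cost is dominated by the larger batches.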
Currently running a tournament between four different programs to determine which version of the programs is the best one. The settings are fast (but not blitz) games, with Chinese scoring and a komi of 7.5.
The following programs are part of the test, all of them were trained using the same hyper-parameters, but with a random seed:
- leela - Leela 0.11.0
- dg-d-128-1 - No dilation
- dg-d-128-2 - 96 convolution, and 32 2-dilation
- dg-d-128-2-3 - 96 convolution, 16 2-dilation, and 16 3-dilation.
- dg-d-192-2-3 - 128 convolution, 32 2-dilation, and 32 3-dilation.
I will update the following section with the results, but the expected results would be the following ranking, based on the assumption that the network sanity tests have some correlation to reality. leela has been omitted from the list because it is just there to get some anchor to reality. The associated number is the number of steps per second during training (so higher is better):
- dg-d-192-2-3 (0.95)
- dg-d-128-2 (1.80)
- dg-d-128-2-3 (1.34)
- dg-d-128-1 (2.29)
This trial was cancelled after 37 games (for every match-up, so a total of 367 games) had been played, since some match-ups could be eliminated because a winner had already been determined. The most notable of these are the matches against leela, which is performing very badly for some reason (pretty sure it should be stronger than this). The other candidate that could be eliminated is dg-d-192-2-3, which performed the worst of all candidates.
The remaining three candidates were put into another match-up that we can use to determine which of them were worth continuing with:
leela v dg-d-128-1 (37/50 games)
board size: 19 komi: 7.5
wins black white avg cpu
leela 1 2.70% 0 0.00% 1 5.56% 146.53
dg-d-128-1 36 97.30% 17 94.44% 19 100.00% 340.32
17 45.95% 20 54.05%
leela v dg-d-128-2 (37/50 games)
board size: 19 komi: 7.5
wins black white avg cpu
leela 2 5.41% 0 0.00% 2 11.11% 163.02
dg-d-128-2 35 94.59% 16 88.89% 19 100.00% 404.39
16 43.24% 21 56.76%
leela v dg-d-128-2-3 (37/50 games)
board size: 19 komi: 7.5
wins black white avg cpu
leela 1 2.70% 1 5.26% 0 0.00% 158.89
dg-d-128-2-3 36 97.30% 18 100.00% 18 94.74% 413.07
19 51.35% 18 48.65%
leela v dg-d-192-2-3 (37/50 games)
board size: 19 komi: 7.5
wins black white avg cpu
leela 1 2.70% 1 5.26% 0 0.00% 171.99
dg-d-192-2-3 36 97.30% 18 100.00% 18 94.74% 377.70
19 51.35% 18 48.65%
dg-d-128-1 v dg-d-128-2 (37/50 games)
board size: 19 komi: 7.5
wins black white avg cpu
dg-d-128-1 24 64.86% 13 68.42% 11 61.11% 630.82
dg-d-128-2 13 35.14% 7 38.89% 6 31.58% 675.98
20 54.05% 17 45.95%
dg-d-128-1 v dg-d-128-2-3 (37/50 games)
board size: 19 komi: 7.5
wins black white avg cpu
dg-d-128-1 15 40.54% 7 36.84% 8 44.44% 659.20
dg-d-128-2-3 22 59.46% 10 55.56% 12 63.16% 700.49
17 45.95% 20 54.05%
dg-d-128-1 v dg-d-192-2-3 (36/50 games)
board size: 19 komi: 7.5
wins black white avg cpu
dg-d-128-1 22 61.11% 13 72.22% 9 50.00% 752.21
dg-d-192-2-3 14 38.89% 9 50.00% 5 27.78% 608.40
22 61.11% 14 38.89%
dg-d-128-2 v dg-d-128-2-3 (36/50 games)
board size: 19 komi: 7.5
wins black white avg cpu
dg-d-128-2 18 50.00% 8 44.44% 10 55.56% 632.58
dg-d-128-2-3 18 50.00% 8 44.44% 10 55.56% 653.96
16 44.44% 20 55.56%
dg-d-128-2 v dg-d-192-2-3 (36/50 games)
board size: 19 komi: 7.5
wins black white avg cpu
dg-d-128-2 23 63.89% 13 72.22% 10 55.56% 858.17
dg-d-192-2-3 13 36.11% 8 44.44% 5 27.78% 651.65
21 58.33% 15 41.67%
dg-d-128-2-3 v dg-d-192-2-3 (36/50 games)
board size: 19 komi: 7.5
wins black white avg cpu
dg-d-128-2-3 20 55.56% 8 44.44% 12 66.67% 710.56
dg-d-192-2-3 16 44.44% 6 33.33% 10 55.56% 585.97
14 38.89% 22 61.11%
After these match-ups the following Elo could be estimated. There are not enough games to determine an accurate rank and I consider the top 3 to be essentially equal. It is unknown why dg-d-192-2-3 performed as badly as it did, but the main theory would be that the larger network is slower, and the increased accuracy does not compensate for the decreased number of rollouts the engine can perform:
leela:0.11.0 0.00
dg-d-192-2-3:0.5.0 533.04
dg-d-128-2:0.5.0 579.99
dg-d-128-2-3:0.5.0 612.88
dg-d-128-1:0.5.0 616.77
The same argument can be used to explain why dg-d-128-1 is ranked as number one, despite performing the worst of all networks on the sanity tests. But looking at the actual games listed above one can come to a different conclusion.
The dg-d-128-1 network plays very good local shape, but usually fails to account for global properties. However, most of the time good local shape leads to good global shape so it doesn't have to worry that often. Adding any dilation will reduce the number of channels the neural network can use to generate good local shape, in two ways:
These two reasons interact, since there are more large patterns than there are local patterns, so it has to look at the global scope but has fewer channels to do so. This will result in it having to generalize global shape into local shape, which may not always work out. A few other observations to keep in mind:
- dg-d-128-2-3 has 25% of its channels devoted to global thinking.
- dg-d-192-2-3 has 33% of its channels devoted to global thinking.
So the larger the fraction of channels that are devoted to global thinking, the harder it will be for the network to recognize local shape (because of the regularization factor mentioned above).
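These percentages follow from the channel splits listed for each network earlier in the thread. A short sketch that recomputes them (the dictionary layout and function name are only for illustration):

```python
# (non-dilated, 2-dilation, 3-dilation) channels, as listed for the tournament.
NETWORKS = {
    "dg-d-128-1": (128, 0, 0),
    "dg-d-128-2": (96, 32, 0),
    "dg-d-128-2-3": (96, 16, 16),
    "dg-d-192-2-3": (128, 32, 32),
}


def dilated_fraction(channels):
    """Share of the channels that come from dilated convolutions."""
    plain, dilation_2, dilation_3 = channels
    return (dilation_2 + dilation_3) / (plain + dilation_2 + dilation_3)


for name, channels in NETWORKS.items():
    print(name, f"{dilated_fraction(channels):.1%}")
```

Note that dg-d-128-2 and dg-d-128-2-3 devote the same share of channels to dilation, so the comparison between those two isolates the effect of mixing dilation levels.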
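The problem presented above has two issues:
- Global properties that have been blended into local properties in previous residual blocks.
- The second convolutional layer in each residual block having to consider both dilated and non-dilated features as equals due to the regularization.
The second problem is easy to solve: we could just decrease the regularization coefficient or drop the second residual blocks from the regularization completely. We could also do some gated architecture as below, using, for example, the batch normalization scale parameter as G₂ and G₃: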
x
├───┬───╮
D₁ D₂ D₃
│ │ │
│ G₂ G₃
├───┴───╯
C
│
y
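A rough sketch of what the gates would do, assuming G₂ and G₃ are per-channel scale factors applied to the outputs of D₂ and D₃ before everything is concatenated and fed to C. Feature maps are plain nested lists, gated_concat is a hypothetical helper, and the convolutions themselves are left out:

```python
def gated_concat(d1, d2, d3, g2, g3):
    """Concatenate channel lists, scaling each dilated channel by its gate."""
    scaled_2 = [[gate * v for v in channel] for gate, channel in zip(g2, d2)]
    scaled_3 = [[gate * v for v in channel] for gate, channel in zip(g3, d3)]
    return d1 + scaled_2 + scaled_3


# Toy example: 2 + 1 + 1 channels over a flattened 2x2 board.
d1 = [[1.0, 2.0, 3.0, 4.0], [0.5, 0.5, 0.5, 0.5]]
d2 = [[1.0, 1.0, 1.0, 1.0]]
d3 = [[2.0, 2.0, 2.0, 2.0]]
features = gated_concat(d1, d2, d3, g2=[0.0], g3=[0.5])
print(features)
```

A gate of zero switches a dilated channel off entirely, which is the freedom the optimizer would gain to ignore far features in a given block.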
It is not obvious if we have to solve the first problem, or if solving the second is enough for the optimizer to reserve some channels for local properties on its own. The only solution to the first problem we can think of would be to run separate towers for the different dilation levels and then combine them at the final layer, but this has several issues of its own. Some hybrid approaches where only some residual blocks use dilation might also work.
What do you mean by "local" properties versus "global" properties? If either way a property of the Go position is computed accurately ("this stone belongs to a group that has only one eye within radius 6 of this location") it does not matter if the computation of that property involved convolutions with different dilation levels or not. Some properties may be easier or harder to compute using different mixes of different dilations of course, but I think there's no reason to try to avoid blending them, because there's no such thing in the first place as an intrinsically "dilation 1" feature or a "dilation 2" feature that can only usefully be used by further convolutions with the exact same dilation factor.
I'm possibly misunderstanding something?
Also, I'm curious- what regularization are you referring to? Keep in mind that if you're using an L2 penalty on your weights but you're also using relus and batchnorm, then my understanding is that the L2 penalty does not have a significant regularization effect to begin with, so it has no relevance to whether any features are on equal footing with others or not. But if you're using a different regularization method things might be different.
It's cool to see these updates. I'd be interested to hear if you have results from your blitz games yet - it's possible the reduced performance is a bigger cost than the gain from better large-scale understanding, but if not, that would be really neat. :)
On Wed, Apr 11, 2018 at 1:32 PM, Karl Sundequist Blomdahl < notifications@github.com> wrote:
The problem presented above have two issues:
- Global properties that has been blended into local properties in previous residual blocks.
- The second convolutional layer in each residual block having to consider both dilated and non-dilated features as equals due to the regularization.
To answer your question in a random order:
I can add some blitz games, they should be fast to play. In fact I would not be surprised if this post has some in it since I started some just as I was typing this sentence and I'm planning on writing a fair bit more.
These blitz games are what we internally refer to as policy play games, i.e. they play greedily according to what the neural network suggests with no search. So these results should indicate the quality of the neural network predictions:
dg-d-128-1 v dg-d-128-2 (100/100 games)
board size: 19 komi: 7.5
wins black white avg cpu
dg-d-128-1 45 45.00% 24 48.00% 21 42.00% 3.21
dg-d-128-2 55 55.00% 29 58.00% 26 52.00% 4.05
53 53.00% 47 47.00%
dg-d-128-1 v dg-d-128-2-3 (100/100 games)
board size: 19 komi: 7.5
wins black white avg cpu
dg-d-128-1 39 39.00% 18 36.00% 21 42.00% 3.23
dg-d-128-2-3 61 61.00% 29 58.00% 32 64.00% 4.18
47 47.00% 53 53.00%
dg-d-128-1 v dg-d-192-2-3 (100/100 games)
board size: 19 komi: 7.5
wins black white avg cpu
dg-d-128-1 46 46.00% 23 46.00% 23 46.00% 3.31
dg-d-192-2-3 54 54.00% 27 54.00% 27 54.00% 4.97
50 50.00% 50 50.00%
dg-d-128-2 v dg-d-128-2-3 (100/100 games)
board size: 19 komi: 7.5
wins black white avg cpu
dg-d-128-2 42 42.00% 21 42.00% 21 42.00% 4.06
dg-d-128-2-3 58 58.00% 29 58.00% 29 58.00% 4.16
50 50.00% 50 50.00%
dg-d-128-2 v dg-d-192-2-3 (100/100 games)
board size: 19 komi: 7.5
wins black white avg cpu
dg-d-128-2 52 52.00% 25 50.00% 27 54.00% 4.21
dg-d-192-2-3 48 48.00% 23 46.00% 25 50.00% 5.03
48 48.00% 52 52.00%
dg-d-128-2-3 v dg-d-192-2-3 (100/100 games)
board size: 19 komi: 7.5
wins black white avg cpu
dg-d-128-2-3 55 55.00% 28 56.00% 27 54.00% 4.33
dg-d-192-2-3 45 45.00% 23 46.00% 22 44.00% 5.03
51 51.00% 49 49.00%
The results mirror what my sanity tests suggest: dilation results in higher quality predictions. The exception is dg-d-192-2-3, which performs below average (again). The estimated Elo of the networks based on these games is the following:
dg-d-128-1:0.5.0 0.00
dg-d-192-2-3:0.5.0 29.83
dg-d-128-2:0.5.0 33.33
dg-d-128-2-3:0.5.0 77.23
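As a sanity check, the ratings can be turned back into expected win rates. The thread does not say which tool produced the estimates, so the standard logistic Elo formula used below is an assumption:

```python
def expected_score(rating_a, rating_b):
    """Expected score of A against B under the standard logistic Elo model."""
    return 1.0 / (1.0 + 10.0 ** ((rating_b - rating_a) / 400.0))


# Ratings copied from the list above.
RATINGS = {
    "dg-d-128-1": 0.00,
    "dg-d-192-2-3": 29.83,
    "dg-d-128-2": 33.33,
    "dg-d-128-2-3": 77.23,
}

predicted = expected_score(RATINGS["dg-d-128-2-3"], RATINGS["dg-d-128-1"])
observed = 61 / 100  # dg-d-128-2-3 won 61 of 100 games against dg-d-128-1
print(round(predicted, 3), observed)
```

The predicted and observed win rates for this pairing agree to within a percentage point, which suggests the ratings were fitted with something close to this model.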
I claim the reason dg-d-192-2-3 performs worse than dg-d-128-2-3 (even though the latter could fit inside the former) is that the ratio of dilated to non-dilated channels is larger in the former. There are other factors that could be responsible, such as random initialization and just bad luck (since we only played 100 games).
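To put a number on the "bad luck" caveat, here is a sketch of a 95% interval for a win rate, using the normal approximation to the binomial distribution (an approximation that is reasonable at 100 games but not exact):

```python
import math


def win_rate_interval(wins, games, z=1.96):
    """Approximate 95% confidence interval for the true win rate."""
    p = wins / games
    half_width = z * math.sqrt(p * (1.0 - p) / games)
    return p - half_width, p + half_width


for wins in (55, 61):
    low, high = win_rate_interval(wins, 100)
    print(wins, round(low, 3), round(high, 3))
```

With 100 games the interval is roughly ten percentage points wide on each side, so a 55-45 result is still compatible with two equally strong networks, while 61-39 is not.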
I've been using global properties and local properties somewhat fuzzily on purpose, since I cannot claim to understand exactly what the neural network computes in the first place, and I am not a professional baduk player. A better term for what I mean might be near peripheral features and far peripheral features, where near peripheral features contain information about stones close to the centre of each convolution, and far peripheral features are about stones far from the centre of each convolution.
The networks with dilated convolutions seem to favour strategies that involve far peripheral features, so things like large scale captures and influence. The networks with dilation also perform worse on Life & Death problems, which would involve mostly near peripheral features. This is all from me skimming some of the games in the archive I linked above, and it is possible I am wrong.
The reason for this behaviour, I think, is that the way we have been adding dilation forces the network to reserve some channels for far peripheral features, whereas before it had some choice in this†. My current concerns about dilation are related to the forcing part, which is explored further in section 3.
Different features are easier to compute with input from different dilation levels. If we force the network to always consider the far peripheral features then it will have a harder time computing some features, and vice versa for near peripheral features.
For example it may be hard to recognize an eye if you have no choice but to look at 5x5 patterns, since the stones marked with ? are irrelevant to whether it is an eye or not, and having to store all combinations of these 16 question marks takes up considerable space in the weights:
? ? ? ? ?
? X X X ?
? X   X ?
? X X X ?
? ? ? ? ?
This is of course a simplified example; the optimizer is not so stupid that it would store all combinations of the 16 question marks. But something similar to this is going on, since you can observe the style differences between the networks.
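For a sense of scale, a quick count under the simplifying assumption that each point is independently empty, black or white (ignoring board edges, symmetry and legality):

```python
STATES_PER_POINT = 3  # empty, black, white


def pattern_count(points):
    """Number of distinct stone configurations over the given number of points."""
    return STATES_PER_POINT ** points


print(pattern_count(16))  # the 16 surrounding points marked with ?
print(pattern_count(9))   # every point of a full 3x3 window, for comparison
```

The 16 irrelevant points alone admit about 43 million configurations, compared with under twenty thousand for a complete 3x3 window.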
† Recent research calls the claim that it could choose to do so into question.
I am using L2 regularization, batch normalization, and gradient clipping during the training. The last one is unnecessary at this point, but was useful while I still had a bug during initialization where it would sometimes try to factorize singular matrices (resulting in huge weights that would collapse the entire network to zero without clipping).
You are correct in that batch normalization and L2 regularization are somewhat redundant, but my understanding is that L2 regularization achieves two things:
It is the second effect that I am worried about since it says D₁, D₂, and D₃ must all be equally important to whatever C is computing (using notation from the figure below), and since some features are easier to compute without D₂ or D₃ this constraint might make it hard for the network to learn certain features.
x
├───┬───╮
D₁ D₂ D₃
├───┴───╯
C
│
y
My main reason for worrying about this is the fact that the network with dilation seems to play worse than the network without dilation. The blitz game results suggest that this is mainly due to the lack of search due to time constraints, which would also give rise to the same problem where it would fail to notice vital moves during non-trivial situations as they require reading to spot.
I will train another network with the following configuration, without L2 regularization, to see if it has any significant effect. I do not believe we will have any problems with overfitting, and if the L2 regularization had no impact then results should be the same as before:
- D₁ - 96 channels
- D₂ - 16 channels
- D₃ - 16 channels
This mirrors the dg-d-128-2-3 architecture, which seems to be the currently best network. I expect this to take about 2 days, 2 hours, and 30 minutes.
PS: My idea about separate towers for different dilations is probably stupid, and does not really make sense upon further inspection.
Sorry for the confusion: I misread your earlier post, where you said you were running some fast (non-blitz) games. Did those pan out, or were the new neural nets worse once taking into account the shallower search due to the worse performance?
For these games though - nice results!
I think I understand you. Yes, if the dilated channels are there, then the neural net will use them, and therefore devote some proportion of the non-dilated channels to computing features that are useful for the dilated channels, so long as that improves the overall loss function more than not doing so. That will obviously make it worse at doing whatever the excess non-dilated channels were doing before.
I don't think this is a problem though, it simply is a tradeoff of network capacity. To give a different example - imagine you originally did not provide any history planes as an input feature, and now you add some, but don't increase the number of channels in the rest of the neural net. Then obviously the neural net will get worse at some kinds of tactics, because it will be now devoting some of its channels to processing the new history information, instead of devoting them to whatever local shapes it was doing before. But doing so improves the overall prediction quality, because the new history information is strongly predictive of other things.
I don't think there is anything special about dilated/nondilated, it is exactly like adding any other new information or representation capacity. It will cause a tradeoff to use the new capacity, but in a way that (unless you're experiencing major underfitting or overfitting) should be overall better in predictive ability. Of course, better predictive ability does not always mean more strength, because predictive ability and playing strength are two different things.
I'm putting this in a separate reply because this gets pretty technical.
Yes, of course. As in #2 above, I think there is no significant pathology or problem with "mixing" this kind of information, if there is a loss in strength there is a good chance it is due to something like:
You mentioned this:
Due to its quadratic nature, it encourage all weights to be roughly the same size. This is to avoid overfitting to only some features. Batch normalization does not do this.
But actually, to first order L2 loss does NOT encourage all weights to be the same relative size as each other and does NOT have a significant effect on overfitting, in the presence of batch normalization, and if you are using gradient descent. I could be mistaken, but I'm reasonably sure about this. This is a surprising fact if you have not thought about it before!
Why?
This means there is no regularization effect or avoidance of overfitting. For example, imagine we are at a local minimum in data loss where weight A and B serve extremely similar purposes but weight A is twice as large as B. Then after scaling due to the L2 loss step, weight A will still be twice as large as B, and there will still be no data loss gradient to change them, so even as both shrink, A will remain twice as large as B forever.
This is different than if there was no batchnorm. In that case, after scaling there would be a data loss gradient to re-increase both A and B since both are now too small. If A and B serve the same purpose, then A and B would experience the same gradient upward, but since B is only decreased by the L2 loss half as much but it is re-increased just as much, the net effect over time will be to make A = B, as expected.
So with batch norm, the only thing that is affected by L2 loss is the global scale of the weights in each layer. This still does have an effect on the gradient, but only on the scale. Multiplying all the weights in a layer by a constant factor C followed by a batchnorm causes all gradients in that layer to be multiplied by 1/C during the backward pass, which is an effective factor of 1/C^2 in the relative gradient (relative to the magnitude of the weight).
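The scaling claims can be checked numerically on a toy layer. The sketch below is not taken from any of the engines discussed here: it uses a single linear unit followed by a batch norm without learned scale or shift, a made-up squared-error loss, and finite-difference gradients, with C = 2:

```python
import math
import random


def batchnorm(values):
    """Normalise a batch of scalars to zero mean and unit variance."""
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    return [(v - mean) / math.sqrt(var) for v in values]


def loss(w, xs, targets):
    """Squared error between batchnorm(x . w) and some fixed targets."""
    ys = [sum(wk * xk for wk, xk in zip(w, x)) for x in xs]
    return sum((z - t) ** 2 for z, t in zip(batchnorm(ys), targets))


def numerical_grad(w, xs, targets, h=1e-6):
    """Central finite differences, good enough for a sanity check."""
    grad = []
    for k in range(len(w)):
        up, down = list(w), list(w)
        up[k] += h
        down[k] -= h
        grad.append((loss(up, xs, targets) - loss(down, xs, targets)) / (2 * h))
    return grad


def norm(v):
    return math.sqrt(sum(x * x for x in v))


rng = random.Random(0)
xs = [[rng.gauss(0, 1) for _ in range(3)] for _ in range(16)]
targets = [rng.gauss(0, 1) for _ in range(16)]
w = [0.5, -1.0, 2.0]
c = 2.0
w_scaled = [c * wk for wk in w]

g = numerical_grad(w, xs, targets)
g_scaled = numerical_grad(w_scaled, xs, targets)

loss_gap = abs(loss(w, xs, targets) - loss(w_scaled, xs, targets))
grad_ratio = norm(g_scaled) / norm(g)
relative_ratio = (norm(g_scaled) / norm(w_scaled)) / (norm(g) / norm(w))
print(loss_gap, grad_ratio, relative_ratio)
```

Scaling the weights leaves the loss unchanged, shrinks the gradient by about 1/C, and shrinks the gradient relative to the weight magnitude by about 1/C², which is the effective learning rate change described above.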
If you are using momentum, then the picture is changed a bit, but broadly I think the same analysis holds. If you are using an entirely different kind of optimizer, such as ADAM, then the above does not hold, but I think L2 still has a very strange effect that is very different than the regularization it has without batchnorm.
L2 loss causes no regularization effect or avoidance of overfitting when using batch normalization because it has no effect on the predictions or on the relative directions of gradients. Instead, to first order it only affects the scale of weights, so it is approximately the same as training without any L2 but performing a slight increase in the learning rate on every iteration.
The results of the fast games are in this reply (https://github.com/Chicoryn/dream-go/issues/25#issuecomment-379918474). I can understand the confusion since I tend to heavily edit my posts, as most of the time they end up just being an experimental log that no one (?) except me reads.
In summary the results of the fast games were mixed, adding dilated convolutions produced some better and some worse (!) engines. But there was no significant jump in strength, I suspect mostly because of the performance issues, resulting in fewer rollouts for the networks using dilation.
I think we agree here, which features were added / removed / changed does not really matter, and some change in behaviour is to be expected. If this L2 regularization thingy is something (I'll get to your second post later), then it would be a problem for features outside of dilation too.
My worry originally comes from the fact that dg-d-192-2-3 performs worse (even during blitz) than any other network using dilation despite being larger. But this could be because it is larger but was trained for the same number of steps as the other networks, so it received "less training per weight" (I am not sure if this is a thing).
I am using SGD with momentum (https://github.com/Chicoryn/dream-go/blob/master/contrib/trainer/dream_tf/__main__.py#L658) but I do not believe this is important for this discussion as the analysis should turn out the same.
I think your analysis is correct, but the assumption that the weight_decay (L2 regularization) and gradient descent will be done in sequence does not hold in practice (http://pytorch.org/docs/master/_modules/torch/optim/sgd.html#SGD), where they are done independently.
If batch normalization and weight decay are performed independently then I believe the L2 regularization still encourages the weights to be roughly the same magnitude. This is because the SGD update formula with weight_decay and a constant gradient norm puts a hard limit on the size of the weights, and it will be hard for the optimizer to maintain the effort it needs to counter the weight_decay considering how noisy the gradients are in SGD (though less so when using momentum):
next_weights = weights - weight_decay · weights - gradients
If we want next_weights to increase then -gradients must be larger than weight_decay · weights, so weights has an upper bound of -gradients / weight_decay:
next_weights > weights
⇒ weight_decay · weights < -gradients
⇒ weights < -gradients / weight_decay
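The bound can be checked with a toy iteration of the update formula above. A constant gradient is assumed purely for illustration (real gradients change as the weights move), and the learning rate is folded into the gradient as in the formula:

```python
def sgd_step(weight, gradient, weight_decay):
    """One update as written above; the learning rate is folded into `gradient`."""
    return weight - weight_decay * weight - gradient


def run(weight, gradient, weight_decay, steps):
    for _ in range(steps):
        weight = sgd_step(weight, gradient, weight_decay)
    return weight


gradient, weight_decay = -0.01, 0.1
bound = -gradient / weight_decay
from_below = run(0.0, gradient, weight_decay, 500)
from_above = run(1.0, gradient, weight_decay, 500)
print(bound, from_below, from_above)
```

Starting below the bound the weight grows towards -gradients / weight_decay, and starting above it the weight shrinks back to the same value, so under this toy assumption the bound acts as a fixed point.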
I might look further into this tomorrow, but it is getting a bit late at the moment.
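Regardless of the reason, we both seem to agree that removing the L2 regularization I am using at the moment is a good idea, since:
- You claim it does nothing except waste GPU cycles during training.
- I claim it might hurt neural network performance.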
Sounds good.
Regarding the sequential application of L2 and data loss gradient, I agree that doing them not in sequence is a small difference, but it is much smaller than the first-order effect that batch norm takes away (consider that except at the very start of training, a single gradient update usually changes each weight by a minuscule fraction of a percent of the root mean square of the weights, so the second order effects from sequence vs simultaneous are very small).
Keep in mind that all of the analysis I wrote above is about the relative magnitudes of the weights. It doesn't matter if the optimizer can counter the weight decay or not, because decayed weights behave the same as undecayed weights if the next layer is a batch norm, what matters is the relative magnitude. The weight A and weight B example I gave is a good example to go back to. L2 penalty causes both A and B to shrink proportionally, with A continuing to be twice as large. The regularizing behavior would be to make A/B ~= 1, and if there is no batch norm, that is what you get, because the data provides pressure to increase them again, such that the local optimum has A/B ~= 1. But when there is batch norm, there is no such pressure, the decayed weights are just as good. If A and B both shrink enough and then random walk due to noise enough so that randomly B could become larger sometimes, then of course you are free to call that "regularization", but it is exactly the same kind of "regularization" as if you removed L2 loss entirely and just turned up the learning rate enough that B could randomly sometimes overtake A. Sometimes the opposite would happen and it would become even smaller or negative compared to A. This is a very different kind of "regularization" than one that actually encourages and converges to precisely A/B ~= 1 as the unique local optimum.
Edit: Fixed some spacing. Also, looking forward to any further updates in the future, this thread has been great and following it has been very interesting! :)
On Wed, Apr 11, 2018 at 10:08 PM, Karl Sundequist Blomdahl < notifications@github.com> wrote:
Regardless of the reason, we both seem to agree that removing the L2 regularization I am using at the moment is a good idea, since:
- You claim it does nothing except waste GPU cycles during training.
- I claim it might hurt neural network performance.
@lightvector I think I followed what you wrote, my understanding of how BN works is very weak. Are there any take aways from this that could apply to some of the other projects like LZGo, minigo, LZChess? Minigo and LZChess are having trouble recently.
In the context of LZ, I think generally having some amount of L2 as is the case right now, is probably good. Not because of regularization, but to maintain the learning rate. Because for LZ, you train the net essentially forever, even new nets you often use net2net to bootstrap. If you don't have any L2 loss (i.e. weight decay), then over time the norm of the weights will drift larger (e.g. a high-dimensional brownian motion will very reliably move away from the origin proportional to sqrt(time)). This means your effective learning rate will drop over time**, which is bad because as you are always receiving new and better data, you want to maintain a fixed and high learning rate. But with some amount of L2, you will reach an equilibrium where the outward drift is balanced by the inward decay, and therefore maintain your effective learning rate.
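The drift and the equilibrium can be illustrated with a toy random walk, where unit Gaussian noise stands in for noisy gradient updates (a strong simplification, and the dimension and decay values below are arbitrary):

```python
import math
import random


def walk_norm(steps, dims, weight_decay, seed=0):
    """Norm of a vector after `steps` of unit Gaussian noise plus weight decay."""
    rng = random.Random(seed)
    w = [0.0] * dims
    for _ in range(steps):
        w = [(1.0 - weight_decay) * v + rng.gauss(0.0, 1.0) for v in w]
    return math.sqrt(sum(v * v for v in w))


DIMS = 400
for steps in (100, 400):
    free = walk_norm(steps, DIMS, weight_decay=0.0)
    decayed = walk_norm(steps, DIMS, weight_decay=0.05)
    print(steps, round(free, 1), round(decayed, 1), round(math.sqrt(steps * DIMS), 1))
```

Without decay the norm tracks sqrt(steps · dims) and keeps growing, while with decay it settles at a much smaller value that barely changes between 100 and 400 steps.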
For fixed-data sets (e.g. one-shot training a policy net to convergence on a fixed set of pro games), I don't think there is any particular value in L2 with batchnorm, since you want to anneal your learning rate anyways. Except maybe it makes your learning rate easier to think about if you tune it so that the equilibrium is actually roughly equal to the scale of your weights that you initialize with, so that now the only factor that affects your effective learning rate is your literal learning rate, rather than also this subtle weight-growth phenomenon.
I'm not sure if this amounts to any particular takeaway for LZ other than what it's already doing. If you are curious, you can check if you reached equilibrium by simply printing out the norm of the weights every few million training steps and seeing if the weight norm is no longer changing much. LZ almost certainly has.
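The drift-versus-decay equilibrium described above can be illustrated with a toy simulation. This is only a sketch under a strong assumption: every gradient step is modelled as pure Gaussian noise (the Brownian motion picture), and `simulate_weight_norm`, its parameters, and the chosen constants are hypothetical and unrelated to any real LZ training setup.

```python
import math
import random

def simulate_weight_norm(steps, dim=256, lr=0.1, weight_decay=0.0, seed=0):
    """Toy model: each update is Gaussian noise plus optional L2 decay.

    Returns the weight norm halfway through and at the end of the run.
    """
    rng = random.Random(seed)
    w = [0.0] * dim
    halfway_norm = 0.0
    shrink = 1.0 - lr * weight_decay  # weight decay pulls towards the origin
    for step in range(1, steps + 1):
        w = [shrink * wi + lr * rng.gauss(0.0, 1.0) for wi in w]
        if step == steps // 2:
            halfway_norm = math.sqrt(sum(wi * wi for wi in w))
    return halfway_norm, math.sqrt(sum(wi * wi for wi in w))

free_half, free_end = simulate_weight_norm(2000, weight_decay=0.0)
decay_half, decay_end = simulate_weight_norm(2000, weight_decay=0.1)
print(f"no L2:   norm {free_half:.1f} -> {free_end:.1f}")
print(f"with L2: norm {decay_half:.1f} -> {decay_end:.1f}")
```

Without decay the norm keeps growing roughly like sqrt(steps); with decay it settles, which is the "print the norm every so often" check suggested above.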
** If you're having trouble following, it's pretty simple. Consider Z = batchnorm(Y) where Y = W X. Double the weights W. Then, Y is doubled. So batchnorm is now dividing by an extra factor of 2 to undo the doubling. So dZ/dY is cut in half. Therefore dZ/dW is cut in half. Also because of batchnorm, we know only relative changes to weights matter. If we perform one update W := W - learning_rate * dZ/dW, then since W was doubled and dZ/dW is half the size, the relative step is one quarter as large now. So effectively the learning rate has been divided by 4.
I think you are correct. I mentioned a few posts ago that I started training a new network, and the results (so far) match your predictions. The loss was virtually the same in the beginning, but it fails to keep up with the other networks towards the end; your explanation about the effect of L2 regularization on the learning rate would explain this behaviour.
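The factor-of-four argument can be checked numerically. The sketch below assumes a single output feature, a batchnorm without epsilon or learned scale and offset, and an arbitrary scalar loss `sum(t * Z)`; all names here (`batchnorm`, `loss`, `numerical_grad`) are made up for the illustration, and gradients are taken by finite differences rather than by any framework.

```python
import math
import random

def batchnorm(y):
    # Normalize over the batch: zero mean, unit standard deviation.
    mean = sum(y) / len(y)
    std = math.sqrt(sum((v - mean) ** 2 for v in y) / len(y))
    return [(v - mean) / std for v in y]

def loss(w, xs, ts):
    # Arbitrary scalar function of Z = batchnorm(Y), with Y = w . x per sample.
    ys = [sum(wi * xi for wi, xi in zip(w, x)) for x in xs]
    return sum(t * z for t, z in zip(ts, batchnorm(ys)))

def numerical_grad(w, xs, ts, h=1e-6):
    # Central finite differences, one weight at a time.
    grad = []
    for i in range(len(w)):
        up = w[:i] + [w[i] + h] + w[i + 1:]
        down = w[:i] + [w[i] - h] + w[i + 1:]
        grad.append((loss(up, xs, ts) - loss(down, xs, ts)) / (2 * h))
    return grad

def norm(v):
    return math.sqrt(sum(vi * vi for vi in v))

rng = random.Random(0)
xs = [[rng.gauss(0, 1) for _ in range(4)] for _ in range(32)]
ts = [rng.gauss(0, 1) for _ in range(32)]
w = [rng.gauss(0, 1) for _ in range(4)]
w2 = [2 * wi for wi in w]  # the doubled weights

g, g2 = numerical_grad(w, xs, ts), numerical_grad(w2, xs, ts)
grad_ratio = norm(g2) / norm(g)                            # expected near 1/2
step_ratio = (norm(g2) / norm(w2)) / (norm(g) / norm(w))   # expected near 1/4
print(grad_ratio, step_ratio)
```

`step_ratio` compares the size of one gradient step relative to the weight norm, which is the "relative step" in the argument above.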
A takeaway from this is also that I am not training long enough (sigh), since it still benefits from an increasing learning rate. I included a screenshot of the accuracy and loss of the different networks below, the network without L2 regularization is the pink line that is noticeably trailing behind towards the end:
The pink line is hard to see in some of the charts because they all converge within a percentage or two of each other anyway, but the network without L2 regularization is performing worse in all of the metrics.
This raises some interesting questions for me, at the moment I do not use batch normalization nor L2 regularization on the weights in the value head and policy head (I forget if there is a good reason for this). This might be a mistake if we are saying that the combination of batch normalization + L2 regularization is effectively just a dynamic learning rate boost, since then the weights in those heads are missing out on some training towards the end.
PS: Your idea about looking at the norm of the weights as an indication of whether the learning rate is too small or too large is an interesting one, since my norms would suggest my learning rate schedule is not too great (the norms are monotonically decreasing). It should be relatively trivial to implement a dynamic learning rate based on the previous (maybe a moving average) and the current norm of the weights.
A screenshot of the per-variable norms is included below. You can ignore the norm of the offsets (aka biases, or β in the batch normalization formula), which do not have L2 regularization:
On an unrelated note, did you do any experiments on the percentage of games in which to "drop" the history inputs? You mention 5-10% in your repository, but is there any experimental data on what percentage yielded the best results?
I am wondering because looking at the Monte Carlo trees one can clearly observe that the neural network has a strong tendency to memorize sequences of play, which is not a desirable property. The search can still overrule the suggestions of the neural network of course, but that is a waste of GPU cycles if we can just fix the problem in the network instead. An example self-play game where you can see this behaviour is attached (see the fake ko fight in the corner).
The game includes all variations considered by the search, so it is pretty big (even if I only did 1,600 rollouts):
Thanks for asking! No, I didn't test this. I don't have any value net or MCTS component yet, so it's hard for me to do experiments regarding what would improve search. My first project goal needs only policy (neural-net-aided explorations of human biases by players of different ranks), so that's what I've started with.
I chose around 10% because it was a simple conservative number large enough to get the behavior I wanted but small enough to be very unlikely to cause much worse prediction when history was still present. If I were to experiment in the future, it would probably be:
(Edit: One minor detail - with my input representation, the neural net can always determine ko legality regardless of history or no-history. I would make sure that this remains the case, because I don't want no-history to also blind the neural net as to what moves are legal in the first place in ko fights, the neural net should of course always be given enough info to determine legality of current move).
Do you have a source on the linear interpolation behaviour of relu? I would be interested in reading more about that since I was skimming my list of features to make sure it could still determine whether a move / ko is legal or not without the history planes, when I noticed that according to previous experiments that I have done the "Am I black or white?" feature only affects the accuracy of the value head by about -2.4% (but it did not affect the policy head accuracy at all). These experiments are not very recent so a lot could have changed, but this suggests said feature is mainly used for komi.
The number of testing games* that are won because of komi is up to 13.2% (depending on how you score draws), which suggests the feature doesn't do its job too well†, but if it did then you could implement a dynamic komi between -7.5 and 7.5 by setting the "Am I black or white?" feature plane to `0.5 + komi / 7.5`, assuming some sort of interpolation. Not sure if it would also extrapolate to a larger or smaller komi.
We are still trying to fix the learning rate after dropping the L2 regularization, which is very time consuming. We figured we'll try one of the fancy automatic learning schedules just for the sake of it, but said learning schedule is proving unstable as my training input is a mixture between three different data-sets which can cause some bad mini-batches and as a consequence a very noisy loss:
The original reason for mixing different datasets was to expose the neural network to moves that may only appear during reading in higher ranked games, and especially with lower ranked players to get it to distrust the opponent's moves. It is not obvious if it is necessary to do this anymore if we start messing around with the history features, as that would achieve the same thing, and the first point doesn't actually make sense.
So we could probably swap to a more monotonic dataset that would result in a less noisy loss, and therefore train faster.
Mostly my intuitions about linearity come from a few papers like https://arxiv.org/pdf/1412.6572.pdf showing behavior in extrapolation that is linear-like (although later papers point to linearity of behavior as very much not being the sole factor in the existence of adversarial examples). Also posts like http://mlg.eng.cam.ac.uk/yarin/blog_3d801aa532c1ce.html that show that it is possible to give a Bayesian interpretation to some of what neural nets do, and the fact that if along a particular dimension you only have two data points (the value at history = 0 and the value at history = 1), under many sensible priors your posterior will look like a linear interpolation.
I have not tested it, so this could be wrong. Maybe in reality it varies in a wiggly or jumpy fashion as you go from 0 to 1, in a way that unpredictably varies between differently-initialized nets trained on the same data. I expect that even if it is approximately linear, it is not exactly so, and will randomly wiggle and curve depending on the position.
I would NOT rely on such an interpolation to handle a variety of komi when training a value net under an AlphaZero-like process. Instead I would just have my self-play games generated with a large variety of different komi, telling it what the komi was in each case. This is much more likely to be reliable, since rather than praying that the neural net interpolates correctly to 0.5 komi when only given -7.5 komi and 7.5 komi examples, you simply train it directly on 0.5 komi games, or 3.5 komi games, etc. Of course, you are unlikely to find such games from human data sets that don't suffer from bias (e.g. 0.5 komi games mean that White was a higher-ranked player), so this does require that you generate your data set yourself.
The reason I had such a crazy hack in mind to try testing for the history feature that I would definitely not want to use for komi was because I don't know a reasonable way where you can train with only "halfway providing" the correct history. You can train with only providing half the komi, because that actually changes the nature of your training data (affecting whether some positions are winning and losing), but I don't know how you do that with history. Either you provide history, or you don't.
You could try providing it with noise, for example making it 50% likely to be an incorrect history versus a correct one, but that seems very hard to do well because unless your noise distribution is highly plausible among histories that could have led to this position, the neural net will probably just learn to distinguish incorrect histories from correct ones, and then it will know mostly whether to pay attention to it or not.
Re-reading "Explaining and Harnessing Adversarial Examples" suggests to me that the interpolation is only locally linear around each known value, hence the ε. So for real-valued features the interpolation will probably be fairly smooth, but it is not obvious how it would behave for binary input features [1]. So using it for komi would almost certainly not work unless we provide it with actual examples as you suggest.
Nevertheless, my survey of articles about adversarial examples has further convinced me that perturbing the history features is actually very important for real-life performance. I will train the following networks, when I've finished re-tuning the learning rate, using the d-128-2-3 network as a base:
You have to be a little bit careful when shuffling to avoid feeding the neural network rubbish, which would probably cause it to just ignore those features. So I suggest the following limitations:
You could go further and provide similar random positions from other games, but I am not sure there is any point since that will not occur in practice.
[1] http://colinraffel.com/publications/iclr2018thermometer.pdf
An alternative solution to the history problem could use adversarial networks as suggested by Stijn Tonk [2]. As part of the training, we would train an additional network that, given some policy (and some additional data like the current board position?), tries to predict the previous move; we then include the loss of this adversarial network during training.
Unclear how well this would fit to Go, but just noting it here in case I want to look into it later.
[2] Stijn Tonk, https://blog.godatadriven.com/fairness-in-ml
These are the experimental results of setting the occupied vertices in the history planes to different constants instead of `1.0`. Note that we do not use one-hot encoding of the history planes; we use the AlphaGo representation, which will probably skew these results, and without training the network using the methods described above it may just give essentially random results:

- `dg-h-000` sets the occupied vertices to `0.0`. Example Game
- `dg-h-010` sets the occupied vertices to `0.1`. Example Game
- `dg-h-025` sets the occupied vertices to `0.25`. Example Game
- `dg-h-050` sets the occupied vertices to `0.5`. Example Game
- `dg-h-075` sets the occupied vertices to `0.75`. Example Game
- `dg-h-100` sets the occupied vertices to `1.0`. Example Game

You can make some interesting observations from the example games (played with the `--self-play` option, so the entire games, especially the openings, are pretty random):
I also played a blitz tournament between the top 3 engines `dg-h-050`, `dg-h-075`, and `dg-h-100`. The results are very weird to me, since `dg-h-100` ends up losing to all of the other engines by a pretty wide margin:
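dg-h-050 v dg-h-075 (21/100 games)
unknown results: 3 (14.29%)
board size: 19, komi: 7.5

| engine | wins | as black | as white | avg cpu |
| --- | --- | --- | --- | --- |
| dg-h-050 | 11 (52.38%) | 4 (36.36%) | 7 (70.00%) | 11.03 |
| dg-h-075 | 7 (33.33%) | 2 (20.00%) | 5 (45.45%) | 11.11 |
| total by colour | | 6 (28.57%) | 12 (57.14%) | |

dg-h-050 v dg-h-100 (20/100 games)
board size: 19, komi: 7.5

| engine | wins | as black | as white | avg cpu |
| --- | --- | --- | --- | --- |
| dg-h-050 | 20 (100.00%) | 10 (100.00%) | 10 (100.00%) | 9.33 |
| dg-h-100 | 0 (0.00%) | 0 (0.00%) | 0 (0.00%) | 9.13 |
| total by colour | | 10 (50.00%) | 10 (50.00%) | |

dg-h-075 v dg-h-100 (20/100 games)
unknown results: 2 (10.00%)
board size: 19, komi: 7.5

| engine | wins | as black | as white | avg cpu |
| --- | --- | --- | --- | --- |
| dg-h-075 | 16 (80.00%) | 9 (90.00%) | 7 (70.00%) | 9.92 |
| dg-h-100 | 2 (10.00%) | 2 (20.00%) | 0 (0.00%) | 10.24 |
| total by colour | | 11 (55.00%) | 7 (35.00%) | |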
I cancelled this blitz tournament after 20 games in each match-up since the results were pretty clear (and very weird), and instead re-started a tournament with 1,600 rollouts to check if this also affects the value head. I will post the results when they are done, said tournament should take a day or so to complete.
Out of curiosity, I tested this now. I checked what happens if I feed in the history planes at 0.5 weight in my neural net. It uses one-hot indicators of the location of the previous several moves (each move in its own channel) instead of the AlphaGo representation, and as I mentioned before, was specifically trained to behave well at both 0 (history absent) and 1 (history present). Informally, I looked by hand at the colored prediction heatmaps of several dozen positions over several games, comparing between history plane at 0, 0.5, and 1.0 weight.
It definitely does interpolate. In every case I looked at by hand, the resulting 0.5 heatmap looked reasonable and was roughly "in between" the 0 and 1 heatmaps. I did not find any example where it did anything crazy or significantly non-interpolation-like.
However, it definitely was NOT a linear or consistent interpolation. While sometimes the 0.5 heatmap was pretty close to an average of the 0 and 1 heatmaps, sometimes it was much closer to either the 0 or the 1 heatmap alone, e.g. more like 0.1 [0 heatmap] + 0.9 [1 heatmap]. Also, it was not always uniform between moves on the board: sometimes, as you went 0 -> 0.5 -> 1, move A would light up strongly from 0 -> 0.5 and B only a little, and then from 0.5 -> 1, move A would only light up a bit more, while B would light up strongly. This seemed more common when A and B were far apart on the board and/or differed in whether they were next to the last move or not.
Still, often it did give something clearly in-between. So, pretty interesting! :)
[Images: History, Half-history, No-history]
I completely forgot about the interpolation behavior as I was so surprised at the tournament results. No history in this case is `dg-h-000`, so the history features are set to zero, which is very different from what it was trained on. The identity version is with all of the history features set to the current board position:
[Images: History, Half-history, No-history, No-history (identity)]
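The history variants compared above (full, half, zeroed, and identity) can be sketched as plain transformations of the input planes. The layout here is an assumption made only for illustration: `features` is a list of square planes, plane 0 stands in for the current board, and the indices in `history_planes` are hypothetical; the real dream-go feature layout may differ.

```python
def scale_planes(features, history_planes, weight):
    """Multiply the selected history planes by `weight`; copy all other planes."""
    return [
        [[v * weight for v in row] for row in plane] if i in history_planes
        else [row[:] for row in plane]
        for i, plane in enumerate(features)
    ]

def identity_history(features, current_plane, history_planes):
    """Overwrite every history plane with a copy of the current board plane."""
    return [
        [row[:] for row in features[current_plane]] if i in history_planes
        else [row[:] for row in plane]
        for i, plane in enumerate(features)
    ]

# Tiny 2x2 example: plane 0 is the current board, planes 1 and 2 are history.
features = [
    [[1.0, 0.0], [0.0, 1.0]],
    [[1.0, 0.0], [0.0, 0.0]],
    [[0.0, 0.0], [0.0, 1.0]],
]
history = {1, 2}

half_history = scale_planes(features, history, 0.5)       # like dg-h-050
no_history = scale_planes(features, history, 0.0)         # like dg-h-000
no_history_identity = identity_history(features, 0, history)
```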
Results from the tournament with playouts (1,600) look as I expected the blitz games to look. The history features seem to contribute to an increased network strength. Since this is the opposite of the blitz results, the obvious conclusion is that the value head collapses without the history features but the policy head can keep going.
It is unclear if the value head should benefit from the history features, but since we only compute a single shared representation for both the policy and value head there is not much choice for the value head other than using the history features if they allow for a lower net loss in the policy head:
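dg-h-050 v dg-h-075 (100/100 games)
unknown results: 16 (16.00%)
board size: 19, komi: 7.5

| engine | wins | as black | as white | avg cpu |
| --- | --- | --- | --- | --- |
| dg-h-050 | 3 (3.00%) | 3 (6.00%) | 0 (0.00%) | 489.50 |
| dg-h-075 | 81 (81.00%) | 42 (84.00%) | 39 (78.00%) | 467.57 |
| total by colour | | 45 (45.00%) | 39 (39.00%) | |

dg-h-050 v dg-h-100 (100/100 games)
unknown results: 14 (14.00%)
board size: 19, komi: 7.5

| engine | wins | as black | as white | avg cpu |
| --- | --- | --- | --- | --- |
| dg-h-050 | 9 (9.00%) | 3 (6.00%) | 6 (12.00%) | 527.77 |
| dg-h-100 | 77 (77.00%) | 37 (74.00%) | 40 (80.00%) | 489.52 |
| total by colour | | 40 (40.00%) | 46 (46.00%) | |

dg-h-075 v dg-h-100 (100/100 games)
unknown results: 14 (14.00%)
board size: 19, komi: 7.5

| engine | wins | as black | as white | avg cpu |
| --- | --- | --- | --- | --- |
| dg-h-075 | 27 (27.00%) | 13 (26.00%) | 14 (28.00%) | 456.90 |
| dg-h-100 | 59 (59.00%) | 31 (62.00%) | 28 (56.00%) | 457.94 |
| total by colour | | 44 (44.00%) | 42 (42.00%) | |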
Just letting you know that I implemented your idea in https://github.com/Chicoryn/dream-go/issues/25#issuecomment-381417656 for Leela Zero at https://github.com/gcp/leela-zero/issues/1599 together with dynamic komi. It was pointed out by @TFiFiE that the formula `0.5 + komi / 7.5` should be `0.5 + komi / 15.0`.
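A quick check of the corrected formula: with a divisor of 15.0 the komi range [-7.5, 7.5] maps exactly onto the plane values [0, 1], whereas the original divisor of 7.5 overshoots to 1.5 at komi 7.5. The helper name `komi_to_plane_value` and the `max_komi` parameter are made up for this sketch; which end of the range corresponds to which colour is not specified here.

```python
def komi_to_plane_value(komi, max_komi=7.5):
    """Map komi in [-max_komi, max_komi] linearly onto [0, 1].

    With max_komi = 7.5 this is 0.5 + komi / 15.0.
    """
    if not -max_komi <= komi <= max_komi:
        raise ValueError("komi outside the range covered by the feature plane")
    return 0.5 + komi / (2.0 * max_komi)

for komi in (-7.5, 0.0, 0.5, 7.5):
    print(komi, komi_to_plane_value(komi))
```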
@alreadydone That is very cool. I've not read your entire thread but your implementation seems to be working much better than I expected when I coined the original concept.
As for the monotonicity of different networks, my intuition says this has to do with overfitting of the network to minor correlations in the training data, which could be due to too little training data or too low of a learning rate (or other reasons, this is not a solved problem). If you have a lot of spare GPU cycles you might want to consider training a robust network [1] using something like PGD, which should avoid many of the local maxima that break the monotonicity, as they are effectively adversarial examples to your network.
[1] Towards Deep Learning Models Resistant to Adversarial Attacks
Check out some of the ideas mentioned here to enhance the neural network:
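For readers unfamiliar with PGD, the inner attack step can be sketched on a toy model. This is a minimal illustration only: it assumes a linear logistic classifier instead of a Go network, uses an L-infinity ball of radius `eps`, omits the random start often used in practice, and all function names are hypothetical. Robust training would then minimize the loss on the perturbed inputs.

```python
import math

def sigmoid_output(w, x):
    return 1.0 / (1.0 + math.exp(-sum(wi * xi for wi, xi in zip(w, x))))

def logistic_loss(w, x, y):
    """Binary cross-entropy of the linear model for a label y in {0, 1}."""
    p = sigmoid_output(w, x)
    return -math.log(p) if y == 1 else -math.log(1.0 - p)

def input_gradient(w, x, y):
    """Gradient of the loss with respect to the input x."""
    p = sigmoid_output(w, x)
    return [(p - y) * wi for wi in w]

def pgd_attack(w, x, y, eps=0.1, alpha=0.02, steps=10):
    """Projected gradient ascent on the loss inside an L-infinity ball around x."""
    adv = list(x)
    for _ in range(steps):
        grad = input_gradient(w, adv, y)
        # Signed gradient step that increases the loss.
        adv = [a + alpha * ((g > 0) - (g < 0)) for a, g in zip(adv, grad)]
        # Project back so that no coordinate moves more than eps from x.
        adv = [min(max(a, x0 - eps), x0 + eps) for a, x0 in zip(adv, x)]
    return adv

w = [1.0, -2.0, 0.5]
x = [0.3, 0.1, -0.4]
x_adv = pgd_attack(w, x, y=1)
print(logistic_loss(w, x, 1), logistic_loss(w, x_adv, 1))
```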
@lightvector https://github.com/lightvector/GoNN
Initial thoughts on the concepts without any further research to back it up:
Chain Pooling: This probably gives worse results since the pooling destroys the local shape information. It might be interesting to do a dense block approach where each residual block becomes a DenseNet-style block [1]. This way each residual block would benefit from both the pooling and the local shape:

- Given `x`, compute the chain pooling of each channel and store the result in `yₛ`.
- Concatenate `x` and `yₛ` channel-wise into `yₓₛ` (so `yₓₛ` has shape `[256, 19, 19]`).
- `y₁ <- relu(C(W₁, yₓₛ) + b₁)` (`y₁` has shape `[128, 19, 19]`)
- Concatenate `y₁` and `yₛ` channel-wise into `y₁ₛ` (so `y₁ₛ` has shape `[256, 19, 19]`).
- `y₂ <- relu(C(W₂, y₁ₛ) + x + b₂)`
- Output `y₂`
I suspect this is too expensive to do in practice (no good support for it in cuDNN), but it is a very interesting idea.
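The pooling and concatenation steps of the block above can be sketched on the CPU as follows. This is a minimal illustration, not a proposed implementation: the board is a flat list with 0 for empty, 1 for black and 2 for white, every empty vertex is treated as its own chain (an assumption), the convolutions are left out, and `label_chains` and `chain_max_pool` are hypothetical helper names.

```python
def label_chains(board, size):
    """Give each solidly connected group of same-coloured stones one chain id."""
    chain = [-1] * len(board)
    for start in range(len(board)):
        if chain[start] != -1:
            continue
        chain[start] = start
        if board[start] == 0:
            continue  # assumption: an empty vertex forms its own chain
        stack = [start]
        while stack:  # flood fill over orthogonal neighbours
            row, col = divmod(stack.pop(), size)
            for r, c in ((row - 1, col), (row + 1, col), (row, col - 1), (row, col + 1)):
                n = r * size + c
                if 0 <= r < size and 0 <= c < size and chain[n] == -1 and board[n] == board[start]:
                    chain[n] = start
                    stack.append(n)
    return chain

def chain_max_pool(plane, chain):
    """Replace every value by the maximum over all vertices in the same chain."""
    best = {}
    for value, cid in zip(plane, chain):
        best[cid] = max(best.get(cid, value), value)
    return [best[cid] for cid in chain]

# 3x3 example board and a single-channel input x.
board = [1, 1, 0,
         0, 1, 2,
         2, 0, 2]
chain = label_chains(board, 3)
x = [[0.1, 0.9, 0.3, 0.2, 0.4, 0.5, 0.6, 0.7, 0.8]]
y_s = [chain_max_pool(plane, chain) for plane in x]
y_xs = x + y_s  # channel-wise concatenation doubles the channel count
```

Because the pooled planes are concatenated rather than substituted, the original values in `x` are still available to the following convolution, which is the point of the dense variant.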