Do you have an idea of how much gain this will give? I'm not familiar with Winograd.
From https://github.com/gcp/leela-zero/issues/305, it appears to be roughly 2x faster. So a pretty significant win!
Ok this is amazing! https://arxiv.org/pdf/1509.09308.pdf
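For anyone curious where the speedup comes from, here's a minimal sketch (not leela-zero code) of the 1-D F(2,3) transform from that paper: it computes two outputs of a 3-tap convolution with 4 multiplications instead of the naive 6. The 2-D F(2x2,3x3) case used for these networks nests it to get 16 multiplications instead of 36.

```cpp
#include <array>
#include <cstdio>

// 1-D Winograd F(2,3): two outputs of a 3-tap filter with 4 multiplies.
// d = 4 input values, g = 3 filter taps.
std::array<float, 2> winograd_f23(const std::array<float, 4>& d,
                                  const std::array<float, 3>& g) {
    const float m1 = (d[0] - d[2]) * g[0];
    const float m2 = (d[1] + d[2]) * 0.5f * (g[0] + g[1] + g[2]);
    const float m3 = (d[2] - d[1]) * 0.5f * (g[0] - g[1] + g[2]);
    const float m4 = (d[1] - d[3]) * g[2];
    return {m1 + m2 + m3, m2 - m3 - m4};
}

int main() {
    const std::array<float, 4> d = {1, 2, 3, 4};
    const std::array<float, 3> g = {0.5f, 0.25f, 0.125f};
    const auto y = winograd_f23(d, g);
    // Matches the naive result: y0 = d0*g0 + d1*g1 + d2*g2 = 1.375,
    //                           y1 = d1*g0 + d2*g1 + d3*g2 = 2.25.
    std::printf("%f %f\n", y[0], y[1]);
}
```

In practice the filter transform is precomputed once per network load, so the per-position multiply count roughly halves, which lines up with the ~2x figure above.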
For people with an Intel iGPU the gain is an order of magnitude. (The OpenCL code for direct convolution was designed for "real" GPUs with a SIMT model, whereas the new code tunes the compute kernel to the device on first launch.)
Winograd is now promoted to "next": https://github.com/gcp/leela-zero/tree/next
Cool! Thanks for the heads up @gcp.
Ported it over; I need to fix this ASan bug (leaving it as a note to myself):
==65008==ERROR: AddressSanitizer: heap-buffer-overflow on address 0x63200002d200 at pc 0x000100e27e31 bp 0x7ffeeedec670 sp 0x7ffeeedec668
WRITE of size 4 at 0x63200002d200 thread T0
#0 0x100e27e30 in Network::winograd_transform_in(std::__1::vector<float, std::__1::allocator<float> > const&, std::__1::vector<float, std::__1::allocator<float> >&, int) Network.cpp:489
#1 0x100e2a280 in Network::winograd_convolve3(int, std::__1::vector<float, std::__1::allocator<float> > const&, std::__1::vector<float, std::__1::allocator<float> > const&, std::__1::vector<float, std::__1::allocator<float> >&, std::__1::vector<float, std::__1::allocator<float> >&, std::__1::vector<float, std::__1::allocator<float> >&) Network.cpp:593
#2 0x100e2aaf5 in Network::forward_cpu(std::__1::vector<float, std::__1::allocator<float> >&, std::__1::vector<float, std::__1::allocator<float> >&) Network.cpp:719
#3 0x100e3696d in Network::get_scored_moves_internal(BoardHistory const&, Network::NNPlanes&, Network::DebugRawData*) Network.cpp:856
#4 0x100e30cda in Network::get_scored_moves(BoardHistory const&, Network::DebugRawData*) Network.cpp:813
#5 0x100ec6694 in UCTNode::create_children(std::__1::atomic<int>&, BoardHistory const&, float&) UCTNode.cpp:75
#6 0x100e8eaa5 in UCTSearch::play_simulation(BoardHistory&, UCTNode*) UCTSearch.cpp:71
#7 0x100e8ee0e in UCTSearch::play_simulation(BoardHistory&, UCTNode*) UCTSearch.cpp:85
#8 0x100e8ee0e in UCTSearch::play_simulation(BoardHistory&, UCTNode*) UCTSearch.cpp:85
#9 0x100e8ee0e in UCTSearch::play_simulation(BoardHistory&, UCTNode*) UCTSearch.cpp:85
#10 0x100e93f44 in UCTSearch::think() UCTSearch.cpp:254
#11 0x100f865fd in bench() main.cpp:256
#12 0x100f8ac2b in main main.cpp:357
#13 0x7fff77425114 in start (libdyld.dylib:x86_64+0x1114)
0x63200002d200 is located 2560 bytes to the right of 81920-byte region [0x632000018800,0x63200002c800)
allocated by thread T0 here:
#0 0x1015620ab in wrap__Znwm (libclang_rt.asan_osx_dynamic.dylib:x86_64h+0x640ab)
#1 0x100e52de9 in std::__1::vector<float, std::__1::allocator<float> >::allocate(unsigned long) new:226
#2 0x100e52ac5 in std::__1::vector<float, std::__1::allocator<float> >::vector(unsigned long) vector:1068
#3 0x100e0d7ac in std::__1::vector<float, std::__1::allocator<float> >::vector(unsigned long) vector:1062
#4 0x100e2aa02 in Network::forward_cpu(std::__1::vector<float, std::__1::allocator<float> >&, std::__1::vector<float, std::__1::allocator<float> >&) Network.cpp:716
#5 0x100e3696d in Network::get_scored_moves_internal(BoardHistory const&, Network::NNPlanes&, Network::DebugRawData*) Network.cpp:856
#6 0x100e30cda in Network::get_scored_moves(BoardHistory const&, Network::DebugRawData*) Network.cpp:813
#7 0x100ec6694 in UCTNode::create_children(std::__1::atomic<int>&, BoardHistory const&, float&) UCTNode.cpp:75
#8 0x100e8eaa5 in UCTSearch::play_simulation(BoardHistory&, UCTNode*) UCTSearch.cpp:71
#9 0x100e8ee0e in UCTSearch::play_simulation(BoardHistory&, UCTNode*) UCTSearch.cpp:85
#10 0x100e8ee0e in UCTSearch::play_simulation(BoardHistory&, UCTNode*) UCTSearch.cpp:85
#11 0x100e8ee0e in UCTSearch::play_simulation(BoardHistory&, UCTNode*) UCTSearch.cpp:85
#12 0x100e93f44 in UCTSearch::think() UCTSearch.cpp:254
#13 0x100f865fd in bench() main.cpp:256
#14 0x100f8ac2b in main main.cpp:357
#15 0x7fff77425114 in start (libdyld.dylib:x86_64+0x1114)
SUMMARY: AddressSanitizer: heap-buffer-overflow Network.cpp:489 in Network::winograd_transform_in(std::__1::vector<float, std::__1::allocator<float> > const&, std::__1::vector<float, std::__1::allocator<float> >&, int)
Shadow bytes around the buggy address:
0x1c64000059f0: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
0x1c6400005a00: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
0x1c6400005a10: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
0x1c6400005a20: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
0x1c6400005a30: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
=>0x1c6400005a40:[fa]fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
0x1c6400005a50: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
0x1c6400005a60: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
0x1c6400005a70: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
0x1c6400005a80: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
0x1c6400005a90: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
Shadow byte legend (one shadow byte represents 8 application bytes):
Addressable: 00
Partially addressable: 01 02 03 04 05 06 07
Heap left redzone: fa
Freed heap region: fd
Stack left redzone: f1
Stack mid redzone: f2
Stack right redzone: f3
Stack after return: f5
Stack use after scope: f8
Global redzone: f9
Global init order: f6
Poisoned by user: f7
Container overflow: fc
Array cookie: ac
Intra object redzone: bb
ASan internal: fe
Left alloca redzone: ca
Right alloca redzone: cb
==65008==ABORTING
Abort trap: 6
@gcp I had to change WINOGRAD_P to W*H/2 = 32 instead of (W+1)*(H+1)/4 = 20 to get things working. Now the Winograd GPU output matches the master branch (bench worked :). However, the CPU path is still busted.
Winograd works on 4x4 input tiles with an overlap of 2 in each direction, so each tile covers a 2x2 block of output. Go boards are 19x19, so the input needs padding to 20x20. The stride of 2 in each direction is what causes the division by 4: for 20x20 that gives 20*20/4 = 100 tiles.
Chess with 8x8 doesn't need padding. But I do think you only need 16 tiles? The divisor shouldn't change?
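To make that concrete, here's a quick sketch of the tile count (winograd_p here is an illustrative helper, not the actual constant definition in the source):

```cpp
#include <cstdio>

// Number of 4x4 input tiles (stepping by 2) needed to cover a W x H board
// for Winograd F(2x2, 3x3): pad up to even dimensions, then each tile
// produces one 2x2 block of output.
constexpr int winograd_p(int w, int h) {
    return ((w + 1) / 2) * ((h + 1) / 2);
}

int main() {
    std::printf("Go    19x19 padded to 20x20 -> P = %d\n", winograd_p(19, 19)); // 100
    std::printf("Chess  8x8  (no padding)    -> P = %d\n", winograd_p(8, 8));   // 16
}
```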
I have now validated that the new GPU and CPU Winograd paths exactly match the output of the previous GPU and CPU paths on bench.
It required this piece of magic though: https://github.com/glinscott/leela-chess/commit/0d9dc18db1b04c49a3b277cbd4c436cb610923c4#diff-4bf813f8583371d38812a265361e016fR716
@gcp you were right, it only needs 16 tiles. However, somehow that meant the V array was getting sized too small.
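For reference, a sketch of how the scratch buffers have to be sized once P = 16. The constant names mirror the leela-zero style, but the exact layout here is my assumption, not the committed code:

```cpp
#include <vector>

constexpr int WINOGRAD_ALPHA = 4;                               // 4x4 input tile
constexpr int WINOGRAD_TILE  = WINOGRAD_ALPHA * WINOGRAD_ALPHA; // 16 elements per tile
constexpr int WINOGRAD_P     = (8 / 2) * (8 / 2);               // 16 tiles for an 8x8 board

// V holds the transformed input (tile elements x input channels x tiles);
// M holds its product with the transformed weights. If V is allocated from
// a smaller tile count than winograd_transform_in() actually writes, you
// get exactly the heap-buffer-overflow ASan reported above.
void make_scratch(int in_channels, int out_channels,
                  std::vector<float>& V, std::vector<float>& M) {
    V.assign(WINOGRAD_TILE * in_channels  * WINOGRAD_P, 0.0f);
    M.assign(WINOGRAD_TILE * out_channels * WINOGRAD_P, 0.0f);
}
```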
It will be promoted once it has stabilized; currently it's at https://github.com/gcp/leela-zero/tree/winograd.