Port winograd implementation from leela-zero

glinscott commented 6 years ago

Once it's stabilized, currently at https://github.com/gcp/leela-zero/tree/winograd

Error323 commented 6 years ago

Do you have an idea how much of a gain this will give? I'm not familiar with Winograd.

glinscott commented 6 years ago

From https://github.com/gcp/leela-zero/issues/305, it appears to be roughly 2x faster. So a pretty significant win!

Error323 commented 6 years ago

Ok this is amazing! https://arxiv.org/pdf/1509.09308.pdf

gcp commented 6 years ago

For people with an Intel iGPU, the gain is an order of magnitude. (The OpenCL code for direct convolution was designed for "real" GPUs with a SIMT model, whereas the new code tunes the computing kernel to the device on first launch.)

gcp commented 6 years ago

Winograd is now promoted to "next": https://github.com/gcp/leela-zero/tree/next

glinscott commented 6 years ago

Cool! Thanks for the heads up @gcp.

glinscott commented 6 years ago

Ported it over, but I need to fix this ASan bug (leaving as a note to myself):

==65008==ERROR: AddressSanitizer: heap-buffer-overflow on address 0x63200002d200 at pc 0x000100e27e31 bp 0x7ffeeedec670 sp 0x7ffeeedec668
WRITE of size 4 at 0x63200002d200 thread T0
    #0 0x100e27e30 in Network::winograd_transform_in(std::__1::vector<float, std::__1::allocator<float> > const&, std::__1::vector<float, std::__1::allocator<float> >&, int) Network.cpp:489
    #1 0x100e2a280 in Network::winograd_convolve3(int, std::__1::vector<float, std::__1::allocator<float> > const&, std::__1::vector<float, std::__1::allocator<float> > const&, std::__1::vector<float, std::__1::allocator<float> >&, std::__1::vector<float, std::__1::allocator<float> >&, std::__1::vector<float, std::__1::allocator<float> >&) Network.cpp:593
    #2 0x100e2aaf5 in Network::forward_cpu(std::__1::vector<float, std::__1::allocator<float> >&, std::__1::vector<float, std::__1::allocator<float> >&) Network.cpp:719
    #3 0x100e3696d in Network::get_scored_moves_internal(BoardHistory const&, Network::NNPlanes&, Network::DebugRawData*) Network.cpp:856
    #4 0x100e30cda in Network::get_scored_moves(BoardHistory const&, Network::DebugRawData*) Network.cpp:813
    #5 0x100ec6694 in UCTNode::create_children(std::__1::atomic<int>&, BoardHistory const&, float&) UCTNode.cpp:75
    #6 0x100e8eaa5 in UCTSearch::play_simulation(BoardHistory&, UCTNode*) UCTSearch.cpp:71
    #7 0x100e8ee0e in UCTSearch::play_simulation(BoardHistory&, UCTNode*) UCTSearch.cpp:85
    #8 0x100e8ee0e in UCTSearch::play_simulation(BoardHistory&, UCTNode*) UCTSearch.cpp:85
    #9 0x100e8ee0e in UCTSearch::play_simulation(BoardHistory&, UCTNode*) UCTSearch.cpp:85
    #10 0x100e93f44 in UCTSearch::think() UCTSearch.cpp:254
    #11 0x100f865fd in bench() main.cpp:256
    #12 0x100f8ac2b in main main.cpp:357
    #13 0x7fff77425114 in start (libdyld.dylib:x86_64+0x1114)

0x63200002d200 is located 2560 bytes to the right of 81920-byte region [0x632000018800,0x63200002c800)
allocated by thread T0 here:
    #0 0x1015620ab in wrap__Znwm (libclang_rt.asan_osx_dynamic.dylib:x86_64h+0x640ab)
    #1 0x100e52de9 in std::__1::vector<float, std::__1::allocator<float> >::allocate(unsigned long) new:226
    #2 0x100e52ac5 in std::__1::vector<float, std::__1::allocator<float> >::vector(unsigned long) vector:1068
    #3 0x100e0d7ac in std::__1::vector<float, std::__1::allocator<float> >::vector(unsigned long) vector:1062
    #4 0x100e2aa02 in Network::forward_cpu(std::__1::vector<float, std::__1::allocator<float> >&, std::__1::vector<float, std::__1::allocator<float> >&) Network.cpp:716
    #5 0x100e3696d in Network::get_scored_moves_internal(BoardHistory const&, Network::NNPlanes&, Network::DebugRawData*) Network.cpp:856
    #6 0x100e30cda in Network::get_scored_moves(BoardHistory const&, Network::DebugRawData*) Network.cpp:813
    #7 0x100ec6694 in UCTNode::create_children(std::__1::atomic<int>&, BoardHistory const&, float&) UCTNode.cpp:75
    #8 0x100e8eaa5 in UCTSearch::play_simulation(BoardHistory&, UCTNode*) UCTSearch.cpp:71
    #9 0x100e8ee0e in UCTSearch::play_simulation(BoardHistory&, UCTNode*) UCTSearch.cpp:85
    #10 0x100e8ee0e in UCTSearch::play_simulation(BoardHistory&, UCTNode*) UCTSearch.cpp:85
    #11 0x100e8ee0e in UCTSearch::play_simulation(BoardHistory&, UCTNode*) UCTSearch.cpp:85
    #12 0x100e93f44 in UCTSearch::think() UCTSearch.cpp:254
    #13 0x100f865fd in bench() main.cpp:256
    #14 0x100f8ac2b in main main.cpp:357
    #15 0x7fff77425114 in start (libdyld.dylib:x86_64+0x1114)

SUMMARY: AddressSanitizer: heap-buffer-overflow Network.cpp:489 in Network::winograd_transform_in(std::__1::vector<float, std::__1::allocator<float> > const&, std::__1::vector<float, std::__1::allocator<float> >&, int)
Shadow bytes around the buggy address:
  0x1c64000059f0: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
  0x1c6400005a00: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
  0x1c6400005a10: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
  0x1c6400005a20: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
  0x1c6400005a30: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
=>0x1c6400005a40:[fa]fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
  0x1c6400005a50: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
  0x1c6400005a60: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
  0x1c6400005a70: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
  0x1c6400005a80: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
  0x1c6400005a90: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
Shadow byte legend (one shadow byte represents 8 application bytes):
  Addressable:           00
  Partially addressable: 01 02 03 04 05 06 07 
  Heap left redzone:       fa
  Freed heap region:       fd
  Stack left redzone:      f1
  Stack mid redzone:       f2
  Stack right redzone:     f3
  Stack after return:      f5
  Stack use after scope:   f8
  Global redzone:          f9
  Global init order:       f6
  Poisoned by user:        f7
  Container overflow:      fc
  Array cookie:            ac
  Intra object redzone:    bb
  ASan internal:           fe
  Left alloca redzone:     ca
  Right alloca redzone:    cb
==65008==ABORTING
Abort trap: 6
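
For reading the report above, it can help to convert the byte figures into `float` element counts. The sketch below uses only the addresses printed by ASan; the constant names are hypothetical, and it assumes 4-byte floats, which matches the reported "WRITE of size 4".

```cpp
#include <cstdint>

// Addresses copied from the ASan report; the names are illustrative only.
constexpr std::uint64_t kRegionBegin = 0x632000018800ULL;  // start of heap region
constexpr std::uint64_t kRegionEnd   = 0x63200002c800ULL;  // one past the end
constexpr std::uint64_t kFaultAddr   = 0x63200002d200ULL;  // address of the bad write

static_assert(sizeof(float) == 4, "sketch assumes 4-byte floats");

// Number of floats the vector actually holds (81920 bytes / 4).
constexpr std::uint64_t kCapacity = (kRegionEnd - kRegionBegin) / sizeof(float);

// Element index that winograd_transform_in tried to write.
constexpr std::uint64_t kFaultIndex = (kFaultAddr - kRegionBegin) / sizeof(float);

// How far past the end the write landed, in floats (2560 bytes / 4).
constexpr std::uint64_t kFloatsPastEnd = kFaultIndex - kCapacity;

static_assert(kCapacity == 20480, "81920-byte region holds 20480 floats");
static_assert(kFaultIndex == 21120, "write targeted element 21120");
static_assert(kFloatsPastEnd == 640, "640 floats beyond the allocation");
```

So the vector allocated at Network.cpp:716 holds 20480 floats, and the write at Network.cpp:489 targeted element 21120, which is 640 elements past the end.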

glinscott commented 6 years ago

@gcp I had to change WINOGRAD_P to W*H/2=32 instead of (W+1)*(H+1)/4=20 to get things to work. Now the Winograd GPU output matches the master branch (bench worked :).

However, CPU is still busted.

glinscott commented 6 years ago

https://github.com/glinscott/leela-chess/commit/8e5dcddf600579ab4a91d8400a1251080f5f4adb

gcp commented 6 years ago

Winograd works on 4x4 input tiles with an overlap of 2 in each direction. Go boards are 19x19, so they need padding to 20x20. The overlap of 2 in each direction is what causes the division by 4.

Chess with 8x8 doesn't need padding. But I do think you only need 16 tiles? The divisor shouldn't change?
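
The tile arithmetic discussed here can be checked with a small sketch. It assumes tiles advance by 2 squares per dimension (as described above) and that each dimension is rounded up separately; the function and constant names are hypothetical and not taken from either code base. It also shows why the combined formula `(W+1)*(H+1)/4` agrees for a 19x19 board but gives 20 rather than 16 for 8x8 under integer division.

```cpp
// Tiles advance by 2 squares per dimension (4x4 input tiles, overlap of 2).
constexpr int kTileStride = 2;  // hypothetical name

// Tiles along one dimension, rounding up so odd sizes get padded.
constexpr int tiles_per_dim(int size) {
    return (size + kTileStride - 1) / kTileStride;
}

// Total tiles per plane: round each dimension up first, then multiply.
constexpr int tile_count(int width, int height) {
    return tiles_per_dim(width) * tiles_per_dim(height);
}

// The combined formula quoted in the thread, using integer division.
constexpr int combined_formula(int width, int height) {
    return (width + 1) * (height + 1) / 4;
}

static_assert(tile_count(19, 19) == 100, "Go: 19x19 padded to 20x20");
static_assert(combined_formula(19, 19) == 100, "same result for odd sizes");
static_assert(tile_count(8, 8) == 16, "chess: 8x8 needs no padding");
static_assert(combined_formula(8, 8) == 20, "81 / 4 truncates to 20");
```

Rounding per dimension keeps the divisor at 4 for both board sizes; the two formulas only differ when the board size is even.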

glinscott commented 6 years ago

I have now validated that the GPU/CPU winograd paths exactly match the output on bench from the previous GPU/CPU paths.

It required this piece of magic though: https://github.com/glinscott/leela-chess/commit/0d9dc18db1b04c49a3b277cbd4c436cb610923c4#diff-4bf813f8583371d38812a265361e016fR716

@gcp you were right, it only needs 16 tiles. However, somehow that meant the V array was getting sized too small.

glinscott / leela-chess

Port winograd implementation from leela-zero #10