hi-ogawa / Stockfish

stockfish-nnue.wasm
https://stockfish-nnue-wasm.vercel.app
GNU General Public License v3.0

TODO for the next release with Stockfish 14 features #8

Closed hi-ogawa closed 3 years ago

hi-ogawa commented 3 years ago

I upgraded my local version of emscripten and fixed small errors/warnings in this commit https://github.com/hi-ogawa/Stockfish/commit/728b8621d27742dc46b8ec9bd3e92eda9432cf09.

EDIT: Stockfish commit https://github.com/official-stockfish/Stockfish/commit/f90274d8 is the one just before the new net architecture commit https://github.com/official-stockfish/Stockfish/commit/e8d64af1.

hi-ogawa commented 3 years ago

Having some difficulty adopting the changes from the new net architecture commit https://github.com/official-stockfish/Stockfish/commit/e8d64af1. The build succeeds, but the evaluation doesn't match the official build. Specifically, ARCH=wasm wasm_simd_post_mvp=yes and ARCH=wasm wasm_simd=yes are incorrect, while plain ARCH=wasm works fine.

I will list some debugging plans here:


EDIT: It seems affine_transform.h has some issue when InputDimensions != PaddedInputDimensions, since it's verified that the code below (which takes the wasm SIMD path only when InputDimensions == PaddedInputDimensions) works:

    const OutputType* propagate(
        const TransformedFeatureType* transformedFeatures, char* buffer) const {
      const auto input = previousLayer.propagate(
          transformedFeatures, buffer + SelfBufferSize);

#if defined(USE_WASM_SIMD)
      if constexpr (InputDimensions == PaddedInputDimensions) {  // <<== Use wasm simd kernel only on this condition
        // Simplify variable names (y = Ax + b)
        static_assert(InputDimensions % 16 == 0);
        constexpr int n = InputDimensions;
        constexpr int m = OutputDimensions;
        constexpr int n_stride = PaddedInputDimensions;
        auto A = *reinterpret_cast<const int8_t(*)[m][n_stride]>(weights);
        auto x = *reinterpret_cast<const uint8_t(*)[n]>(input);
        auto b = *reinterpret_cast<const int32_t(*)[m]>(biases);
        auto y = *reinterpret_cast<int32_t(*)[m]>(buffer);
        emscripten_wasm_simd::affine<n, m, n_stride>(A, x, b, y);
        return y;
      }
#endif
    ....

Maybe overflow? Maybe SIMD alignment issue?
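As a debugging aid, a minimal scalar sketch of the same y = Ax + b computation could be used to cross-check the SIMD kernel. The template parameters mirror the snippet above (row-major weights with row stride n_stride, int32 accumulation throughout); the function name and loop structure are my own, not code from the repository:

    #include <cstdint>

    // Scalar reference: y[i] = b[i] + sum_j A[i][j] * x[j], accumulated in int32.
    // Row i of the weight matrix starts at A + i * n_stride (the padded row
    // length); only the first n entries of each row are real inputs.
    template <int n, int m, int n_stride>
    void affine_reference(const int8_t* A, const uint8_t* x,
                          const int32_t* b, int32_t* y) {
      for (int i = 0; i < m; ++i) {
        int32_t acc = b[i];
        for (int j = 0; j < n; ++j)
          acc += int32_t(A[i * n_stride + j]) * int32_t(x[j]);
        y[i] = acc;
      }
    }

Comparing the wasm kernel against this on random inputs, in particular for layers with InputDimensions != PaddedInputDimensions, should tell whether the mismatch is in the kernel itself or in the surrounding buffer handling.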


EDIT2:

I found a bug in dot4 in "wasm_simd.cpp" and resolved the issue.

// BEFORE
        z0 = wasm_i32x4_add(z0, dot_i8x16(wasm_v128_load(&a[j + 0 * n]), xv));
        z1 = wasm_i32x4_add(z1, dot_i8x16(wasm_v128_load(&a[j + 1 * n]), xv));
        z2 = wasm_i32x4_add(z2, dot_i8x16(wasm_v128_load(&a[j + 2 * n]), xv));
        z3 = wasm_i32x4_add(z3, dot_i8x16(wasm_v128_load(&a[j + 3 * n]), xv));

// AFTER
        z0 = wasm_i32x4_add(z0, dot_i8x16(wasm_v128_load(&a[j + 0 * n_stride]), xv));
        z1 = wasm_i32x4_add(z1, dot_i8x16(wasm_v128_load(&a[j + 1 * n_stride]), xv));
        z2 = wasm_i32x4_add(z2, dot_i8x16(wasm_v128_load(&a[j + 2 * n_stride]), xv));
        z3 = wasm_i32x4_add(z3, dot_i8x16(wasm_v128_load(&a[j + 3 * n_stride]), xv));
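In hindsight the bug is easy to demonstrate in isolation: with padded rows, &a[j + 1 * n] points into the tail padding of the current row rather than at the start of the next row, so whenever n != n_stride the wrong weights are loaded. A tiny self-contained check, reusing the affine_reference scalar sketch from the comment above with a deliberately wrong stride to mimic the same indexing mistake (the sizes are hypothetical, only n != n_stride matters):

    #include <cassert>
    #include <cstdint>

    // affine_reference: the scalar y = Ax + b sketch from the previous comment.

    int main() {
      constexpr int n = 16, m = 4, n_stride = 32;  // hypothetical sizes, n != n_stride

      int8_t  A[m * n_stride] = {};
      uint8_t x[n] = {};
      int32_t b[m] = {}, y_ok[m], y_bad[m];

      A[1 * n_stride + 0] = 1;  // a single weight in row 1, column 0
      x[0] = 1;

      affine_reference<n, m, n_stride>(A, x, b, y_ok);  // correct row stride
      affine_reference<n, m, n>(A, x, b, y_bad);        // n as row stride, like the bug

      assert(y_ok[1] == 1);   // the weight in row 1 is picked up
      assert(y_bad[1] == 0);  // the buggy stride reads row 0's padding instead
      return 0;
    }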
hi-ogawa commented 3 years ago

I finished resolving conflicts and rebased the single commit onto the latest upstream.

Thanks to Sopel for making the strided affine implementation https://github.com/hi-ogawa/Stockfish/pull/7/files#diff-599f7ecb14d235a2360be139f960404f3849d9f1a320749969a3a8a71e58f3b6.

Currently, testing is done only by comparing the "Nodes searched" statistic from bench 16 1 25 current depth NNUE.

# build ARCH=x86-64-avx2
$ ./src/stockfish bench 16 1 25 current depth NNUE
Total time (ms) : 13303
Nodes searched  : 11562893
Nodes/second    : 869194

# emscripten_build ARCH=wasm wasm_simd_post_mvp=yes
$ node -e 'console.log(`node: ${process.version}\nv8: ${process.versions.v8}`)'
node: v16.4.2
v8: 9.1.269.36-node.14
$ node --experimental-wasm-{simd,threads} src/emscripten/public/uci.js bench 16 1 25 current depth NNUE
Total time (ms) : 28829
Nodes searched  : 11562893
Nodes/second    : 401085

EDIT:

One thing that is a bit concerning is that the initial loading time got worse:

$ time node --experimental-wasm-{simd,threads} src/emscripten/public/uci.js quit 
Stockfish 280721 by the Stockfish developers (see AUTHORS file)

real    0m2,546s
user    0m4,477s
sys     0m0,385s

compared to the one from issue https://github.com/hi-ogawa/Stockfish/issues/1:

$ time /usr/bin/node --experimental-wasm-{simd,threads} --wasm-simd-post-mvp src/emscripten/public/uci.js uci > /dev/null

real    0m1.289s
user    0m3.346s
sys     0m0.295s

Possibly the emscripten upgrade, rather than the doubling of the embedded weights, is the reason for this.

hi-ogawa commented 3 years ago

Regarding the initial loading time, it turns out it depends on V8's wasm compilation strategy.

EDIT: the startup with --wasm-lazy-compilation looks slow from the "time" results below, but it actually seems the time is spent when the node process exits. So the actual responsiveness is more or less the same as in --no-wasm-tier-up mode.

# Liftoff + tier-up to TurboFan (default on node?)
$ time node --liftoff --wasm-tier-up --experimental-wasm-{simd,threads} src/emscripten/public/uci.js quit > /dev/null

real    0m2,497s
user    0m4,479s
sys     0m0,386s

# Liftoff only
$ time node --liftoff --no-wasm-tier-up --experimental-wasm-{simd,threads} src/emscripten/public/uci.js quit > /dev/null

real    0m0,562s
user    0m0,517s
sys     0m0,133s

# Lazy compilation (default on browser?)
$ time node --wasm-lazy-compilation --experimental-wasm-{simd,threads} src/emscripten/public/uci.js quit > /dev/null

real    0m2,210s
user    0m2,781s
sys     0m0,304s
$ node --liftoff --wasm-tier-up --experimental-wasm-{simd,threads} src/emscripten/public/uci.js bench 16 1 25 current depth NNUE
Total time (ms) : 29388
Nodes searched  : 11562893
Nodes/second    : 393456

$ node --liftoff --no-wasm-tier-up --experimental-wasm-{simd,threads} src/emscripten/public/uci.js bench 16 1 25 current depth NNUE
Total time (ms) : 50598
Nodes searched  : 11562893
Nodes/second    : 228524

$ node --wasm-lazy-compilation --experimental-wasm-{simd,threads} src/emscripten/public/uci.js bench 16 1 25 current depth NNUE
Total time (ms) : 28169
Nodes searched  : 11562893
Nodes/second    : 410482


hi-ogawa commented 3 years ago

Except for the problem with the initial loading speed (which is still tracked in issue https://github.com/hi-ogawa/Stockfish/issues/1), applying the changes from Stockfish 14 seems to have gone well. I switched the default branch to emscripten-237ed1ef-2.0.26 and published a new package as 1.0.0-alpha.2, so I'll close this issue.