**Closed** · hi-ogawa closed this issue 3 years ago
I upgraded my local version of emscripten and fixed some small errors/warnings in this commit: https://github.com/hi-ogawa/Stockfish/commit/728b8621d27742dc46b8ec9bd3e92eda9432cf09.

EDIT: Stockfish commit https://github.com/official-stockfish/Stockfish/commit/f90274d8 is the one just before the new net architecture commit https://github.com/official-stockfish/Stockfish/commit/e8d64af1.

I'm having some difficulty adopting the changes from the new net architecture commit https://github.com/official-stockfish/Stockfish/commit/e8d64af1. The build succeeds, but evaluation doesn't match the official build. Specifically, `ARCH=wasm wasm_simd_post_mvp=yes` and `ARCH=wasm wasm_simd=yes` are incorrect, but `ARCH=wasm` works fine.
I will list some debugging plans here:

- Remove `#if defined(USE_WASM_SIMD)` from individual layer files to identify which wasm SIMD kernel has the issue.

EDIT: It seems `affine_transform.h` with `InputDimensions != PaddedInputDimensions` has an issue, since it's verified that the code below works:
```cpp
const OutputType* propagate(
    const TransformedFeatureType* transformedFeatures, char* buffer) const {
  const auto input = previousLayer.propagate(
      transformedFeatures, buffer + SelfBufferSize);
#if defined(USE_WASM_SIMD)
  if constexpr (InputDimensions == PaddedInputDimensions) { // <<== Use wasm simd kernel only on this condition
    // Simplify variable names (y = Ax + b)
    static_assert(InputDimensions % 16 == 0);
    constexpr int n = InputDimensions;
    constexpr int m = OutputDimensions;
    constexpr int n_stride = PaddedInputDimensions;
    auto A = *reinterpret_cast<const int8_t(*)[m][n_stride]>(weights);
    auto x = *reinterpret_cast<const uint8_t(*)[n]>(input);
    auto b = *reinterpret_cast<const int32_t(*)[m]>(biases);
    auto y = *reinterpret_cast<int32_t(*)[m]>(buffer);
    emscripten_wasm_simd::affine<n, m, n_stride>(A, x, b, y);
    return y;
  }
#endif
  // ...
```
Maybe overflow? Maybe SIMD alignment issue?
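For reference, the computation the kernel performs is a plain affine transform `y = Ax + b` where each weight row is padded to `n_stride` bytes but only the first `n` columns are real inputs. A minimal scalar sketch (my own illustrative code with a hypothetical name, not the engine's actual kernel) showing that indexing:

```cpp
#include <cassert>
#include <cstdint>

// Scalar reference for y = A x + b.
// Each row of A is n_stride wide in memory, but only the first n
// columns hold real inputs; the rest is zero padding.
template <int n, int m, int n_stride>
void affine_ref(const int8_t (&A)[m][n_stride], const uint8_t (&x)[n],
                const int32_t (&b)[m], int32_t (&y)[m]) {
  static_assert(n_stride >= n, "padded width must cover the real inputs");
  for (int i = 0; i < m; ++i) {
    int32_t acc = b[i];
    for (int j = 0; j < n; ++j)  // only the first n columns participate
      acc += int32_t(A[i][j]) * int32_t(x[j]);
    y[i] = acc;
  }
}
```

A SIMD kernel must reproduce exactly this addressing: row `i` starts at offset `i * n_stride`, not `i * n`.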
EDIT2: I found a bug in `dot4` in `wasm_simd.cpp` and resolved the issue.
```cpp
// BEFORE
z0 = wasm_i32x4_add(z0, dot_i8x16(wasm_v128_load(&a[j + 0 * n]), xv));
z1 = wasm_i32x4_add(z1, dot_i8x16(wasm_v128_load(&a[j + 1 * n]), xv));
z2 = wasm_i32x4_add(z2, dot_i8x16(wasm_v128_load(&a[j + 2 * n]), xv));
z3 = wasm_i32x4_add(z3, dot_i8x16(wasm_v128_load(&a[j + 3 * n]), xv));

// AFTER
z0 = wasm_i32x4_add(z0, dot_i8x16(wasm_v128_load(&a[j + 0 * n_stride]), xv));
z1 = wasm_i32x4_add(z1, dot_i8x16(wasm_v128_load(&a[j + 1 * n_stride]), xv));
z2 = wasm_i32x4_add(z2, dot_i8x16(wasm_v128_load(&a[j + 2 * n_stride]), xv));
z3 = wasm_i32x4_add(z3, dot_i8x16(wasm_v128_load(&a[j + 3 * n_stride]), xv));
```
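The effect of this bug is easy to see in scalar form: stepping rows by `n` instead of `n_stride` makes every row after the first start at the wrong offset in the padded layout, so only row 0 produces correct results when `n != n_stride`. A small illustrative sketch (not the actual wasm kernel):

```cpp
#include <cassert>
#include <cstdint>

// Dot product of "row i" of a flattened, padded weight matrix with x,
// using the given row stride. n is the number of real columns.
int32_t row_dot(const int8_t* A, const uint8_t* x, int i, int n, int stride) {
  int32_t acc = 0;
  for (int j = 0; j < n; ++j)
    acc += int32_t(A[i * stride + j]) * int32_t(x[j]);
  return acc;
}
```

With 2 rows of 2 real columns padded to a stride of 4, reading row 1 with stride `n` lands in the zero padding of row 0 instead of the actual weights.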
I finished resolving conflicts and rebased a single commit on the latest upstream. Thanks to Sopel for making the strided affine implementation: https://github.com/hi-ogawa/Stockfish/pull/7/files#diff-599f7ecb14d235a2360be139f960404f3849d9f1a320749969a3a8a71e58f3b6.

Currently, testing is done only by comparing the "Nodes searched" statistic from `bench 16 1 25 current depth NNUE`:
```shell
# build ARCH=x86-64-avx2
$ ./src/stockfish bench 16 1 25 current depth NNUE
Total time (ms) : 13303
Nodes searched  : 11562893
Nodes/second    : 869194

# emscripten_build ARCH=wasm wasm_simd_post_mvp=yes
$ node -e 'console.log(`node: ${process.version}\nv8: ${process.versions.v8}`)'
node: v16.4.2
v8: 9.1.269.36-node.14
$ node --experimental-wasm-{simd,threads} src/emscripten/public/uci.js bench 16 1 25 current depth NNUE
Total time (ms) : 28829
Nodes searched  : 11562893
Nodes/second    : 401085
```
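The comparison above can be scripted. A small sketch (my own helper, using the output format shown above; the engine paths are the ones from this issue):

```shell
#!/bin/sh
# Pull the "Nodes searched" count out of a bench log.
extract_nodes() { grep 'Nodes searched' | tr -dc '0-9'; }

# Usage (commented out here since it requires both built engines):
#   args="bench 16 1 25 current depth NNUE"
#   native=$(./src/stockfish $args 2>&1 | extract_nodes)
#   wasm=$(node --experimental-wasm-{simd,threads} \
#            src/emscripten/public/uci.js $args 2>&1 | extract_nodes)
#   [ "$native" = "$wasm" ] && echo "OK: $native" \
#                           || echo "MISMATCH: $native vs $wasm"
```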
EDIT: One thing that's a bit concerning is that the initial loading time got worse:

```shell
$ time node --experimental-wasm-{simd,threads} src/emscripten/public/uci.js quit
Stockfish 280721 by the Stockfish developers (see AUTHORS file)

real	0m2,546s
user	0m4,477s
sys	0m0,385s
```

compared to the one from issue https://github.com/hi-ogawa/Stockfish/issues/1:

```shell
$ time /usr/bin/node --experimental-wasm-{simd,threads} --wasm-simd-post-mvp src/emscripten/public/uci.js uci > /dev/null

real	0m1.289s
user	0m3.346s
sys	0m0.295s
```

Possibly the emscripten upgrade, rather than the doubling of the embedded weights, is the reason for this.
Regarding initial loading time, it turns out it depends on v8's wasm compiler strategy.

EDIT: `--wasm-lazy-compilation`'s startup looks slow in the `time` results below, but it actually seems to spend that time when exiting the node process, so actual responsiveness is more-or-less the same as in `--no-wasm-tier-up` mode.
```shell
# TurboFan only (default on node?)
$ time node --liftoff --wasm-tier-up --experimental-wasm-{simd,threads} src/emscripten/public/uci.js quit > /dev/null

real	0m2,497s
user	0m4,479s
sys	0m0,386s

# Liftoff only
$ time node --liftoff --no-wasm-tier-up --experimental-wasm-{simd,threads} src/emscripten/public/uci.js quit > /dev/null

real	0m0,562s
user	0m0,517s
sys	0m0,133s

# Lazy tier up (default on browser?)
$ time node --wasm-lazy-compilation --experimental-wasm-{simd,threads} src/emscripten/public/uci.js quit > /dev/null

real	0m2,210s
user	0m2,781s
sys	0m0,304s
```
```shell
$ node --liftoff --wasm-tier-up --experimental-wasm-{simd,threads} src/emscripten/public/uci.js bench 16 1 25 current depth NNUE
Total time (ms) : 29388
Nodes searched  : 11562893
Nodes/second    : 393456

$ node --liftoff --no-wasm-tier-up --experimental-wasm-{simd,threads} src/emscripten/public/uci.js bench 16 1 25 current depth NNUE
Total time (ms) : 50598
Nodes searched  : 11562893
Nodes/second    : 228524

$ node --wasm-lazy-compilation --experimental-wasm-{simd,threads} src/emscripten/public/uci.js bench 16 1 25 current depth NNUE
Total time (ms) : 28169
Nodes searched  : 11562893
Nodes/second    : 410482
```
Except for the problem with the initial loading speed (which is still tracked in https://github.com/hi-ogawa/Stockfish/issues/1), applying the changes from Stockfish 14 seems to have gone well. I switched the default branch to `emscripten-237ed1ef-2.0.26` and published a new package as `1.0.0-alpha.2`, so I'll close this issue.

References

- `basic_streambuf.sgetn`
- Branch naming: `emscripten-<upstream-hash>-<emscripten-version>` (like this: `emscripten-f90274d8-2.0.11`); `engine_info`
- `i32x4.dot_i16x8_s` is finalized into the standard SIMD feature set, so the `--wasm-simd-post-mvp` flag isn't needed anymore; even more, `--experimental-wasm-{simd,threads}` is not even necessary...
- `this.onmessage` to `self.onmessage` in worker: https://github.com/emscripten-core/emscripten/pull/14137