**Closed** · hi-ogawa closed this issue 3 years ago
I upgraded my local version of emscripten and fixed some small errors/warnings in this commit: https://github.com/hi-ogawa/Stockfish/commit/728b8621d27742dc46b8ec9bd3e92eda9432cf09.

EDIT: Stockfish commit https://github.com/official-stockfish/Stockfish/commit/f90274d8 is the one just before the new net architecture commit https://github.com/official-stockfish/Stockfish/commit/e8d64af1.

I'm having some difficulty adopting the changes from the new net architecture commit https://github.com/official-stockfish/Stockfish/commit/e8d64af1. The build succeeds, but evaluation doesn't match the official build. Specifically, `ARCH=wasm wasm_simd_post_mvp=yes` and `ARCH=wasm wasm_simd=yes` are incorrect, but `ARCH=wasm` works fine.
I will list some debugging plans here:

- Remove `#if defined(USE_WASM_SIMD)` from individual layer files to identify which wasm SIMD kernel has the issue.

EDIT: It seems `affine_transform.h` with `InputDimensions != PaddedInputDimensions` has an issue, since it's verified that the code below works:
```cpp
const OutputType* propagate(
    const TransformedFeatureType* transformedFeatures, char* buffer) const {
  const auto input = previousLayer.propagate(
      transformedFeatures, buffer + SelfBufferSize);
#if defined(USE_WASM_SIMD)
  if constexpr (InputDimensions == PaddedInputDimensions) { // <<== Use wasm simd kernel only on this condition
    // Simplify variable names (y = Ax + b)
    static_assert(InputDimensions % 16 == 0);
    constexpr int n = InputDimensions;
    constexpr int m = OutputDimensions;
    constexpr int n_stride = PaddedInputDimensions;
    auto A = *reinterpret_cast<const int8_t(*)[m][n_stride]>(weights);
    auto x = *reinterpret_cast<const uint8_t(*)[n]>(input);
    auto b = *reinterpret_cast<const int32_t(*)[m]>(biases);
    auto y = *reinterpret_cast<int32_t(*)[m]>(buffer);
    emscripten_wasm_simd::affine<n, m, n_stride>(A, x, b, y);
    return y;
  }
#endif
  // ...
```
Maybe overflow? Maybe SIMD alignment issue?
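For reference, the computation the kernel performs is a plain affine transform `y = Ax + b` where each weight row is padded to `n_stride` bytes but only the first `n` columns are real inputs. A minimal scalar sketch (my own illustrative code with a hypothetical name, not the engine's actual kernel) showing that indexing:

```cpp
#include <cassert>
#include <cstdint>

// Scalar reference for y = A x + b.
// Each row of A is n_stride wide in memory, but only the first n
// columns hold real inputs; the rest is zero padding.
template <int n, int m, int n_stride>
void affine_ref(const int8_t (&A)[m][n_stride], const uint8_t (&x)[n],
                const int32_t (&b)[m], int32_t (&y)[m]) {
  static_assert(n_stride >= n, "padded width must cover the real inputs");
  for (int i = 0; i < m; ++i) {
    int32_t acc = b[i];
    for (int j = 0; j < n; ++j)  // only the first n columns participate
      acc += int32_t(A[i][j]) * int32_t(x[j]);
    y[i] = acc;
  }
}
```

A SIMD kernel must reproduce exactly this addressing: row `i` starts at offset `i * n_stride`, not `i * n`.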
EDIT2: I found a bug in `dot4` in `wasm_simd.cpp` and resolved the issue.
```cpp
// BEFORE
z0 = wasm_i32x4_add(z0, dot_i8x16(wasm_v128_load(&a[j + 0 * n]), xv));
z1 = wasm_i32x4_add(z1, dot_i8x16(wasm_v128_load(&a[j + 1 * n]), xv));
z2 = wasm_i32x4_add(z2, dot_i8x16(wasm_v128_load(&a[j + 2 * n]), xv));
z3 = wasm_i32x4_add(z3, dot_i8x16(wasm_v128_load(&a[j + 3 * n]), xv));

// AFTER
z0 = wasm_i32x4_add(z0, dot_i8x16(wasm_v128_load(&a[j + 0 * n_stride]), xv));
z1 = wasm_i32x4_add(z1, dot_i8x16(wasm_v128_load(&a[j + 1 * n_stride]), xv));
z2 = wasm_i32x4_add(z2, dot_i8x16(wasm_v128_load(&a[j + 2 * n_stride]), xv));
z3 = wasm_i32x4_add(z3, dot_i8x16(wasm_v128_load(&a[j + 3 * n_stride]), xv));
```
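The effect of this bug is easy to see in scalar form: stepping rows by `n` instead of `n_stride` makes every row after the first start at the wrong offset in the padded layout, so only row 0 produces correct results when `n != n_stride`. A small illustrative sketch (not the actual wasm kernel):

```cpp
#include <cassert>
#include <cstdint>

// Dot product of "row i" of a flattened, padded weight matrix with x,
// using the given row stride. n is the number of real columns.
int32_t row_dot(const int8_t* A, const uint8_t* x, int i, int n, int stride) {
  int32_t acc = 0;
  for (int j = 0; j < n; ++j)
    acc += int32_t(A[i * stride + j]) * int32_t(x[j]);
  return acc;
}
```

With 2 rows of 2 real columns padded to a stride of 4, reading row 1 with stride `n` lands in the zero padding of row 0 instead of the actual weights.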
I finished resolving conflicts and rebased a single commit on the latest upstream. Thanks to Sopel for making the strided affine implementation: https://github.com/hi-ogawa/Stockfish/pull/7/files#diff-599f7ecb14d235a2360be139f960404f3849d9f1a320749969a3a8a71e58f3b6.

Currently, testing is done only by comparing the "Nodes searched" statistic from `bench 16 1 25 current depth NNUE`:
```shell
# build ARCH=x86-64-avx2
$ ./src/stockfish bench 16 1 25 current depth NNUE
Total time (ms) : 13303
Nodes searched  : 11562893
Nodes/second    : 869194

# emscripten_build ARCH=wasm wasm_simd_post_mvp=yes
$ node -e 'console.log(`node: ${process.version}\nv8: ${process.versions.v8}`)'
node: v16.4.2
v8: 9.1.269.36-node.14
$ node --experimental-wasm-{simd,threads} src/emscripten/public/uci.js bench 16 1 25 current depth NNUE
Total time (ms) : 28829
Nodes searched  : 11562893
Nodes/second    : 401085
```
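The comparison above can be scripted. A small sketch (my own helper, using the output format shown above; the engine paths are the ones from this issue):

```shell
#!/bin/sh
# Pull the "Nodes searched" count out of a bench log.
extract_nodes() { grep 'Nodes searched' | tr -dc '0-9'; }

# Usage (commented out here since it requires both built engines):
#   args="bench 16 1 25 current depth NNUE"
#   native=$(./src/stockfish $args 2>&1 | extract_nodes)
#   wasm=$(node --experimental-wasm-{simd,threads} \
#            src/emscripten/public/uci.js $args 2>&1 | extract_nodes)
#   [ "$native" = "$wasm" ] && echo "OK: $native" \
#                           || echo "MISMATCH: $native vs $wasm"
```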
EDIT: One thing that's a bit concerning is that the initial loading time got worse:

```shell
$ time node --experimental-wasm-{simd,threads} src/emscripten/public/uci.js quit
Stockfish 280721 by the Stockfish developers (see AUTHORS file)

real	0m2,546s
user	0m4,477s
sys	0m0,385s
```

compared to the one from issue https://github.com/hi-ogawa/Stockfish/issues/1:

```shell
$ time /usr/bin/node --experimental-wasm-{simd,threads} --wasm-simd-post-mvp src/emscripten/public/uci.js uci > /dev/null

real	0m1.289s
user	0m3.346s
sys	0m0.295s
```

Possibly the emscripten upgrade, rather than the doubling of the embedded weights, is the reason for this.
Regarding initial loading time, it turns out it depends on v8's wasm compiler strategy.

EDIT: `--wasm-lazy-compilation`'s startup looks slow in the `time` results below, but it actually seems to spend that time when exiting the node process, so actual responsiveness is more-or-less the same as in `--no-wasm-tier-up` mode.
```shell
# TurboFan only (default on node?)
$ time node --liftoff --wasm-tier-up --experimental-wasm-{simd,threads} src/emscripten/public/uci.js quit > /dev/null

real	0m2,497s
user	0m4,479s
sys	0m0,386s

# Liftoff only
$ time node --liftoff --no-wasm-tier-up --experimental-wasm-{simd,threads} src/emscripten/public/uci.js quit > /dev/null

real	0m0,562s
user	0m0,517s
sys	0m0,133s

# Lazy tier up (default on browser?)
$ time node --wasm-lazy-compilation --experimental-wasm-{simd,threads} src/emscripten/public/uci.js quit > /dev/null

real	0m2,210s
user	0m2,781s
sys	0m0,304s
```
```shell
$ node --liftoff --wasm-tier-up --experimental-wasm-{simd,threads} src/emscripten/public/uci.js bench 16 1 25 current depth NNUE
Total time (ms) : 29388
Nodes searched  : 11562893
Nodes/second    : 393456

$ node --liftoff --no-wasm-tier-up --experimental-wasm-{simd,threads} src/emscripten/public/uci.js bench 16 1 25 current depth NNUE
Total time (ms) : 50598
Nodes searched  : 11562893
Nodes/second    : 228524

$ node --wasm-lazy-compilation --experimental-wasm-{simd,threads} src/emscripten/public/uci.js bench 16 1 25 current depth NNUE
Total time (ms) : 28169
Nodes searched  : 11562893
Nodes/second    : 410482
```
Except for the problem with the initial loading speed (which is still tracked in https://github.com/hi-ogawa/Stockfish/issues/1), applying the changes from Stockfish 14 seems to have gone well. I switched the default branch to `emscripten-237ed1ef-2.0.26` and published a new package as `1.0.0-alpha.2`, so I'll close this issue.

References

- `basic_streambuf.sgetn`
- Branch naming: `emscripten-<upstream-hash>-<emscripten-version>` (like this: `emscripten-f90274d8-2.0.11`); `engine_info`
- `i32x4.dot_i16x8_s` is finalized into the standard SIMD feature set, so the `--wasm-simd-post-mvp` flag isn't needed anymore; even more, `--experimental-wasm-{simd,threads}` is not even necessary...
- `this.onmessage` to `self.onmessage` in worker: https://github.com/emscripten-core/emscripten/pull/14137