Adds support for WASM 128-bit SIMD

almann commented 2 years ago

There are some TODOs (see below), but the implementation and test suite integration has been done.

Adds feature wasm32_simd128 consistent with aarch64_neon as there is no auto-detection for SIMD in WASM.
Updates README.md with WASM build instructions.
Integrates with tests--though the should_panic test case will be ignored due to limitations of the test driver with wasm32-wasi.
Adds ignore for IntelliJ based IDEs (e.g., CLion).

TODO

[x] Integrate with benchmarks. Naively targeting wasm32-wasi for the benchmarks have issues with both wasmer and wasmtime. I suspect this is partially because of criterion expecting wasm32 targets to run to have wasm-bindgen. One option would be to compile a shim of the library as a staticlib crate targeting wasm32-unknown-unknown and using wasmer or wasmtime to compile/link the function and benchmark it in the current framework.
[x] Validate inlining.
[x] Integrate with CI workflow scripts.
[ ] Run the fuzzer.

Implementation for #36

almann commented 2 years ago

I've added preliminary benchmark integrations in this commit.

The approach is basically what I posited in the TODO above--I cross compile a small C-ABI shim crate targeting wasm32-unknown-unknown into a cdylib and then embed the compiled WASM module into the executable. At runtime, the benchmark compiles the WASM module and copies the benchmark string into the instance's linear memory, and then proceeds to benchmark calling into the appropriate shim function through the WASM runtime's interface (which also has the minor downside of benchmarking the overhead of calling a function through the WASM runtime). This is good enough to make sure that the relative usage of SIMD is actually doing something and to measure the relative speed up on a given WASM runtime (mileage may vary in different VMs on different host architectures).

On my NUC (Intel i7-10710U running Ubuntu 20.04 with performance governor on) I got some promising benchmarks that demonstrate that the WASM SIMD is definitely performing better.

$ for X in std compat basic; do (cargo bench --features=simdutf8_wasm --bench="throughput_wasm_$X" -- --save-baseline wasm-$X); done

Do you want me to add this feature to this PR, or do you want me to break them up (e.g., have this PR stand on its own, and have follow up PRs and make the task list above issues)? I am very sympathetic large PRs so please let me know how you'd like me to proceed.

hkratz commented 2 years ago

First of all: Great work, thank you so much! I will do a first review shortly.

Do you want me to add this feature to this PR, or do you want me to break them up (e.g., have this PR stand on its own, and have follow up PRs and make the task list above issues)? I am very sympathetic large PRs so please let me know how you'd like me to proceed.

You can put it in this PR, just don't rebase for now, so I know which commits I have already looked at.

almann commented 2 years ago

You can put it in this PR, just don't rebase for now, so I know which commits I have already looked at.

Excellent, I have pushed that commit to the PR.

almann commented 2 years ago

What I am not 100% clear on is, if we want to provide a knob so that the library user can turn on the simd implementation at runtime without having to compile with the +simd128 target feature, because for browsers there is wasm-feature-detect which could be used to check for simd availability. What do you think?

My current understanding (which could be wrong) of WASM ISA extensions such as SIMD is is that there is no way within WASM for us to detect this, so the code targeting WASM itself cannot detect SIMD or not--in the host system (e.g., a Javascript host) there may or may not be such a facility, but it is not clear to me how we could reliably interact with it from the library in a portable (as in non-JS engines) way.

I do think from the library's perspective, we should definitely allow static flexibility at a minimum because ultimately we want to be able target any WASM runtime irrespective of its native capabilities.

almann commented 2 years ago

Added the CI including inlining tests, I had it configured in another branch to run with that branch, here was a run of the workflow:

https://github.com/almann/simdutf8/actions/runs/1655415697

hkratz commented 2 years ago

Looking good! It will be a few days until I have time to review the rest though.

hkratz commented 2 years ago

@almann Are you sure there is no way around the shim for benchmarking? It looks like there is some basic support for wasi benchmarking in the criterion master branch: https://github.com/bheisler/criterion.rs/issues/461#issuecomment-886634148.

almann commented 2 years ago

@almann Are you sure there is no way around the shim for benchmarking? It looks like there is some basic support for wasi benchmarking in the criterion master branch: bheisler/criterion.rs#461 (comment).

Diving into that issue and the associated PR in the downstream dependency (plotters), is that both are still open (and I think I saw this when I was trying to target the benchmarks to wasm32-wasi). The problem I had run into against released criterion.rs was that the downstream dependency for wasm32 targetted code assumes wasm-pack which leads to a bunch of undefined symbols. Looking more closely at the PR files, the patch looks pretty trivial and explains my experience, with the build.

@hkratz, did I maybe miss something here?

almann commented 2 years ago

I also added Wasmtime support to the benchmarks in https://github.com/rusticstuff/simdutf8/pull/56/commits/48eacc43460074f7b6fe2b6a00147bbe6fa0ab93.

I found actually benchmarks on the different WASM runtimes to be of interest, so assuming we keep the shim approach, I think these benchmarks are useful and a user can select whichever one (or all of them) as they see fit.

For example here is the comparison of Wasmer Cranelift/LLVM back-ends vs. Wasmtime (which only has a Cranelift backend) for the basic validator (on Ubuntu 20.04, Ryzen 3950X [Zen2]):

The caveat of the shim benchmarks apply, but it does suggest interesting differences between runtime and this code.

almann commented 2 years ago

Diving into that issue and the associated PR in the downstream dependency (plotters), is that both are still open (and I think I saw this when I was trying to target the benchmarks to wasm32-wasi). The problem I had run into against released criterion.rs was that the downstream dependency for wasm32 targetted code assumes wasm-pack which leads to a bunch of undefined symbols. Looking more closely at the PR files, the patch looks pretty trivial and explains my experience, with the build.

@hkratz, did I maybe miss something here?

FWIW, I tried playing around with this. I can definitely patch around plotters (check it out in a submodule, patch it with the PR above and make sure the version lines up), but I run into problems with rayon needing a thread-pool. A comment in the PR above alludes to rayon/plotters being optional, but I see no such evidence of this in criterion's TOML file:

https://github.com/bheisler/criterion.rs/blob/4e773a3b8523a73e7105e11f0b2d4b545827712e/Cargo.toml#L35

https://github.com/bheisler/criterion.rs/blob/4e773a3b8523a73e7105e11f0b2d4b545827712e/Cargo.toml#L42-L45

RUSTFLAGS="-C target-feature=+simd128" CARGO_TARGET_WASM32_WASI_RUNNER="wasm-runner wasmer" cargo bench --no-default-features --target=wasm32-wasi --bench=throughput_basic

...

thread 'main' panicked at 'The global thread pool has not been initialized.: ThreadPoolBuildError { kind: IOError(Error { kind: Unsupported, message: "operation not supported on this platform" }) }', /home/almann/.cargo/registry/src/github.com-1ecc6299db9ec823/rayon-core-1.9.1/src/registry.rs:170:10
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
error: failed to run `/home/almann/CLionProjects/simdutf8/bench/target/wasm32-wasi/release/deps/throughput_basic-716ee0bc7dd8261c.wasm`
│   1: RuntimeError: unreachable
           at __rust_start_panic (throughput_basic-716ee0bc7dd8261c.wasm[2750]:0x1dd651)
           at rust_panic (throughput_basic-716ee0bc7dd8261c.wasm[2728]:0x1dcd76)
           at std::panicking::rust_panic_with_hook::heefe90fce7240a03 (throughput_basic-716ee0bc7dd8261c.wasm[2721]:0x1dc9e3)
           at std::panicking::begin_panic_handler::{{closure}}::h4f5bcef500c8e46d (throughput_basic-716ee0bc7dd8261c.wasm[2702]:0x1dbab7)
           at std::sys_common::backtrace::__rust_end_short_backtrace::h53dbd65c57a9d0e6 (throughput_basic-716ee0bc7dd8261c.wasm[2701]:0x1dba06)
           at rust_begin_unwind (throughput_basic-716ee0bc7dd8261c.wasm[2720]:0x1dc531)
           at core::panicking::panic_fmt::h92e73467e6a2c091 (throughput_basic-716ee0bc7dd8261c.wasm[2894]:0x1eb224)
           at core::result::unwrap_failed::h03dbf875e0072fe2 (throughput_basic-716ee0bc7dd8261c.wasm[2921]:0x1ec477)
           at rayon_core::registry::global_registry::hb1f293f2c8dfd05b (throughput_basic-716ee0bc7dd8261c.wasm[1292]:0x12d0e8)
           at rayon_core::current_num_threads::h35bc7a6b8dbf8c63 (throughput_basic-716ee0bc7dd8261c.wasm[1303]:0x12e058)
           at rayon::iter::plumbing::bridge::ha8b8b23fcd9b38a6 (throughput_basic-716ee0bc7dd8261c.wasm[1021]:0x1058ae)
           at criterion::analysis::estimates::h6521ee9def717bc3 (throughput_basic-716ee0bc7dd8261c.wasm[659]:0xb4089)
           at criterion::analysis::common::h567d2c8475d0f44a (throughput_basic-716ee0bc7dd8261c.wasm[244]:0x22eb7)
           at criterion::benchmark_group::BenchmarkGroup<M>::bench_with_input::hb0a4d9a4d0aa4552 (throughput_basic-716ee0bc7dd8261c.wasm[254]:0x258b5)
           at simdutf8_bench::bench_input::hd027083377a798b4 (throughput_basic-716ee0bc7dd8261c.wasm[115]:0x11047)
           at simdutf8_bench::criterion_benchmark::hf2c8a1c63f585fec (throughput_basic-716ee0bc7dd8261c.wasm[116]:0x1143d)
           at throughput_basic::main::h9dc93b5f5255a651 (throughput_basic-716ee0bc7dd8261c.wasm[245]:0x24030)
           at std::sys_common::backtrace::__rust_begin_short_backtrace::h2c5b7aaf52faf128 (throughput_basic-716ee0bc7dd8261c.wasm[127]:0x1234b)
           at std::rt::lang_start::{{closure}}::h557e866ad9f65275 (throughput_basic-716ee0bc7dd8261c.wasm[147]:0x13ecd)
           at std::rt::lang_start_internal::h44883fd0d3691d00 (throughput_basic-716ee0bc7dd8261c.wasm[2697]:0x1db731)
           at __original_main (throughput_basic-716ee0bc7dd8261c.wasm[246]:0x240dd)
           at _start (throughput_basic-716ee0bc7dd8261c.wasm[20]:0x18e8)
           at _start.command_export (throughput_basic-716ee0bc7dd8261c.wasm[3016]:0x1f37d1)

almann commented 2 years ago

FWIW, The CI failure appears to be transient, I pushed to main on my fork and it ran okay.

https://github.com/almann/simdutf8/actions/runs/1674697191

Looking at the error--there was a problem downloading Wasmer, which didn't fail that step (but failed when attempting to run the tests as the runtime was missing):

   (node:1899) UnhandledPromiseRejectionWarning: RequestError: connect ECONNREFUSED 52.204.121.99:443
      at ClientRequest.<anonymous> (/Users/runner/work/_actions/wasmerio/setup-wasmer/v1/dist/index.js:1:104424)
      at Object.onceWrapper (events.js:300:26)
      at ClientRequest.emit (events.js:215:7)
      at ClientRequest.e.emit (/Users/runner/work/_actions/wasmerio/setup-wasmer/v1/dist/index.js:1:38640)
      at TLSSocket.socketErrorListener (_http_client.js:406:9)
      at TLSSocket.emit (events.js:210:5)
      at emitErrorNT (internal/streams/destroy.js:92:8)
      at emitErrorAndCloseNT (internal/streams/destroy.js:60:3)
      at processTicksAndRejections (internal/process/task_queues.js:80:21)
      at TCPConnectWrap.afterConnect [as oncomplete] (net.js:1129:14)
  (node:1899) UnhandledPromiseRejectionWarning: Unhandled promise rejection. This error originated either by throwing inside of an async function without a catch block, or by rejecting a promise which was not handled with .catch(). (rejection id: 1)
  (node:1899) [DEP0018] DeprecationWarning: Unhandled promise rejections are deprecated. In the future, promise rejections that are not handled will terminate the Node.js process with a non-zero exit code.

rusticstuff / simdutf8

Adds support for WASM 128-bit SIMD #56

TODO