ia0 opened this issue 1 year ago
@ia0 What is the expected timeline for #458? I assume it might make this issue irrelevant?
It depends on when someone takes a look, but I don't expect it to work in the short term. And if it ever works, I don't expect it to be a replacement for the current interpreter, just an alternative, the same way a rewriting interpreter and a simple compiler would be alternatives to the current in-place interpreter. They all provide a different trade-off between performance, footprint, and portability. I'll update the issue with this alternative. EDIT: Actually, the issue already mentioned Wasmtime as an option. I just linked the issue.
Thanks for the clarification!
@ia0 I was wondering if it is worth considering using `wasmi`.
In https://google.github.io/wasefire/faq.html#why-implement-a-new-interpreter, it says "`wasmi` consumes too much RAM for embedded". However, in a recent release `wasmi` migrated from a stack-based IR to a register-based IR, and the release notes say: "The new register-based IR was carefully designed to enhance execution performance and to minimize memory usage. As the vast majority of a Wasm binary is comprised of encoded instructions, this substantially decreases memory usage and enhances cache efficiency when executing Wasm through Wasmi. [...] with a fantastic startup performance and low memory consumption especially suited for embedded environments."
(Of course, we would still need to implement streamed compilation to flash.)
> I was wondering if it is worth considering using `wasmi`.
It makes sense to give it a try if they now claim to be suited for embedded environments. That would give us one more comparison point along the following dimensions: interpreter code size, interpreter RAM usage, and interpreter performance.
For testing purposes, let's first modify `wasefire-scheduler` in place to use `wasmi` instead of `wasefire-interpreter`. This is just a quick and dirty solution to assess the viability of `wasmi`. If the results are good, we can create a feature to choose between both implementations.
> Of course, we would still need to implement streamed compilation to flash
Yes, that's probably a necessary step, but it could be done in a second phase.
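To make the "persistent streamed translation" idea concrete, here is a minimal sketch in Rust. All names (`FlashSink`, `translate_streaming`) are hypothetical, not wasefire or wasmi APIs, and the "translation" is a placeholder: the point is only that the module is consumed incrementally and the translated output is flushed to a flash-like sink page by page, so neither the module nor its translation needs to fit in RAM.

```rust
/// Abstracts a flash-backed output (in practice writes would be
/// word- or page-aligned and go through a flash driver).
trait FlashSink {
    fn write(&mut self, chunk: &[u8]);
}

/// Stand-in for a real flash driver, for demonstration purposes.
struct RamSink(Vec<u8>);

impl FlashSink for RamSink {
    fn write(&mut self, chunk: &[u8]) {
        self.0.extend_from_slice(chunk)
    }
}

/// Feed the module in as a byte stream; emit translated output incrementally.
fn translate_streaming(module: impl Iterator<Item = u8>, sink: &mut impl FlashSink) {
    let mut buf = [0u8; 4]; // a page-sized buffer in a real implementation
    let mut len = 0;
    for byte in module {
        buf[len] = byte.wrapping_add(1); // placeholder for real translation
        len += 1;
        if len == buf.len() {
            sink.write(&buf); // flush a full page to flash
            len = 0;
        }
    }
    sink.write(&buf[..len]); // flush the partial last page
}

fn main() {
    let mut sink = RamSink(Vec::new());
    translate_streaming([0u8, 1, 2, 3, 4].into_iter(), &mut sink);
    println!("{:?}", sink.0);
}
```

The key constraint this models is that the translator only ever holds one page of output (plus whatever per-function state it needs), which is what distinguishes persistent translation to flash from in-RAM translation.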
First, I just wanted to confirm my understanding of `Scheduler::run(wasm)`:

1) `scheduler.load(wasm)` may call host functions, and these calls are handled by `scheduler.process_applet()` within `scheduler.load(wasm)`. This means we would need to fit `wasmi`'s host-function API into `scheduler.process_applet()`.

2) `scheduler.flush_events()` handles pending events from the board, such as button presses.

For performance evaluation purposes, can we ignore host functions? In other words, can we compare only `scheduler.load(wasm)` and "`wasmi.load(wasm)`" without host functions in the wasm, and remove this infinite loop? Thanks.
It's true that the need for an in-place interpreter (in particular regarding RAM usage) was not the only reason to write a custom interpreter. There was also the multiplexing at host-function level. That said, I believe this last part could be done otherwise, as long as interpreters have a way to give control back to the host (e.g. using fuel in wasmi or epochs in wasmtime).
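As a self-contained toy illustration of the fuel mechanism mentioned above (this is a sketch of the idea, not the wasmi API): the interpreter decrements a fuel counter per instruction and returns control to the host when fuel runs out, so the host can handle pending events before resuming.

```rust
/// Result of running the guest for one fuel budget.
enum Step {
    Done(i64),
    OutOfFuel,
}

/// Toy "interpreter": executes `steps_left` dummy instructions.
struct Interp {
    counter: i64,    // stand-in for guest state
    steps_left: u32, // remaining work in the "program"
}

impl Interp {
    /// Run until completion or until `fuel` instructions have executed.
    fn run(&mut self, mut fuel: u32) -> Step {
        while self.steps_left > 0 {
            if fuel == 0 {
                return Step::OutOfFuel; // host regains control here
            }
            fuel -= 1;
            self.steps_left -= 1;
            self.counter += 1; // the "instruction"
        }
        Step::Done(self.counter)
    }
}

fn main() {
    let mut interp = Interp { counter: 0, steps_left: 10 };
    let mut yields = 0;
    // The host schedules the guest in 4-instruction slices.
    loop {
        match interp.run(4) {
            Step::Done(n) => {
                println!("done: {} after {} yields", n, yields);
                break;
            }
            // In the scheduler this is where flush_events() could run.
            Step::OutOfFuel => yields += 1,
        }
    }
}
```

With a real engine the same shape applies: configure fuel (or an epoch deadline), call into the guest, and treat the out-of-fuel trap or resumable return as the yield point for host-side multiplexing.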
I guess the simplest way to check performance is to use my `riscv` branch and add wasmi as a runtime. This should be easy, since it should behave similarly to wasm3.
CoreMark results based on your `riscv` branch, with a Linux docker container on my personal laptop:
**wasmi**

CoreMark result: 692.2593 (in 17.595s)
CoreMark result: 663.20996 (in 18.242s)
CoreMark result: 620.90765 (in 19.381s)
CoreMark result: 657.69806 (in 18.73s)
CoreMark result: 689.52545 (in 17.623s)

**wasm3**

CoreMark result: 881.8342 (in 13.633s)
CoreMark result: 865.6646 (in 13.928s)
CoreMark result: 947.5407 (in 12.794s)
CoreMark result: 951.55707 (in 12.729s)
CoreMark result: 957.02106 (in 12.683s)

**base**

CoreMark result: 25.864857 (in 19.304s)
CoreMark result: 27.917364 (in 18.341s)
CoreMark result: 27.032507 (in 18.921s)
CoreMark result: 27.77392 (in 18.362s)
CoreMark result: 27.847397 (in 18.577s)
`wasmi` looks quite competitive. WDYT?
Edit: Another advantage of `wasmi` is that it supports streamed translation -- https://wasmi-labs.github.io/blog/posts/wasmi-v0.32/#non-streaming-translation
Thanks! Can you push your branch on your fork? I would like to test on embedded devices, which is what matters.
> Another advantage of `wasmi` is that it supports streamed translation
That's already something, but it's the same for the current interpreter. What is important is not only streamed translation, but also persistent translation (not to RAM, but to flash).
> Thanks! Can you push your branch on your fork? I would like to test on embedded devices, which is what matters.
Here is the branch with the `wasmi` runtime. Looking forward to the results on an embedded device.
> That's already something, but it's the same for the current interpreter.
What do you mean by "the same"? Thanks.
Thanks! So here are the results (linux is my machine and nordic is nRF52840):
| target | runtime | coremark | time | code | RAM |
|---|---|---|---|---|---|
| linux | base | 28.510336 | 17.991s | | |
| linux | wasm3 | 2592.5205 | 19.479s | | |
| linux | wasmi | 1297.4375 | 23.742s | | |
| nordic | base | 0.088684715 | 225.861s | 136K | 5416 |
| nordic | wasm3 | | | | |
| nordic | wasmi | 3.394433 | 20.678s | 912K | 91960 |
We can see the speedup of wasmi over base is ~45x on linux and ~38x on nordic, so quite comparable in order of magnitude. We also see that wasm3 is ~2x faster than wasmi, so I would expect something similar on nordic if wasm3 compiled there.
It is also important to note that wasmi is ~7x bigger than base in code size and ~17x bigger than base in RAM usage. That's quite a big issue and a no-go to use as-is.
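For reference, the ratios quoted above follow directly from the table's numbers; a quick sanity check:

```rust
fn main() {
    // wasmi vs base on linux: 1297.4375 / 28.510336
    let linux_speedup = 1297.4375_f64 / 28.510336;
    // wasmi vs base on nordic: 3.394433 / 0.088684715
    let nordic_speedup = 3.394433_f64 / 0.088684715;
    // wasm3 vs wasmi on linux: 2592.5205 / 1297.4375
    let wasm3_vs_wasmi = 2592.5205_f64 / 1297.4375;
    println!("{:.1}x {:.1}x {:.1}x", linux_speedup, nordic_speedup, wasm3_vs_wasmi);
}
```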
So I think we should instead implement the optimizations described by Ben Titzer in https://arxiv.org/abs/2205.01183 and redo this benchmark. I would expect between ~2x and ~10x improvement on coremark with little code and RAM increase.
> What do you mean by "the same"? Thanks.
I mean the validation step, which is done linearly. It's not pure streaming because it still takes a slice as input, but it processes the slice linearly without relying on that assumption. It wouldn't be a big change to fix.
By the way, once #523 is merged, could you create a PR to the `dev/wasm-bench` branch with your commit adding `wasmi`? This will be useful for future comparisons.
Thanks for testing on nordic! (I should think harder about how to do that by myself in my remote work set-up.)
I just added the optimization for wasmi in #524 according to its documentation. On linux, it improves the CoreMark from ~660 to ~1100. Could you give this optimized `wasmi` another try on nordic? Thanks!
I'll look into the in-place optimization paper.
Thanks! I think those optimizations make sense in general, so I enabled them for all runtimes in your PR. Here are the results (linux is not the same machine as before, but nordic is the same):
| target | runtime | coremark | time | code | RAM |
|---|---|---|---|---|---|
| linux | base | 27.179453 | 18.802s | | |
| linux | wasm3 | 2169.0405 | 18.947s | | |
| linux | wasmi | 1112.2234 | 28.174s | | |
| nordic | base | 0.09126336 | 219.468s | 144K | 5416 |
| nordic | wasm3 | | | | |
| nordic | wasmi | 4.488666 | 15.643s | 820K | 91960 |
We see some improvement, but the code size is still unacceptable.
By the way, wasmtime recently decided to support no-std https://github.com/bytecodealliance/wasmtime/issues/8341. Could you also add a similar runtime support for wasmtime as you did for wasmi? The code should be rather similar. I'm curious to see if it already works on nordic.
On my linux docker container, the `wasmtime` CoreMark is ~20 times that of `wasm3`. Is this expected?
IIUC, JIT compilers such as `wasmtime` are more suitable for compute-intensive wasm workloads, while rewriting interpreters are more suitable for translation-intensive workloads. I was wondering whether we should prioritize optimizing execution time or translation time, or potentially both.
Yes, Wasmtime is much faster because it's compiled. But if we can get a compiler that is small in code size and doesn't use too much RAM, then we take it.
Regarding prioritization, there's not a single goal. In the end, we want the user to be able to choose the trade-off between performance, security, footprint, and portability. So we want to provide as many alternatives (that are different enough from each other) as possible.
1) I found another wasm interpreter in Rust named stitch. According to its README, it has CoreMark results similar to wasm3's on Linux, and it relies on LLVM's sibling-call optimization on 64-bit platforms. But unfortunately, "Stitch currently does not run on 32-bit platforms. The reason for this is that I have not yet found a way to get LLVM to perform sibling call optimisation on these platforms (ideas welcome)." So it is probably not generally applicable to embedded for now. But there are some 64-bit embedded architectures. WDYT?
2) I looked more into the in-place optimization paper, and "[i]t is implemented using a macro assembler that generates x86-64 machine code", as "[it] allows for near-perfect register allocation and unlocks all possible dispatch and organization techniques". So the implementation seems heavily dependent on the architecture, and I guess an implementation in a high-level language like Rust may have worse performance. Is there a primary embedded architecture we want to support, like `risc-v`? Or do we ideally want architecture independence?
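To make the sibling-call limitation from point 1) concrete, here is a toy Rust sketch of the dispatch style such interpreters rely on (illustrative only, not stitch's actual code): each opcode handler does its work and then calls the next handler in tail position. When LLVM performs sibling-call optimization (which stitch only gets on 64-bit targets), each such call becomes a jump and the native stack stays flat; when it does not, the stack grows by one frame per interpreted instruction.

```rust
// Opcode handlers all share one signature so they fit a dispatch table.
type Handler = fn(&mut Vm);

struct Vm {
    code: Vec<u8>,   // opcode stream: 0 = push1, 1 = add, 2 = halt
    pc: usize,
    stack: Vec<i64>, // operand stack
}

const TABLE: [Handler; 3] = [op_push1, op_add, op_halt];

/// Fetch the next opcode and jump to its handler.
fn dispatch(vm: &mut Vm) {
    let op = vm.code[vm.pc] as usize;
    vm.pc += 1;
    TABLE[op](vm); // tail position: a sibling call if LLVM optimizes it
}

fn op_push1(vm: &mut Vm) {
    vm.stack.push(1);
    dispatch(vm) // tail-call the next handler instead of returning
}

fn op_add(vm: &mut Vm) {
    let b = vm.stack.pop().unwrap();
    let a = vm.stack.pop().unwrap();
    vm.stack.push(a + b);
    dispatch(vm)
}

fn op_halt(_vm: &mut Vm) {} // returning unwinds the whole chain

fn main() {
    // Program: push1, push1, add, halt -> leaves 2 on the stack.
    let mut vm = Vm { code: vec![0, 0, 1, 2], pc: 0, stack: vec![] };
    dispatch(&mut vm);
    println!("{:?}", vm.stack); // [2]
}
```

Note that Rust itself makes no tail-call guarantee, which is exactly why this technique ends up depending on what LLVM does per target, and why a portable interpreter for wasefire cannot assume it.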
The current interpreter is very simple:
It is thus an in-place (non-optimized) interpreter according to *A fast in-place interpreter for WebAssembly* by Ben L. Titzer.
We would like to give users control over the tradeoff between performance and footprint (both in flash and memory). The following improvements would go towards this direction:
Open questions:
- `toctou`, which would remove all dead panics? (not just those with a corresponding dynamic check, but all where the compiler doesn't prove it on its own)

Related work:
The work is tracked in the `dev/fast-interp` branch.