Interpreter performance and footprint

ia0 commented 1 year ago

The current interpreter is very simple:

It doesn't pre-compute anything.
It executes directly from flash.

It is thus an in-place (non-optimized) interpreter according to A fast in-place interpreter for WebAssembly by Ben L. Titzer.

We would like to give users control over the tradeoff between performance and footprint (both in flash and memory). The following improvements would go towards this direction:

Implement the optimizations discussed in the paper above, and feature-gate those that don't improve both performance and footprint (like the sidetable).
Implement optimizations similar to what wasm3 does, by translating the opcodes into their function addresses, and feature-gate them.
Implement a feature to persist in flash the translated code in the optimization above. This optimization means that a platform update would invalidate the applet (unless the original bytecode is preserved). Such translated code wouldn't need validation and translation when starting and may execute directly.
JIT and AOT are probably out of the question for now. If wasmtime or wasmer end up supporting no-std, the situation might simplify. The difficulty being that they would need to run in a different thread to not take control of the scheduler thread. See #458
Use Pulley which principles and rationale seem to fit very well to our use-case.

Open questions:

Does it make sense to have a feature like toctou which would remove all dead panics? (not just those with a corresponding dynamic check, but all where the compiler doesn't prove it on its own)
How to benchmark? Which suite to use?
If there's a way to detect hot/cold functions and optimizations have a cost per function, only hot functions may be optimized.

Related work:

https://github.com/Neopallium/s1vm aims to be a Rust port of Wasm3, but didn't get any activity since 2 years.

The work is tracked in the dev/fast-interp branch.

zhouwfang commented 6 months ago

@ia0 What is the expected timeline for #458? I assume it might make this issue irrelevant?

ia0 commented 6 months ago

It depends when someone would take a look, but I expect that it won't work in the short term. And if it ever works, I don't expect it to be a replacement of the current interpreter, but just an alternative, the same way having a rewriting interpreter and a simple compiler would be alternatives to the current in-place interpreter. They all provide a different trade-off between performance, footprint, and portability. I'll update the issue with this alternative. EDIT: Actually, the issue was already mentioning Wasmtime as an option. I just linked the issue.

zhouwfang commented 6 months ago

Thanks for the clarification!

zhouwfang commented 5 months ago

@ia0 I was wondering if it is worth considering using wasmi.

In https://google.github.io/wasefire/faq.html#why-implement-a-new-interpreter, it says "wasmi consumes too much RAM for embedded". However, in the recent release wasmi has been migrated from stack-based IR to register-based IR, and the release doc writes, "The new register-based IR was carefully designed to enhance execution performance and to minimize memory usage. As the vast majority of a Wasm binary is comprised of encoded instructions, this substantially decreases memory usage and enhances cache efficiency when executing Wasm through Wasmi. [...] with a fantastic startup performance and low memory consumption especially suited for embedded environments."

(Of course, we would still need to implement streamed compilation to flash.)

ia0 commented 5 months ago

I was wondering if it is worth considering using wasmi.

It makes sense to give it a try if they claim being now suited for embedded environments. That would be one more comparison point along the following dimensions: interpreter code size, interpreter RAM usage, interpreter performance.

For testing purposes, let's first modify wasefire-scheduler in place to use wasmi instead wasefire-interpreter. This is just a quick and dirty solution to assess the vialibility of wasmi. If the results are good, we can create a feature to choose between both implementations.

Of course, we would still need to implement streamed compilation to flash

Yes, that's probably a necessary step, but it could be done in a second phase.

zhouwfang commented 5 months ago

First, just wanted to confirm my understanding about Scheduler::run(wasm): 1) scheduler.load(wasm) may call host functions, and these calls are handled by scheduler.process_applet() within scheduler.load(wasm). This means we would need to "fit" the wasmi's host functions API into scheduler.process_applet().

2) scheduler.flush_events() handles pending events from the board such as button presses.

For performance evaluation purpose, can we ignore host functions? In other words, can we only compare scheduler.load(wasm) and "wasmi.load(wasm)" without host functions in wasm and remove this infinite loop? Thanks.

ia0 commented 5 months ago

It's true that the need for an in-place interpreter (in particular regarding RAM usage) was not the only reason to write a custom interpreter. There was also the multiplexing at host-function level. That said, I believe this last part could be done otherwise as long as interpreters have a way to give back the control flow (e.g. using fuel in wasmi or epoch in wasmtime).

I guess the simplest to check performance is to use my riscv branch and add wasmi as a runtime. This should be easy, since it should behave similarly to wasm3.

zhouwfang commented 5 months ago

CoreMark results based on your riscv branch, with a linux docker container on my personal laptop:

wasmi CoreMark result: 692.2593 (in 17.595s) CoreMark result: 663.20996 (in 18.242s) CoreMark result: 620.90765 (in 19.381s) CoreMark result: 657.69806 (in 18.73s) CoreMark result: 689.52545 (in 17.623s)

wasm3 CoreMark result: 881.8342 (in 13.633s) CoreMark result: 865.6646 (in 13.928s) CoreMark result: 947.5407 (in 12.794s) CoreMark result: 951.55707 (in 12.729s) CoreMark result: 957.02106 (in 12.683s)

base CoreMark result: 25.864857 (in 19.304s) CoreMark result: 27.917364 (in 18.341s) CoreMark result: 27.032507 (in 18.921s) CoreMark result: 27.77392 (in 18.362s) CoreMark result: 27.847397 (in 18.577s)

wasmi looks quite competent. WDYT?

Edit: Another advantage of wasmi is that it supports streamed translation -- https://wasmi-labs.github.io/blog/posts/wasmi-v0.32/#non-streaming-translation

ia0 commented 5 months ago

Thanks! Can you push your branch on your fork? I would like to test on embedded devices, which is what matters.

Another advantage of wasmi is that it supports streamed translation

That's already something, but it's the same for the current interpreter. What is important is not only streamed translation, but also persistent translation (not to RAM, but to flash).

zhouwfang commented 5 months ago

Thanks! Can you push your branch on your fork? I would like to test on embedded devices, which is what matters.

Here is the branch with the wasmi runtime. Looking forward to the result on a embedded device.

That's already something, but it's the same for the current interpreter.

What do you mean by "the same"? Thanks.

ia0 commented 5 months ago

Thanks! So here are the results (linux is my machine and nordic is nRF52840):

target	runtime	coremark	time	code	RAM
linux	base	28.510336	17.991s
linux	wasm3	2592.5205	19.479s
linux	wasmi	1297.4375	23.742s
nordic	base	0.088684715	225.861s	136K	5416
nordic	wasm3
nordic	wasmi	3.394433	20.678s	912K	91960

We can see the speed up between wasmi and base is ~45x on linux and ~38x on nordic, so quite comparable in terms of order of magnitude. We also see that wasm3 is ~2x faster than wasmi, so I would expect something similar on nordic if wasm3 would compile there.

Also important to notice is that wasmi is ~7x bigger than base in terms of code size and ~17x bigger than base in terms of RAM usage. That's quite a big issue and a no-go to use as-is.

So I think we should instead implement the optimizations described by Ben Titzer in https://arxiv.org/abs/2205.01183 and redo this benchmark. I would expect between ~2x and ~10x improvement on coremark with little code and RAM increase.

What do you mean by "the same"? Thanks.

I mean for the validation step, which is done linearly. It's not pure streaming because it still assumes a slice as input, but it processes it linearly without this assumption. It wouldn't be a big change to fix that.

ia0 commented 5 months ago

By the way, once #523 is merged, could you create a PR to the dev/wasm-bench branch with your commit adding wasmi? This will be useful for future comparisons.

zhouwfang commented 4 months ago

Thanks for testing on nordic! (I should think harder about how to do that by myself in my remote work set-up.)

I just added the optimization for wasmi in #524 according to its documentation. On linux, it improves the CoreMark from ~660 to ~1100. Could you give this optimized wasmi another try on nordic? Thanks!

I'll look into the in-place optimization paper.

ia0 commented 4 months ago

Thanks! I think those optimizations make sense in general, so I enabled them for all in your PR. Here are the results (linux is not the same though, but nordic is the same):

target	runtime	coremark	time	code	RAM
linux	base	27.179453	18.802s
linux	wasm3	2169.0405	18.947s
linux	wasmi	1112.2234	28.174s
nordic	base	0.09126336	219.468s	144K	5416
nordic	wasm3
nordic	wasmi	4.488666	15.643s	820K	91960

We see some improvement, but the code size is still unacceptable.

By the way, wasmtime recently decided to support no-std https://github.com/bytecodealliance/wasmtime/issues/8341. Could you also add a similar runtime support for wasmtime as you did for wasmi? The code should be rather similar. I'm curious to see if it already works on nordic.

zhouwfang commented 4 months ago

On my linux docker container, the wasmtime coremark is ~20 times of the wasm3 one. Is this expected?

IIUC, JIT interpreters such as wasmtime are more suitable for compute-intensive wasm workloads, while rewriting interpreters are more suitable for translation-intensive workloads. I was wondering whether we should prioritize the optimization of execution time or translation time, or potentially both.

ia0 commented 4 months ago

Yes, Wasmtime is much faster because it's compiled. But if we can get a compiler that is small in code size and doesn't use too much RAM, then we take it.

Regarding prioritization, there's not a single goal. In the end we want the user to be able to choose the trade-off between performance, security, footprint, and portability. So we want to provide as much alternatives (that are different enough from each other) as possible.

zhouwfang commented 4 months ago

1) I found another wasm interpreter in Rust named stitch. According its README, it has similar Coremark result with wasm3 on Linux, and it relies on the LLVM optimizations on sibling calls on 64-bit platforms. But unfortunately, "Stitch currently does not run on 32-bit platforms. The reason for this is that I have not yet found a way to get LLVM to perform sibling call optimisation on these platforms (ideas welcome)." So it is probably not generally applicable to embedded for now. But there are some 64-bit embedded architectures. WDYT?

2) I looked more into the in-place optimization paper, and "[i]t is implemented using a macro assembler that generates x86-64 machine code", as "[it] allows for near-perfect register allocation and unlocks all possible dispatch and organization techniques". So the implementation seems heavily dependent on the architecture, and I guess a implementation in a high-level language like Rust may have worse performance. Is there a primary embedded architecture we want to support like risc-v? Or do we ideally want architecture independence?

ia0 commented 4 months ago

Indeed, this is too specific to high-end machines and won't work on embedded.
Yes, we won't be able to do all tricks done in Wizard (the language they use). However, I think we should be able to do the key techniques described at the beginning of the 3rd chapter (side-table and value stack). We should also be able to do the dispatch table.

google / wasefire

Interpreter performance and footprint #46