proposal: wasm: use Asyncify for switching goroutines

neelance commented 3 years ago

WebAssembly still has no support for jump instructions or funclets. All control flow between basic blocks has to be expressed via if/switch and loop constructs. This means that a CFG with arbitrary control flow can not be directly expressed in wasm. (I still haven't seen a good explanation for this design choice except that V8 supposedly would have a hard time supporting arbitrary control flow).

The advertised solution for this limitation is to use the Relooper algorithm to turn arbitrary control flow back into structured control flow. Unfortunately this approach is incompatible with unwinding and restoring the stack when switching goroutines. This is why Go is not using the Relooper but instead quite naively uses a jump variable and a jump table at the beginning of each function (a loop and a huge switch). The performance overhead of this approach is quite bad.

In the long term, wasm stack switching might offer a solution that can avoid unwinding the stack. Until then, it has come to my attention that there is an intermediate solution called Asyncify. This is a post-processing pass on a wasm binary that adds stack unwinding but does so in a much more optimized way than the Go compiler currently does.

I propose using Asyncify within the Go compiler. This should give a very significant performance gain. As a consequence we would need to also implement the Relooper algorithm or some newer alternative called the Stackifier. A downside would be that it would add a compile time depencency to Binaryen's wasm-opt command which implements Asyncify. This dependency would go away when wasm stack switching is available.

For more context see https://github.com/WebAssembly/design/issues/796, especially https://github.com/WebAssembly/design/issues/796#issuecomment-739334778.

eliasnaur commented 3 years ago

A downside would be that it would add a compile time depencency to Binaryen's wasm-opt command which implements Asyncify. This dependency would go away when wasm stack switching is available.

Do you mean "compile time" as in during go tool compile, or "link time" as in go tool link? If the latter, it may be worth making wasm-opt optional by repurposing -linkmode external to use wasm-opt and -linkmode internal to use the old, inefficient approach.

It would be a pity to loose the ability to build WASM binaries without any dependencies. How much more work is implementing Asyncify in Go, compared to the (I assume) significant work to implement Relooper or Stackifier?

neelance commented 3 years ago

Do you mean "compile time" as in during go tool compile, or "link time" as in go tool link? If the latter, it may be worth making wasm-opt optional by repurposing -linkmode external to use wasm-opt and -linkmode internal to use the old, inefficient approach.

You are right, my wording was not specific enough. It is "link time", right before emitting the final wasm binary.

Regarding supporting both approaches at once: A big maybe. The Asyncify approach is quite different from what we're doing right now, for example it restores wasm locals. So we would need to also extend our current approach to still be able to use the benefits of Asyncify.

It would be a pity to loose the ability to build WASM binaries without any dependencies. How much more work is implementing Asyncify in Go, compared to the (I assume) significant work to implement Relooper or Stackifier?

They may each use quite a number of helper functions for analysis. I can't say how they compare in terms of total implementation complexity.

What I know for sure is that there is no way around implementing Relooper/Stackifier unless the WebAssembly project reverses course and implements jump/funclets, which seems very unlikely. With Asyncify however there is the option to use it and it is quite likely that it (or any work on our own implementation) will become obsolete once wasm stack switching lands. So from a pure practical perspective I myself am not really willing to spend time on reimplementing Asyncify.

cherrymui commented 3 years ago

Last time I looked (quite a long time ago), Asyncify is mostly suitable to handle unwinding/rewinding at specific locations that are known statically. For Go, however, the unwinding/rewinding is more dynamic, or, if we want to mark them statically, potentially nearly all call sites. Does it work well for our use case? It could be that I didn't understand it correctly, or things have changed. Sorry if that's the case.

neelance commented 3 years ago

Another reason Asyncify is fast is that we analyze the entire program’s call graph to see which functions might unwind, so that we can avoid modifying functions that can’t.

Yes, we need to mark all places that can unwind, for example channel operations. Asyncify should then be able to figure out which call sites may also unwind. I am not sure how many call sites would be affected. Do you have some idea?

cherrymui commented 3 years ago

A stack switch can happen if a goroutine needs to grow its stack. This can happen at the entry of nearly all functions (except nosplit functions). So I think nearly all call sites would be affected.

neelance commented 3 years ago

This is a good question. I hadn't considered it yet. Yes, it switches the Go stack, but do we also have to switch the WebAssembly stack? Couldn't we simply continue after the call to morestack?

cherrymui commented 3 years ago

Couldn't we simply continue after the call to morestack?

That could be a possibility. That would make morestack a bit different from other architectures. For example, currently morestacknever returns, it just does a context switch to the goroutine stack. I don't know what complexity it will be to change that.

Also, once we have threads, we can have preemptions. Preemption check is also at function entries, and morestack is dual-functioned. In this case, we'll do a real goroutine switch.

neelance commented 3 years ago

Also, once we have threads, we can have preemptions. Preemption check is also at function entries, and morestack is dual-functioned. In this case, we'll do a real goroutine switch.

Yes, I fully agree with this conclusion. Using Asyncify would prevent us from doing preemption like Go usually does.

kripken commented 3 years ago

If it would be useful, I could look into porting the Binaryen CFG code (which is ~2,000 lines of C++ that I wrote, and that uses an improved Relooper algorithm) to Go. (I've only just tinkered with Go so far, however, so it would be my first production code...)

neelance commented 3 years ago

If it would be useful, I could look into porting the Binaryen CFG code (which is ~2,000 lines of C++ that I wrote, and that uses an improved Relooper algorithm) to Go.

Thanks for the offer! ~That way we wouldn't need a dependency to wasm-opt.~ (edit: I misunderstood, see below)

kripken commented 3 years ago

Oh sorry, maybe I wasn't clear: That would just handle the relooping. You would still need the asyncify logic as well. That's much larger - just the asyncify pass itself is almost 2,000 lines, and it depends on the rest of the Binaryen optimizer which is an order of magnitude bigger at least.

A relooper in Go would let you emit good code for the case where you don't need stack switching (can Go emit such programs?), or where you use wasm stack switching in the future. But otherwise you'd still need wasm-opt for asyncify.

kripken commented 2 years ago

A relevant update here: there is now a JavaScript Promise Integration (JSPI) proposal for wasm, which basically does what wasm-opt's Asyncify does, but in the wasm VM. That is, Asyncify is like a polyfill for that new feature.

JSPI is not a full wasm stack switching proposal, but it is the first part of that support. It relies on JS Promises, so it is not pure wasm and will have some overhead (as JS must be used to swap between running and paused code). But it has far less overhead than Asyncify (and I believe, the current implementation Go uses in wasm).

JSPI is still experimental, but there is a working prototype in V8 and there is successful integration in Emscripten for example. So this could be a good time to experiment with it.

In particular, if we experiment with that successfully, then the other part of the plan from earlier in this issue could make sense again: We could port a small Relooper implementation to Go, and then that + JSPI should let Go emit pretty efficient wasm I believe. Thoughts?

golang / go

proposal: wasm: use Asyncify for switching goroutines #43033