Open liutao-liu opened 9 months ago
cc @golang/wasm @golang/compiler
I manually modified the wat code to express the inner for
loop using the loop
operation, and the performance after aot compiled increased by 500%. (Added extra tests to make sure the function is working correctly.)
(func $main.testFor (type 0) (param i32) (result i32)
(local i32 i64 i64 i64 i64)
global.get 0
local.set 1
loop ;; label = @1
block ;; label = @2
block ;; label = @3
block ;; label = @4
block ;; label = @5
block ;; label = @6
block ;; label = @7
block ;; label = @8
block ;; label = @9
local.get 0
br_table 0 (;@9;) 1 (;@8;) 2 (;@7;) 3 (;@6;) 4 (;@5;) 5 (;@4;) 6 (;@3;) 7 (;@2;)
end
i64.const 0
local.set 2 ;; i = 0
i64.const 0
local.set 3 ;; sum = 0
i32.const 2
local.set 0
br 7 (;@1;)
end
local.get 2
i64.const 1
i64.add ;; i++
local.set 2
end
local.get 2
i64.const 200
i64.lt_s
i32.eqz
if ;; label = @7
i32.const 4
local.set 0
br 6 (;@1;)
end
end
i64.const 0
local.set 4
i32.const 6
local.set 0
br 4 (;@1;)
end
local.get 1
i64.extend_i32_u
i64.const 8
i64.add
i32.wrap_i64
local.get 3
i64.store
local.get 1
i32.const 8
i32.add
local.tee 1
global.set 0
i32.const 0
return
end
loop $myloop ;; inner for loop start
local.get 4 ;; get j
i64.const 1
i64.add
local.set 5
local.get 3
local.get 4
i64.add
local.set 3
local.get 5
local.set 4
local.get 4
i64.const 10000000
i64.lt_s
br_if $myloop ;; inner for loop end
end
end
local.get 4 ;; j
i64.const 10000000
i64.lt_s ;; j < 10000000
if ;; label = @3
i32.const 5
local.set 0
br 2 (;@1;)
end
i32.const 1
local.set 0
br 1 (;@1;)
end
end
unreachable)
The reason for the use of br_table is to support goroutines in WebAssembly's single threaded environment. The generated code is able to unwind and rewind the WebAssembly stack when goroutines are switched, and so each block of code is accessible via br_table.
The reason for the use of br_table is to support goroutines in WebAssembly's single threaded environment. The generated code is able to unwind and rewind the WebAssembly stack when goroutines are switched, and so each block of code is accessible via br_table.
Thank you for your reply.
Does that mean that the current solution of using br_table is a temporary solution to WebAssembly's single-threaded environment? Currently, the performance of go wasm is poor, is there any optimization plan for the use of br_table?
See huge discussion here on why WebAssembly is designed in such a way: https://github.com/WebAssembly/design/issues/796#issuecomment-401310366
So far this design flaw (my opinion) has not been resolved.
It might be possible to compile the innermost loop with the loop
instruction if it is not preemptible (contains no calls). On the other hand, nonpreemptible loop is not a good thing in general. Maybe it is fine on Wasm as there is no preemption anyway?
Cherry, I had the same thought looking at the code last night. I’m in that code atm working on the wasm32 port, I’ll do a little experimenting to see if we can optimize a call-less loop.
On Fri, Feb 2, 2024 at 7:53 AM cherrymui @.***> wrote:
It might be possible to compile the innermost loop with the loop instruction if it is not preemptible (contains no calls). On the other hand, nonpreemptible loop is not a good thing in general. Maybe it is fine on Wasm as there is no preemption anyway?
— Reply to this email directly, view it on GitHub https://github.com/golang/go/issues/65440#issuecomment-1924151598, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAAAABZCPD7VI5IVLYHLQKTYRUDWJAVCNFSM6AAAAABCV5IEXGVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMYTSMRUGE2TCNJZHA . You are receiving this because you are on a team that was mentioned.Message ID: @.***>
Even if we had threads I don't think this would allow us to rely on the host for scheduling, we need GC and resizable stacks which the host is unlikely to implement like we want it to. (wasm gc still has no support for inner pointers AFAIK)
After looking at WebAssembly/design#796 (really good link btw) I think we could emulate goto
using the newly introduced Tail Call extension, each basic block would be it's own function and we would use tail calls to jump between them. It might still be interesting to use
I don't know details of how wasm JITs are implemented, if this compiles to efficient amd64 code this should still be pretty fast, code locality is likely to be bad however.
Would need to be benchmarked before trying out an implementation. Support is terrible right now, only Firefox and Chrome ship it by default, most other applications support it with experimental flags, wasmer and webkit don't support it at all.
Edit turns out this was already proposed in https://github.com/WebAssembly/design/issues/796#issuecomment-460625886:
If we're still stuck there, and tail calls are moving forward, maybe it'd be worth asking language compilers to still translate to gotos, but as a final step before WebAssembly output, split up the "label blocks" into functions, and convert the gotos into tail calls.
I'm not sure we can use tail calls to emulate goto
s. With goto
(or branch table), it has the same set of local variables before and after the jump. With tail call, the local variables are all dead at the call. So we'd have to promote the local variables to global variables or something alike, which is certainly way more expensive, or pass them all as parameters, which complicates the code generation (and with unclear performance overhead).
@cherrymui I thought we didn't / rarely used local variables. Looking at the code it seems to me we use linear memory as the "stack" and rarely use the wasm stack. As long as we pass the "SP" value through the tail call it should be fine.
I suspect something like https://github.com/WebAssembly/stack-switching would make a lot of these problems go away. In this model, the Go runtime's scheduler would be the parent fiber of all other fibers, and all other fibers would be goroutines. Calling into the scheduler or onto the systemstack then just involves switching to the parent fiber.
We'll still probably need to use linear memory as the data stack, but at least we'll have a much better way to change what is executing.
I do not know what the status of Wasm stack switching is at this point in time.
@Jorropo we use local variables for "registers". The "shadow" stack in linear memory is used for non-registerized local variables like all other architectures. If we don't use "local variables", that essentially behaves like a non-registerized build, which would probably not be very performant.
I agree with @mknyszek that with the stack switching support, we probably don't need to use the top level loop and branch tables.
Go version
go version go1.21.6 linux/arm64
Output of
go env
in your module/workspace:What did you do?
As shown in the following example,
test.go
is compiled into wasm:test.go :
go to wasm compile command:
GOOS=wasip1 GOARCH=wasm go build -o test.wasm test.go
What did you see happen?
As you can see in the wat code, the
for
loop is expressed using thebr_table
operation.wat code:
What did you expect to see?
When the aot compiler of wasm runtime performs backend optimization, it is difficult to identify the
br_table
as afor
loop. So, during the backend optimization, thisfor
loop was not optimized.I tested several of the most popular wasm runtimes, such as wasmtime, wamr, and wasmer, and I found that the performance of go wasm after aot compilation was very poor, and the runtime performance was only 20% of go native in the best case. Why use so many
br_table
operation instead ofloop
operation? Will the performance of go wasm be optimized in the future?Also, I found that the wat code of the go runtime functions uses br_table a lot,the craziest function has 417 hops in the br_table.