cmd/compile: performance of go wasm is very poor

liutao-liu commented 9 months ago

Go version

go version go1.21.6 linux/arm64

Output of `go env` in your module/workspace:

GO111MODULE=''
GOARCH='arm64'
GOBIN=''
GOEXE=''
GOEXPERIMENT=''
GOFLAGS=''
GOHOSTARCH='arm64'
GOHOSTOS='linux'
GOINSECURE=''
GONOPROXY=''
GONOSUMDB=''
GOOS='linux'
GOPRIVATE=''
GOSUMDB='off'
GOTMPDIR=''
GOTOOLCHAIN='auto'
GOVCS=''
GOVERSION='go1.21.6'
GCCGO='gccgo'
AR='ar'
CC='gcc'
CXX='g++'
CGO_ENABLED='1'
GOWORK=''
CGO_CFLAGS='-O2 -g'
CGO_CPPFLAGS=''
CGO_CXXFLAGS='-O2 -g'
CGO_FFLAGS='-O2 -g'
CGO_LDFLAGS='-O2 -g'
PKG_CONFIG='pkg-config'
GOGCCFLAGS='-fPIC -pthread -Wl,--no-gc-sections -fmessage-length=0 -ffile-prefix-map=/tmp/go-build3484327849=/tmp/go-build -gno-record-gcc-switches'

What did you do?

As shown in the following example,test.go is compiled into wasm:

test.go :

//go:noinline
func testFor() int {
    sum := 0
    for i := 0; i < 200; i++ {
        for j := 0; j < 10000000; j++ {
            sum += j
        }
    }
    return sum
}

func main() {
    startTime := time.Now()

    res := testFor()
    elapsed := time.Since(startTime)
    fmt.Println("Done. Cost: ", elapsed, res)
}

go to wasm compile command: GOOS=wasip1 GOARCH=wasm go build -o test.wasm test.go

What did you see happen?

As you can see in the wat code, the for loop is expressed using the br_table operation.

wat code:

  (func $main.testFor (type 0) (param i32) (result i32)
    (local i32 i64 i64 i64 i64)
    global.get 0
    local.set 1
    loop  ;; label = @1
      block  ;; label = @2
        block  ;; label = @3
          block  ;; label = @4
            block  ;; label = @5
              block  ;; label = @6
                block  ;; label = @7
                  block  ;; label = @8
                    block  ;; label = @9
                      local.get 0
                      br_table 0 (;@9;) 1 (;@8;) 2 (;@7;) 3 (;@6;) 4 (;@5;) 5 (;@4;) 6 (;@3;) 7 (;@2;)
                    end
                    i64.const 0
                    local.set 2
                    i64.const 0
                    local.set 3
                    i32.const 2
                    local.set 0
                    br 7 (;@1;)
                  end
                  local.get 2
                  i64.const 1
                  i64.add
                  local.set 2
                end
                local.get 2
                i64.const 200
                i64.lt_s
                i32.eqz
                if  ;; label = @7
                  i32.const 4
                  local.set 0
                  br 6 (;@1;)
                end
              end
              i64.const 0
              local.set 4
              i32.const 6
              local.set 0
              br 4 (;@1;)
            end
            local.get 1
            i64.extend_i32_u
            i64.const 8
            i64.add
            i32.wrap_i64
            local.get 3
            i64.store
            local.get 1
            i32.const 8
            i32.add
            local.tee 1
            global.set 0
            i32.const 0
            return
          end
          local.get 4
          i64.const 1
          i64.add
          local.set 5
          local.get 3
          local.get 4
          i64.add
          local.set 3
          local.get 5
          local.set 4
        end
        local.get 4
        i64.const 10000000
        i64.lt_s
        if  ;; label = @3
          i32.const 5
          local.set 0
          br 2 (;@1;)
        end
        i32.const 1
        local.set 0
        br 1 (;@1;)
      end
    end
    unreachable)

What did you expect to see?

When the aot compiler of wasm runtime performs backend optimization, it is difficult to identify the br_table as a for loop. So, during the backend optimization, this for loop was not optimized.

I tested several of the most popular wasm runtimes, such as wasmtime, wamr, and wasmer, and I found that the performance of go wasm after aot compilation was very poor, and the runtime performance was only 20% of go native in the best case. Why use so many br_table operation instead of loop operation? Will the performance of go wasm be optimized in the future?

Also, I found that the wat code of the go runtime functions uses br_table a lot,the craziest function has 417 hops in the br_table.

mauri870 commented 9 months ago

cc @golang/wasm @golang/compiler

liutao-liu commented 9 months ago

I manually modified the wat code to express the inner for loop using the loop operation, and the performance after aot compiled increased by 500%. (Added extra tests to make sure the function is working correctly.)

(func $main.testFor (type 0) (param i32) (result i32)
    (local i32 i64 i64 i64 i64)
    global.get 0
    local.set 1
    loop  ;; label = @1
      block  ;; label = @2
        block  ;; label = @3
          block  ;; label = @4
            block  ;; label = @5
              block  ;; label = @6
                block  ;; label = @7
                  block  ;; label = @8
                    block  ;; label = @9
                      local.get 0
                      br_table 0 (;@9;) 1 (;@8;) 2 (;@7;) 3 (;@6;) 4 (;@5;) 5 (;@4;) 6 (;@3;) 7 (;@2;)
                    end
                    i64.const 0
                    local.set 2 ;; i = 0
                    i64.const 0 
                    local.set 3 ;; sum = 0
                    i32.const 2 
                    local.set 0 
                    br 7 (;@1;)
                  end
                  local.get 2
                  i64.const 1
                  i64.add      ;; i++
                  local.set 2
                end
                local.get 2   
                i64.const 200
                i64.lt_s  
                i32.eqz
                if  ;; label = @7  
                  i32.const 4
                  local.set 0
                  br 6 (;@1;)  
                end
              end
              i64.const 0
              local.set 4 
              i32.const 6
              local.set 0
              br 4 (;@1;) 
            end
            local.get 1    
            i64.extend_i32_u
            i64.const 8
            i64.add
            i32.wrap_i64
            local.get 3  
            i64.store    
            local.get 1
            i32.const 8
            i32.add     
            local.tee 1
            global.set 0
            i32.const 0
            return
          end
            loop $myloop ;; inner for loop start
            local.get 4  ;; get j
            i64.const 1
            i64.add     
            local.set 5 
            local.get 3 
            local.get 4 
            i64.add  
            local.set 3 
            local.get 5
            local.set 4 
            local.get 4   
            i64.const 10000000
            i64.lt_s        
            br_if $myloop ;; inner for loop end
          end
        end
        local.get 4   ;; j
        i64.const 10000000
        i64.lt_s            ;; j < 10000000
        if  ;; label = @3
          i32.const 5
          local.set 0
          br 2 (;@1;)  
        end
        i32.const 1
        local.set 0
        br 1 (;@1;)
      end
    end
    unreachable)

evanphx commented 9 months ago

The reason for the use of br_table is to support goroutines in WebAssembly's single threaded environment. The generated code is able to unwind and rewind the WebAssembly stack when goroutines are switched, and so each block of code is accessible via br_table.

liutao-liu commented 9 months ago

The reason for the use of br_table is to support goroutines in WebAssembly's single threaded environment. The generated code is able to unwind and rewind the WebAssembly stack when goroutines are switched, and so each block of code is accessible via br_table.

Thank you for your reply.

Does that mean that the current solution of using br_table is a temporary solution to WebAssembly's single-threaded environment? Currently, the performance of go wasm is poor, is there any optimization plan for the use of br_table?

neelance commented 9 months ago

See huge discussion here on why WebAssembly is designed in such a way: https://github.com/WebAssembly/design/issues/796#issuecomment-401310366

So far this design flaw (my opinion) has not been resolved.

cherrymui commented 9 months ago

It might be possible to compile the innermost loop with the loop instruction if it is not preemptible (contains no calls). On the other hand, nonpreemptible loop is not a good thing in general. Maybe it is fine on Wasm as there is no preemption anyway?

evanphx commented 9 months ago

Cherry, I had the same thought looking at the code last night. I’m in that code atm working on the wasm32 port, I’ll do a little experimenting to see if we can optimize a call-less loop.

On Fri, Feb 2, 2024 at 7:53 AM cherrymui @.***> wrote:

It might be possible to compile the innermost loop with the loop instruction if it is not preemptible (contains no calls). On the other hand, nonpreemptible loop is not a good thing in general. Maybe it is fine on Wasm as there is no preemption anyway?

— Reply to this email directly, view it on GitHub https://github.com/golang/go/issues/65440#issuecomment-1924151598, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAAAABZCPD7VI5IVLYHLQKTYRUDWJAVCNFSM6AAAAABCV5IEXGVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMYTSMRUGE2TCNJZHA . You are receiving this because you are on a team that was mentioned.Message ID: @.***>

Jorropo commented 9 months ago

Even if we had threads I don't think this would allow us to rely on the host for scheduling, we need GC and resizable stacks which the host is unlikely to implement like we want it to. (wasm gc still has no support for inner pointers AFAIK)

Jorropo commented 9 months ago

After looking at WebAssembly/design#796 (really good link btw) I think we could emulate goto using the newly introduced Tail Call extension, each basic block would be it's own function and we would use tail calls to jump between them. It might still be interesting to use I don't know details of how wasm JITs are implemented, if this compiles to efficient amd64 code this should still be pretty fast, code locality is likely to be bad however.

Would need to be benchmarked before trying out an implementation. Support is terrible right now, only Firefox and Chrome ship it by default, most other applications support it with experimental flags, wasmer and webkit don't support it at all.

Edit turns out this was already proposed in https://github.com/WebAssembly/design/issues/796#issuecomment-460625886:

If we're still stuck there, and tail calls are moving forward, maybe it'd be worth asking language compilers to still translate to gotos, but as a final step before WebAssembly output, split up the "label blocks" into functions, and convert the gotos into tail calls.

cherrymui commented 9 months ago

I'm not sure we can use tail calls to emulate gotos. With goto (or branch table), it has the same set of local variables before and after the jump. With tail call, the local variables are all dead at the call. So we'd have to promote the local variables to global variables or something alike, which is certainly way more expensive, or pass them all as parameters, which complicates the code generation (and with unclear performance overhead).

Jorropo commented 9 months ago

@cherrymui I thought we didn't / rarely used local variables. Looking at the code it seems to me we use linear memory as the "stack" and rarely use the wasm stack. As long as we pass the "SP" value through the tail call it should be fine.

mknyszek commented 9 months ago

I suspect something like https://github.com/WebAssembly/stack-switching would make a lot of these problems go away. In this model, the Go runtime's scheduler would be the parent fiber of all other fibers, and all other fibers would be goroutines. Calling into the scheduler or onto the systemstack then just involves switching to the parent fiber.

We'll still probably need to use linear memory as the data stack, but at least we'll have a much better way to change what is executing.

I do not know what the status of Wasm stack switching is at this point in time.

cherrymui commented 9 months ago

@Jorropo we use local variables for "registers". The "shadow" stack in linear memory is used for non-registerized local variables like all other architectures. If we don't use "local variables", that essentially behaves like a non-registerized build, which would probably not be very performant.

I agree with @mknyszek that with the stack switching support, we probably don't need to use the top level loop and branch tables.

golang / go