golang / go

The Go programming language
https://go.dev
BSD 3-Clause "New" or "Revised" License

syscall/js: increase performance of Call, Invoke, and New by not allowing new slices to escape onto the heap #39740

Closed: finnbear closed this issue 6 months ago

finnbear commented 4 years ago

What version of Go are you using (go version)?

$ go version
go version go1.14.4 linux/amd64

Does this issue reproduce with the latest release?

Yes, in go1.15beta1

What did you do?

I used js.Call, js.New, and js.Invoke with multiple arguments, hundreds of times per frame, 60 frames per second in order to implement my WebGL game.

What did you expect to see?

The above functions are absolutely essential for getting work done in js/wasm, as they are analogous to a simple function call, and they should be optimized as such: no excess Go heap memory allocated, no extra Go garbage collection. After implementing a hack for the issue I describe below, here is a CPU profile of memory allocation: [CPU profile screenshot]

Importantly, memory allocation is 8% of the total time, and that figure does not include makeArgs allocating any slices that end up on the heap.

What did you see instead?

Two or more new slices are allocated on the heap per call to one of the above three functions, because the first few lines of the makeArgs function... https://github.com/golang/go/blob/60f78765022a59725121d3b800268adffe78bde3/src/syscall/js/js.go#L361-L363

...presumably escape to the heap via... https://github.com/golang/go/blob/60f78765022a59725121d3b800268adffe78bde3/src/syscall/js/js.go#L405-L406 ...or valueNew/valueInvoke, or because Go's escape analysis doesn't extend to the caller's stack.
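The escape can be reproduced outside js/wasm: any helper that returns fresh slices into a callee the compiler must treat as retaining its arguments pays a heap allocation per call. A minimal plain-Go sketch, where useRefs and makeRefs are hypothetical stand-ins for valueCall and makeArgs:

```go
package main

import (
	"fmt"
	"testing"
)

var sink []int64

// useRefs stands in for valueCall/valueInvoke/valueNew: the compiler
// must assume the callee retains the slice it receives.
//go:noinline
func useRefs(refs []int64) { sink = refs }

// makeRefs mirrors makeArgs: it returns a fresh slice per call, which
// escapes into useRefs, so every call pays a heap allocation.
func makeRefs(n int) []int64 {
	return make([]int64, n)
}

func main() {
	n := testing.AllocsPerRun(1000, func() {
		useRefs(makeRefs(4))
	})
	fmt.Println(n) // at least one heap allocation per simulated call
}
```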

A CPU profile shows the significant penalty of allocating so many slices (mallocgc now accounts for over 20% of total time), and that doesn't even include the extra garbage collector load, which claims a few FPS: [CPU profile screenshot]

I thought of adding //go:noescape before each of the above (and potentially other functions implemented in JavaScript), but I didn't get any improvement right away. Only my hack to makeArgs worked consistently:

const maxArgs = 9 // A better hack/fix would grow the slices automatically
var argVals = make([]Value, 0, maxArgs)
var argRefs = make([]ref, 0, maxArgs)

func makeArgs(args []interface{}) ([]Value, []ref) {
    for i := range argVals {
        argVals[i] = Value{}
    }
    for i := range argRefs { // I don't think this is strictly necessary
        argRefs[i] = 0
    }

    argVals = argVals[:len(args)]
    argRefs = argRefs[:len(args)]

    for i, arg := range args {
        v := ValueOf(arg)
        argVals[i] = v
        argRefs[i] = v.ref
    }
    return argVals, argRefs
}

Another potential solution, building on the above hack, would be to make makeArgs take slices, to put the values/refs in, as arguments (instead of returning new slices). The caller could deal with efficiently handling the memory. Maybe this could help go's escape analysis, in conjunction with //go:noescape.

Side note: Ideally, a solution would address the heap allocation of an interface for certain arguments. Hundreds of thousands get allocated before the garbage collector runs, causing the garbage collector to take 12-20ms.

cagedmantis commented 4 years ago

/cc @neelance @cherrymui @hajimehoshi

neelance commented 4 years ago

@cherrymui Would this be a good case for sync.Pool? Would it be okay to restrict the maximum number of arguments so we can have a pool of slices with a fixed size?
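A pool-backed makeArgs might look like this sketch, assuming a fixed argument limit as suggested above. Value, ref, maxArgs, and putArgs here are illustrative stand-ins, not the real unexported syscall/js declarations:

```go
package main

import (
	"fmt"
	"sync"
)

// Stand-ins for the unexported syscall/js types (assumed for illustration).
type ref uint64
type Value struct{ ref ref }

const maxArgs = 16 // a fixed upper bound on argument count

type argBufs struct {
	vals [maxArgs]Value
	refs [maxArgs]ref
}

var argPool = sync.Pool{New: func() interface{} { return new(argBufs) }}

// makeArgs borrows fixed-size buffers from the pool instead of allocating
// fresh slices; the caller must return them with putArgs when done.
func makeArgs(n int) (*argBufs, []Value, []ref) {
	b := argPool.Get().(*argBufs)
	return b, b.vals[:n], b.refs[:n]
}

func putArgs(b *argBufs) {
	*b = argBufs{} // clear stale refs so the JS values can be released
	argPool.Put(b)
}

func main() {
	b, vals, refs := makeArgs(3)
	for i := range vals {
		vals[i] = Value{ref: ref(i)}
		refs[i] = vals[i].ref
	}
	fmt.Println(len(vals), len(refs), refs[2])
	putArgs(b)
}
```

After the first few calls, the pool amortizes the buffer allocations away, at the cost of an explicit put step and the fixed maxArgs limit.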

finnbear commented 4 years ago

I've formulated 3 fully complete solutions to the problem:

  1. Using sync.Pool as suggested by @neelance: https://github.com/golang/go/compare/master...finnbear:makeargs-syncpool

My suggestion is for you to:

  1. Investigate and implement solution 3
  2. Investigate and implement solution 2
  3. Try to get //go:noescape to work (I could not)
  4. Investigate and implement solution 1
  5. Do nothing

For what it's worth, I have validated all of the above solutions in my WebAssembly game to get an idea of their performance, and observed no crashing/panicking/issues.

@neelance @cherrymui @hajimehoshi

finnbear commented 3 years ago

I've been looking for a solution to this ever since I filed the issue, and ended up with a patch that solves the problem (both the slice problem and the interface escaping to the heap) for me. Specifically, heap allocations per frame went from ~1,500-40,000 (almost all related to syscall/js) to ~60-300 (which are unrelated to syscall/js). Note that this makes ValueOf unsafe to call except during Call, Invoke, and New; if this were integrated into the standard library, one solution would be to leave ValueOf as-is and add a second, unexported version with the noescape change.

patch -u $GOROOT/src/syscall/js/js.go -i see_below.patch
--- js.go.dist  2021-01-23 15:50:00.931644132 -0800
+++ js.go   2021-01-23 17:31:54.938784167 -0800
@@ -145,6 +145,11 @@
    return valueGlobal
 }

+func noescape(foo interface{}) interface{} {
+   bar := uintptr(unsafe.Pointer(&foo))
+   return *((*interface{})(unsafe.Pointer(bar ^ 0)))
+}
+
 // ValueOf returns x as a JavaScript value:
 //
 //  | Go                     | JavaScript             |
@@ -159,7 +164,8 @@
 //  | map[string]interface{} | new object             |
 //
 // Panics if x is not one of the expected types.
-func ValueOf(x interface{}) Value {
+func ValueOf(a interface{}) Value {
+   x := noescape(a)
    switch x := x.(type) {
    case Value: // should precede Wrapper to avoid a loop
        return x
@@ -215,11 +221,12 @@
            o.Set(k, v)
        }
        return o
-   default:
-       panic("ValueOf: invalid value")
    }
+   runtime.KeepAlive(a)
+   panic("ValueOf: invalid value")
 }

+//go:noescape
 func stringVal(x string) ref

 // Type represents the JavaScript type of a Value.
@@ -303,6 +310,7 @@
    return r
 }

+//go:noescape
 func valueGet(v ref, p string) ref

 // Set sets the JavaScript property p of value v to ValueOf(x).
@@ -317,6 +325,7 @@
    runtime.KeepAlive(xv)
 }

+//go:noescape
 func valueSet(v ref, p string, x ref)

 // Delete deletes the JavaScript property p of value v.
@@ -329,6 +338,7 @@
    runtime.KeepAlive(v)
 }

+//go:noescape
 func valueDelete(v ref, p string)

 // Index returns JavaScript index i of value v.
@@ -342,6 +352,7 @@
    return r
 }

+//go:noescape
 func valueIndex(v ref, i int) ref

 // SetIndex sets the JavaScript index i of value v to ValueOf(x).
@@ -356,11 +367,24 @@
    runtime.KeepAlive(xv)
 }

+//go:noescape
 func valueSetIndex(v ref, i int, x ref)

-func makeArgs(args []interface{}) ([]Value, []ref) {
-   argVals := make([]Value, len(args))
-   argRefs := make([]ref, len(args))
+var (
+   argValsSlice []Value
+   argRefsSlice []ref
+)
+
+func makeArgs(args []interface{}) (argVals []Value, argRefs []ref) {
+   for i, _ := range argValsSlice {
+       argValsSlice[i] = Value{}
+   }
+   if len(args) > cap(argValsSlice) {
+       argValsSlice = make([]Value, 0, len(args))
+       argRefsSlice = make([]ref, 0, len(args))
+   }
+   argVals = argValsSlice[:len(args)]
+   argRefs = argRefsSlice[:len(args)]
    for i, arg := range args {
        v := ValueOf(arg)
        argVals[i] = v
@@ -380,6 +404,7 @@
    return r
 }

+//go:noescape
 func valueLength(v ref) int

 // Call does a JavaScript call to the method m of value v with the given arguments.
@@ -402,6 +427,7 @@
    return makeValue(res)
 }

+//go:noescape
 func valueCall(v ref, m string, args []ref) (ref, bool)

 // Invoke does a JavaScript call of the value v with the given arguments.
@@ -421,6 +447,7 @@
    return makeValue(res)
 }

+//go:noescape
 func valueInvoke(v ref, args []ref) (ref, bool)

 // New uses JavaScript's "new" operator with value v as constructor and the given arguments.
@@ -440,6 +467,7 @@
    return makeValue(res)
 }

+//go:noescape
 func valueNew(v ref, args []ref) (ref, bool)

 func (v Value) isNumber() bool {
@@ -539,8 +567,10 @@
    return string(b)
 }

+//go:noescape
 func valuePrepareString(v ref) (ref, int)

+//go:noescape
 func valueLoadString(v ref, b []byte)

 // InstanceOf reports whether v is an instance of type t according to JavaScript's instanceof operator.
@@ -551,6 +581,7 @@
    return r
 }

+//go:noescape
 func valueInstanceOf(v ref, t ref) bool

 // A ValueError occurs when a Value method is invoked on
@@ -577,6 +608,7 @@
    return n
 }

+//go:noescape
 func copyBytesToGo(dst []byte, src ref) (int, bool)

 // CopyBytesToJS copies bytes from src to dst.
@@ -591,4 +623,5 @@
    return n
 }

+//go:noescape
 func copyBytesToJS(dst ref, src []byte) (int, bool)

hajimehoshi commented 3 years ago

I'm excited about this suggestion and change, @finnbear!

hajimehoshi commented 3 years ago

Issues: I logically came to the conclusion that this is 100% safe but it may not seem that way (in the case of re-entrant recursive calls). It is also not future proof if multiple threads are to be supported.

I believe so too: the argument slice is used only when a callback is invoked from the Go side, and this invoking should never happen recursively. @neelance what do you think?

I don't have any insights about go:noescape, but do we really need them?

neelance commented 3 years ago

The documentation of go:noescape says:

The //go:noescape directive must be followed by a function declaration without a body (meaning that the function has an implementation not written in Go). It specifies that the function does not allow any of the pointers passed as arguments to escape into the heap or into the values returned from the function. This information can be used during the compiler's escape analysis of Go code calling the function.

If I understand this correctly, then //go:noescape makes no difference for func valueIndex(v ref, i int) ref since neither ref nor int are pointers.

For a case like func copyBytesToGo(dst []byte, src ref) (int, bool) it should be fine to apply //go:noescape because the pointer of []byte is only accessed inside of copyBytesToGo and not stored anywhere else.

It might also apply to string parameters.

I don't understand what's going on with x := noescape(a) but my gut feeling is bad. 😄 Please investigate which lines exactly are responsible for moving the value to the heap.

hajimehoshi commented 3 years ago

I was still wondering if we really need //go:noescape if 3. (using a single slice) could work. Or, am I missing something?

finnbear commented 3 years ago

I'm excited about this suggestion and change, @finnbear!

Me too!

I don't have any insights about go:noescape, but do we really need them?

Yes, at a minimum for valueNew, valueCall, valueInvoke, and copyBytesToJS, because those are called quite often in my use case and take slices. Without this directive, for all the compiler knows those functions retain pointers to their arguments, meaning that the arguments (the slices and their contents) must be stored on the heap.

If I understand this correctly, then //go:noescape makes no difference for func valueIndex(v ref, i int) ref since neither ref nor int are pointers.

You're 100% right. I just added it to all functions on the basis that none should cause pointers to escape, but it has no effect when no pointers are involved. It should only be applied where needed, though, so that one should be removed.

I believe so too: the argument slice is used only when a callback is invoked from the Go side, and this invoking should never happen recursively.

To elaborate on this, I think it is safe even if the call is re-entrant. The chronology of events would look like this:

  1. js.Global().Call("foo", 1)
  2. makeArgs sets global slice to [1]
  3. call to valueCall
  4. loadSliceOfValues transfers [1] to a JS copy
  5. call to user function "foo" with the JS copy
  6. user function calls Go function
  7. Go function calls js.Global().Call("bar", 2)
  8. makeArgs sets global slice to [2]
  9. call to valueCall
  10. loadSliceOfValues transfers [2] to a JS copy
  11. call to user function "bar" with the JS copy

I might be missing something but I don't think there is an issue with the global slice approach (as long as wasm/js is single threaded). Also, maybe //go:nosplit could be used to guarantee no weird edge case happens?

For a case like func copyBytesToGo(dst []byte, src ref) (int, bool) it should be fine to apply //go:noescape because the pointer of []byte is only accessed inside of copyBytesToGo and not stored anywhere else.

It might also apply to string parameters.

Yes, and yes.

I don't understand what's going on with x := noescape(a) but my gut feeling is bad. 😄 Please investigate which lines exactly are responsible for moving the value to the heap.

Yeah, I can see why that looks bad.

What it does: The noescape function disassociates the parameter a from the variable x from an escape analysis perspective. Despite the fact that the parameter can escape to the heap (such as in the case Value), it can only do so for the duration that it is on the caller's stack (if called from New, Call, or Invoke).

How it works: XOR-ing a pointer type with 0 doesn't change the pointer, but Go can't track that the pointer is the same.

Why it's needed: First, we need a command that shows us when and why things escape in src/syscall/js

Basic 'does it escape' check: GOOS=js GOARCH=wasm go build -gcflags="-m"

Running this shows that many args variables, as well as the parameter to ValueOf, escape to the heap.

Advanced 'why does it escape' check: GOOS=js GOARCH=wasm go build -gcflags="-m -m"

Running this, it can be determined that there are two ways things escape.

  1. Through functions implemented in JS where Go doesn't know any better, hence //go:noescape
  2. Through ValueOf. It seems like the three cases responsible for this are Value (direct escape), Wrapper (interface/virtual function call, that Go can't know if the pointer is escaped via), and string (JS method stringVal, which may or may not be fixable with //go:noescape there)

How it could be made better/more "safe":

I was still wondering if we really need //go:noescape if 3. (using a single slice) could work. Or, am I missing something?

There are two types of heap allocations, and slices are only one of them. In my original proposal, I didn't propose a way to eliminate the second type: interfaces. As long as the slices contain interface-related pointers, and the ValueOf function goes unchanged, any parameters passed to Call, Invoke, and New will be heap allocated. This includes word-sized values (int64, float64, etc.) and sub-word-sized values (byte, etc.), due to a lack of a compiler optimization (these types are common in a WebGL app). Getting rid of slice allocations is a good, easy first step, but both are issues.

(Note: another possibility for reducing interface allocations is to support pointer parameters, i.e. both *float64 and float64. This would allow the caller to maintain a buffer of interfaces over those pointers and reassign them before passing them to Call, Invoke, New, and ValueOf. This is a solution, but not one I would prefer as the caller.)
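The interface boxing can be demonstrated in plain Go, with a hypothetical store function standing in for the js call boundary:

```go
package main

import (
	"fmt"
	"testing"
)

var sink interface{}

var f = 3.14159 // a package variable, so the compiler cannot treat it as a constant

// store stands in for the syscall/js boundary: once a value crosses an
// interface{} parameter whose use the compiler cannot see through, it
// must be boxed on the heap.
//go:noinline
func store(x interface{}) { sink = x }

func main() {
	n := testing.AllocsPerRun(1000, func() {
		store(f) // boxing the float64 costs a heap allocation per call
	})
	fmt.Println(n)
}
```

This is the second allocation type described above: even a plain float64 argument becomes a heap object the moment it is passed as interface{}.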

I'd be happy to investigate more (how the patch could be made better, safer, and more broadly applicable) and answer other questions, as I want to get something like this into the standard library so that something as basic as a function call doesn't allocate. Also, I've deployed the patch to my game and will continue to monitor it (so far no issues, bearing in mind I know not to call the new ValueOf function anywhere else. I already mentioned how much allocations improved, according to runtime.ReadMemStats).

Thanks @hajimehoshi and @neelance for developing wasm/js and considering this proposal as a way to improve it!

neelance commented 3 years ago

Okay, I think I'm slowly managing to wrap my head around this...

The x := noescape(a) is a hack, I think it is very unlikely for this to go upstream as it can easily have bugs. We should rather aim for a variant of ValueOf(x interface{}) Value with which Go can properly determine that x does not escape. This variant should probably return ref instead of Value.

You have mentioned the three problematic cases:

finnbear commented 3 years ago

@neelance

The x := noescape(a) is a hack

You're not wrong 😉

I think it is very unlikely for this to go upstream as it can easily have bugs

I see what you mean. However, I have read runtime source code and seen similar hacks in performance critical segments like facilitating function calls and garbage collection. If your other idea didn't work, it might be worth consulting with other Go developers to see if the correctness of a solution can be proven despite it using a hack.

This variant should probably return ref instead of Value.

I did not consider that. Sounds good, since ref isn't a pointer.

we can explicitly garbage collect after valueCall

I hope you don't mean calling runtime.GC() on every single valueCall. That would "solve" the allocation while killing performance. I think I understand what you mean: you can explicitly collect temporary refs created for a single function call (by neglecting to do this, I was leaking strings 😢)

An interface value is not mutable, so maybe we can just copy it somehow?

You're right that the interface's pointer is immutable. However, the relevant issue is where that pointer points (heap vs. off-heap). The compiler doesn't really know whether the JSValue implementation of an arbitrary Wrapper type looks like this:

type BadWrapper struct {
    Value js.Value
}

var escapeRoute *BadWrapper

// Implements js.Wrapper
func (this *BadWrapper) JSValue() js.Value {
    escapeRoute = this
    return this.Value
}

so it assumes the worst. This is why a solution specific to Call, Invoke, and New may be warranted where the argument is known to be on the caller's stack (unless Wrapper was changed to return a ref and not a Value interface)

Side note: As far as I know, the only type I use that implements js.Wrapper is js.Func. If Func were used instead of Wrapper in the switch statement, maybe Go could figure out that the receiver is not maliciously escaping the pointer.
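The difference can be sketched in plain Go, using stand-in Value, Wrapper, and Func types (not the real syscall/js declarations): an interface method call forces the receiver's box to the heap, while the concrete equivalent does not:

```go
package main

import (
	"fmt"
	"testing"
)

// Stand-ins for the syscall/js declarations (assumed for illustration).
type Value struct{ ref uint64 }

type Wrapper interface{ JSValue() Value }

type Func struct{ Value } // concrete type whose method the compiler can see

func (f Func) JSValue() Value { return f.Value }

var (
	sinkV Value
	seed  uint64 = 998877665544 // runtime value, not a compile-time constant
)

// Through an interface, the compiler must assume JSValue leaks its
// receiver (it could be a BadWrapper), so the argument is boxed on the heap.
//go:noinline
func viaInterface(w Wrapper) { sinkV = w.JSValue() }

// With the concrete type, escape analysis sees the whole call.
func viaConcrete(f Func) { sinkV = f.JSValue() }

func main() {
	iface := testing.AllocsPerRun(1000, func() {
		viaInterface(Func{Value{ref: seed}})
	})
	concrete := testing.AllocsPerRun(1000, func() {
		viaConcrete(Func{Value{ref: seed}})
	})
	fmt.Println(iface, concrete)
}
```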

Maybe it would suffice to do y := x and then y.JSValue()?

I doubt this will work (I think GC will see right through it) but I can test it.

neelance commented 3 years ago

I doubt this will work (I think GC will see right through it) but I can test it.

Right, it won't work. I thought the problematic pointer was the interface itself, but it is the value that the interface contains.

neelance commented 3 years ago

The only option I see is to get rid of Wrapper...

I would be okay with the tradeoff. syscall/js is a low level package, so performance is more important than ease of use.

neelance commented 3 years ago

@dennwc What's your take on this? You contributed http://golang.org/cl/141644 which we would need to revert.

dennwc commented 3 years ago

I'm all for performance here, so feel free to revert.

That's unfortunate though, since it was quite useful to allow custom objects to be passed to native syscall/js functions. Without it high-level libraries had to wrap every method of syscall/js. This might be fine if performance is at stake here.

Is there any way we can solve this? E.g. make a private interface method on Wrapper, so only syscall/js.Value can implement it, but still allow it to be embedded into a custom type. Would that solve the issue?

finnbear commented 3 years ago

Is there any way we can solve this? E.g. make a private interface method on Wrapper, so only the syscall/js.Value can implement it, but still allow to embed it into custom type. Would it solve an issue?

Let's test it!

As a baseline, with no syscall/js optimizations, my game does 708 allocations per frame. (I am benchmarking in a very steady in-game state, so the numbers will be comparable later. Since I'm testing on a development site, I don't have access to the game state that generated 40,000 allocations per frame, but any improvement that follows should be more than proportional.)

I'll start by applying //go:noescape to stringVal, because without that, we won't get anywhere with ValueOf's parameter x not escaping to the heap.

Here's the optimization output before your idea:

./js.go:162:14: parameter x leaks to {heap} with derefs=0:
./js.go:162:14:   flow: {temp} = x:
./js.go:162:14:   flow: x = {temp}:
./js.go:162:14:     from case Wrapper: return x.JSValue() (switch case) at ./js.go:166:2
./js.go:162:14:   flow: {heap} = x:
./js.go:162:14:     from x.JSValue() (call parameter) at ./js.go:167:19

And here is it after simply un-exporting the Wrapper JS value method:

./js.go:162:14: parameter x leaks to ~r1 with derefs=1:
./js.go:162:14:   flow: {temp} = x:
./js.go:162:14:   flow: x = *{temp}:
./js.go:162:14:     from case Value: return x (switch case) at ./js.go:164:2
./js.go:162:14:   flow: ~r1 = x:
./js.go:162:14:     from return x (return) at ./js.go:165:3 // case Value: return x

As I understand it, ~r1 is the return value, which is far better than the heap. Now the question: is this still sufficient to get the same level of allocation reduction as my earlier hack?

I'll continue testing by applying my makeArgs single-slice patch, to get rid of slice allocations. My game now runs at 336 allocations per frame.

Now I'll //go:noescape all JS-implemented functions that take pointers. Still 336 allocations per frame. Here's the optimization output for Value.New:

./js.go:447:20: parameter args leaks to {heap} with derefs=1:
./js.go:447:20:   flow: args = args:
./js.go:447:20:     from makeArgs(args) (call parameter) at ./js.go:448:30
./js.go:447:20:   flow: {temp} = args:
./js.go:447:20:   flow: arg = *{temp}:
./js.go:447:20:     from for loop (range-deref) at ./js.go:380:16
./js.go:447:20:   flow: x = arg:
./js.go:447:20:     from ValueOf(arg) (call parameter) at ./js.go:381:15
./js.go:447:20:   flow: {temp} = x:
./js.go:447:20:   flow: x = {temp}:
./js.go:447:20:     from case Wrapper: return x.jsValue() (switch case) at ./js.go:166:2
./js.go:447:20:   flow: {heap} = x:
./js.go:447:20:     from x.jsValue() (call parameter) at ./js.go:167:19

So maybe Wrapper is a problem after all... let's try removing it entirely and putting the concrete type Func in the switch case. Wow... only 59 allocations per frame. That's about how many allocations happen outside of syscall/js. And we didn't even have to resort to a hack.

Unfortunately, unless there is something I'm missing or you are willing to use the hack method, getting rid of Wrapper is what is required.

Here are the patches that ended up eliminating the syscall/js allocations: js.patch.txt func.patch.txt

Edit: There is some weirdness going on. Although allocations are definitively at 59 per frame, the optimization output from Value.New still says args escapes to heap:

./js.go:436:20: parameter args leaks to {heap} with derefs=2:
./js.go:436:20:   flow: args = args:
./js.go:436:20:     from makeArgs(args) (call parameter) at ./js.go:437:30
./js.go:436:20:   flow: {temp} = args:
./js.go:436:20:   flow: arg = *{temp}:
./js.go:436:20:     from for loop (range-deref) at ./js.go:369:16
./js.go:436:20:   flow: x = arg:
./js.go:436:20:     from ValueOf(arg) (call parameter) at ./js.go:370:15
./js.go:436:20:   flow: {temp} = x:
./js.go:436:20:   flow: x = *{temp}:
./js.go:436:20:     from case Func: return x.Value (switch case) at ./js.go:155:2
./js.go:436:20:   flow: ~r1 = x:
./js.go:436:20:     from x.Value (dot) at ./js.go:156:11
./js.go:436:20:     from return x.Value (return) at ./js.go:156:3
./js.go:436:20:   flow: v = ~r1:
./js.go:436:20:     from v := ValueOf(arg) (assign) at ./js.go:370:5
./js.go:436:20:   flow: {heap} = v:
./js.go:436:20:     from argVals[i] = v (assign) at ./js.go:371:14

And Invoke, which I call a lot, looks like this (escape analysis also claims args escapes to the heap):

./js.go:416:23: parameter args leaks to {heap} with derefs=2:
./js.go:416:23:   flow: {heap} = **args:
./js.go:416:23:     from makeArgs(args) (call parameter) at ./js.go:417:30

Looks like they both "escape" via my single slice optimization in makeArgs.

There are a few possibilities:

  1. Escape analysis figures out that the arguments don't escape when the functions are compiled into a larger app like my game
  2. There are multiple types of escaping to the heap and some don't cause excessive memory allocations

@neelance @hajimehoshi @dennwc

neelance commented 3 years ago

Unfortunately, unless there is something I'm missing or you are willing to use the hack method, getting rid of Wrapper is what is required.

The "hack" is not a proper solution for upstream, right? Because as far as I can see, the hack is not a workaround for a shortcoming of the compiler. The compiler is actually right that Wrapper can escape, so circumventing this check with a hack means that it can then break at runtime.

Removing Wrapper seems like the only way forward. I believe that this change needs to go through the proposal process. @finnbear Would you mind creating a small proposal for the removal of Wrapper with a summary of our discussion and post it as a new issue?

finnbear commented 3 years ago

The complier is actually right that Wrapper can escape

If we un-exported it, and implemented it in a way that it can't escape, that might not be the case. However, from what I have tried so far, even though it doesn't escape to the heap, the allocations still happen somewhere in the call stack.

The "hack" is not a proper solution for upstream, right?

Not as written, no. It might be able to be adapted into a proper solution though.

Would you mind creating a small proposal for the removal of Wrapper

I would be happy to (after I spend about 30 more minutes trying different combinations of ways to keep Wrapper and eliminate allocations, just to make sure removing it is the only way)!

finnbear commented 3 years ago

@neelance proposal submitted: https://github.com/golang/go/issues/44006

Note that, as part of the proposal, I propose yet another way to solve this very issue. It intentionally makes no effort to make the argument/ref slices global in any way. Here's a snippet:

makeArgs gets replaced with:

func storeArgs(args []interface{}, argValsDst []Value, argRefsDst []ref) {
    for i, arg := range args {
        v := ValueOf(arg)
        argValsDst[i] = v
        argRefsDst[i] = v.ref
    }
}

// This function must be inlined
func makeArgs(args []interface{}) (argVals []Value, argRefs []ref) {
    const maxStackArgs = 16
    if len(args) <= maxStackArgs {
        // Allocated on the stack
        argVals = make([]Value, len(args), maxStackArgs)
        argRefs = make([]ref, len(args), maxStackArgs)
    } else {
        // Allocated on the heap
        argVals = make([]Value, len(args))
        argRefs = make([]ref, len(args))
    }
    return
}

Calls to makeArgs get replaced with:

argVals, argRefs := makeArgs(args)
storeArgs(args, argVals, argRefs)
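The stack-allocation behavior this relies on can be checked outside syscall/js: a non-escaping make with a constant capacity is backed by a fixed-size stack array even when its length is dynamic, provided the function is inlined. A minimal sketch, where makeBuf is a hypothetical analog of the proposed makeArgs:

```go
package main

import (
	"fmt"
	"testing"
)

// makeBuf mirrors the proposed makeArgs: the constant capacity lets the
// compiler back the slice with a fixed-size stack array once makeBuf is
// inlined and the result does not escape.
func makeBuf(n int) []int {
	const maxStack = 16
	if n <= maxStack {
		return make([]int, n, maxStack) // stack-allocated fast path
	}
	return make([]int, n) // heap fallback for oversized argument lists
}

var (
	sum  int
	bigN = 64 // non-constant, so the slow path cannot be stack-allocated
)

func main() {
	small := testing.AllocsPerRun(1000, func() {
		b := makeBuf(8)
		sum += b[7]
	})
	large := testing.AllocsPerRun(1000, func() {
		b := makeBuf(bigN)
		sum += b[bigN-1]
	})
	fmt.Println(small, large)
}
```

The fast path reports zero allocations per run while the oversized path falls back to the heap, which is exactly the trade-off the maxStackArgs constant encodes.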

neelance commented 3 years ago

@finnbear I just looked into your patch some more. The makeArgs/storeArgs solution is really clever. 🎯

Do the //go:noescape on valuePrepareString and valueLoadString make any difference?

finnbear commented 3 years ago

I just looked into your patch some more. The makeArgs/storeArgs solution is really clever.

Thanks! (it's just a workaround for a lack of a way to force the Go compiler to inline an arbitrarily big function)

Do the //go:noescape on valuePrepareString and valueLoadString make any difference?

No, those two don't actually matter. When I originally saw them, I misinterpreted them as converting a Go string to a JS String, but they serve the reverse purpose. Unless syscall/js added a new API to load a JS String into a user's potentially-stack-allocated []byte, //go:noescape is redundant there, since make([]byte, length) with a dynamic length forces a heap allocation. Testing supports this reasoning: removing these //go:noescape's had no effect on my micro-benchmark (even after adding a JS String to Go string operation).

@neelance

neelance commented 2 years ago

@finnbear I also have a hard time producing different results with //go:noescape on stringVal, valueGet, valueSet and valueDelete.

Is there any scenario in which a string pointer could point to the stack? Strings are immutable. If the value is a known constant at compile time, then it will be neither stack nor heap, but constant program data. If it is not known, then the length cannot be known at compile time, and thus it cannot go on the stack and must end up on the heap. Am I missing something?

If my reasoning is true, then //go:noescape would not make any difference for string parameters. Maybe I should ask this on golang-dev...

finnbear commented 2 years ago

Is there any scenario in which a string pointer could point to the stack?

Yes, although I admit it may not be common. //go:noescape on valueGet, valueSet, and valueDelete is the difference between the second and third outputs below when the snippet is compiled (the string is passed as a property name).

// Always one byte, but not a statically known byte (therefore, the string is on the stack)
randStr := string([]byte{byte(60 + rand.Intn(10))})
js.Global().Get(randStr)

// With //go:noescape
./main.go:109:22: string([]byte{...}) does not escape

// Without //go:noescape
./main.go:109:22: string([]byte{...}) escapes to heap

//go:noescape on stringVal also has an effect on optimizer output.

./js.go:341:32: parameter x leaks to {heap} with derefs=1:
./js.go:341:32:   flow: x = x:
./js.go:341:32:     from ValueOf(x) (call parameter) at ./js.go:345:15
./js.go:341:32:   flow: {temp} = x:
./js.go:341:32:   flow: x = *{temp}:
./js.go:341:32:     from case string: return makeValue(stringVal(x)) (switch case) at ./js.go:193:2

All that being said, I could not readily reproduce a difference in reported allocations from enabling //go:noescape in any or all of these cases (although I am using my test script, not my full application). As a result, I'm updating my theory of what causes escaping to the heap in a normal program. Taking a look at the Wrapper case (removed by my proposal), we see:

./js.go:348:32: parameter x leaks to {heap} with derefs=0:
./js.go:348:32:   flow: x = x:
./js.go:348:32:     from ValueOf(x) (call parameter) at ./js.go:352:15
./js.go:348:32:   flow: {temp} = x:
./js.go:348:32:   flow: x = {temp}:
./js.go:348:32:     from case Wrapper: return x.JSValue() (switch case) at ./js.go:161:2
./js.go:348:32:   flow: {heap} = x:
./js.go:348:32:     from x.JSValue() (call parameter) at ./js.go:162:19

The thing I now believe is most relevant is derefs=?. It seems to me (though this is just a guess) that the derefs value reflects the net amount of dereferencing performed on an identifier, and that a value of 0 makes a huge difference for interface{}-based parameters. Specifically, if it is 0, all values passed into the parameter (with the exception of byte-sized ints) will escape:

// ./js.go:348:32: parameter x leaks to {heap} with derefs=0:
./main.go:76:46: rand.Intn(size) escapes to heap:
./main.go:76:46:   flow: {storage for ... argument} = &{storage for rand.Intn(size)}:
./main.go:76:46:     from rand.Intn(size) (spill) at ./main.go:76:46
./main.go:76:46:     from ... argument (slice-literal-element) at ./main.go:76:19
./main.go:76:46:   flow: {heap} = {storage for ... argument}:
./main.go:76:46:     from ... argument (spill) at ./main.go:76:19
./main.go:76:46:     from drawRect.Invoke(... argument...) (call parameter) at ./main.go:76:19

But if there is some non-zero amount of de-referencing, then static analysis can (at least sometimes) avoid heap allocations for some types of parameters. I think this is because the interface{}'s data pointer can still point to a structure on the stack, which could be an int that doesn't escape in any way or a struct that contains a potentially-escaping pointer.

// ./js.go:348:32: parameter x leaks to {heap} with derefs=1:
./main.go:76:78: inlining call to rand.Intn func(int) int { return rand.globalRand.Intn(rand.n) }

So, in summary:

Finally, here's a suggestion. Either:

  1. Apply //go:noescape wherever it is both correct and either theoretically consequential or observably affects optimizer output (run GOOS=js GOARCH=wasm go build -gcflags "-m -m" in src/syscall/js). This will give the Go compiler (current and future versions of it) maximum flexibility to optimize. If needed, document the fact that some of the //go:noescape's matter more than others.
  2. Ask a Go compiler developer for insight into derefs=? in the context of escape analysis, and go from there.
hajimehoshi commented 2 years ago

I'm not sure this is related, but can handleEvent also reuse slices to avoid slice allocations?

https://cs.opensource.google/go/go/+/refs/tags/go1.17.2:src/syscall/js/func.go;l=92-96;drc=828bb0c123af11d21c82eb87b64dfa9af24858c7

    args := make([]Value, argsObj.Length())
    for i := range args {
        args[i] = argsObj.Index(i)
    }
    result := f(this, args)

EDIT: The patch will be like this:

@@ -68,6 +68,8 @@
        setEventHandler(handleEvent)
 }

+var eventArgVals = make([]Value, 0, maxArgs)
+
 func handleEvent() {
        cb := jsGo.Get("_pendingEvent")
        if cb.IsNull() {
@@ -89,10 +91,14 @@

        this := cb.Get("this")
        argsObj := cb.Get("args")
-       args := make([]Value, argsObj.Length())
-       for i := range args {
-               args[i] = argsObj.Index(i)
+       eventArgVals = eventArgVals[:argsObj.Length()]
+       for i := range eventArgVals {
+               eventArgVals[i] = argsObj.Index(i)
        }
-       result := f(this, args)
+       result := f(this, eventArgVals)
        cb.Set("result", result)
+
+       for i := range eventArgVals {
+               eventArgVals[i] = Value{}
+       }
 }
finnbear commented 2 years ago

can handleEvent also reuse slices to avoid slice allocations?

This can be dangerous if handleEvent is re-entrant, which it theoretically can be whenever the Go code causes an immediate JavaScript event. For example, if the Go code calls .click() on a <button> that has an event handler that runs Go again, some browsers will invoke that handler synchronously instead of putting it on the event queue. If handleEvent handles all JS->Go callbacks, that's an even bigger issue.

A safer solution would be more in line with makeArgs/storeArgs, where each call gets its own slice, but slices below a certain length go on the stack (without allocating).

hajimehoshi commented 2 years ago

Ah OK, thanks. My understanding of your latest proposal is that if the capacity is a small constant value (16 in this case), the slice allocation will be on the stack. Is this correct? Does 16 depend on the current implementation?

finnbear commented 2 years ago

My understanding of your latest proposal is that if the capacity is a small constant value (16 in this case), the slice allocation will be on the stack. Is this correct?

Yes, as written, slices with len <= 16 will be allocated on the stack with a slice of cap = 16. Otherwise, the slice will be on the heap with an arbitrary len and cap >= len.

Note: Despite being reliably implemented, in my understanding, this optimization is not guaranteed by the Go specification.

Does 16 depend on the current implementation?

The reason I chose 16 is that it 1) is a power of two and 2) is a realistic upper bound for the number of arguments. If it were increased to 32, that might make some very rare functions allocate less, but would penalize much more common functions. If it were decreased to 4, many functions would need to allocate. I don't think 8 would be too bad, but it would affect https://developer.mozilla.org/en-US/docs/Web/API/WebGLRenderingContext/texImage2D.

Note: There may be a number greater than 16 that would cause the Go compiler to not put the slice on the stack. I didn't attempt to find it.

hajimehoshi commented 2 years ago

Another question: how can we make sure makeArgs is inlined?

hajimehoshi commented 2 years ago

This can be dangerous if handleEvent is re-entrant (it theoretically can be, if the Go code causes a JavaScript immediate event:

What about making a global stack?

var eventArgValsStack []Value

func getEventArgs(argsObj Value) (argVals []Value) {
        l := argsObj.Length()
        if cap(eventArgValsStack)-len(eventArgValsStack) < l {
                eventArgValsStack = append(eventArgValsStack, make([]Value, l)...)
        } else {
                eventArgValsStack = eventArgValsStack[:len(eventArgValsStack)+l]
        }

        argVals = eventArgValsStack[len(eventArgValsStack)-l:]
        for i := range argVals {
                argVals[i] = argsObj.Index(i)
        }
        return
}

func putEventArgs(n int) {
        for i := 0; i < n; i++ {
                eventArgValsStack[len(eventArgValsStack)-i-1] = Value{}
        }
        eventArgValsStack = eventArgValsStack[:len(eventArgValsStack)-n]
}

func handleEvent() {
        cb := jsGo.Get("_pendingEvent")
        if cb.IsNull() {
                return
        }
        jsGo.Set("_pendingEvent", Null())

        id := uint32(cb.Get("id").Int())
        if id == 0 { // zero indicates deadlock
                select {}
        }
        funcsMu.Lock()
        f, ok := funcs[id]
        funcsMu.Unlock()
        if !ok {
                Global().Get("console").Call("error", "call to released function")
                return
        }

        this := cb.Get("this")
        argsObj := cb.Get("args")
        const maxStackArgs = 16
        args := getEventArgs(argsObj)
        defer putEventArgs(argsObj.Length())
        result := f(this, args)
        cb.Set("result", result)
}

I tried your latest suggestion (using a constant capacity whenever possible), but apparently the slice was allocated on the heap... (I've confirmed this by watching runtime.mallocgc and calling throw at certain points.)

finnbear commented 2 years ago

Another question: how can we make sure makeArgs is inlined?

In my understanding, there is no way to force Go to inline something other than in-lining it manually. If you compile with gcflags -m -m (or something similar), it will tell you the reason for its decision (usually related to the "cost" of the function). I split up makeArgs/storeArgs so that makeArgs would be under the "cost" threshold for inlining at the time of writing.

I tried your latest suggestion (using a constant capacity whenever possible) but apparently the slice was allocated on the heap...

I should have realized this, but the args slice (in the context of handleEvent) will always escape to the heap because f is dynamically dispatched. Therefore, I don't think the Go compiler will ever allow it to reside on the stack.

What about making a global stack?

Your "global stack" solution is really clever (though not obviously correct, so I'm assuming it works when you test it). I think it would solve the allocation and re-entrance problems if the client code can be trusted not to mutate the args slice, which could corrupt the arguments of other calls.

hajimehoshi commented 2 years ago

Thanks!

if the client code can be trusted not to mutate the args slice, which could corrupt the arguments of other calls.

How does a function corrupt the args slice for other functions (its callers and/or its callees)?

finnbear commented 2 years ago

How does a function corrupt the args slice for other functions (its callers and/or its callees)?

I was thinking of the case where Go func1, with args slice [ptr=1, 2, 3], induces a JS event that calls Go func2 with args [1, 2, 3, ptr=4, 5, 6]. If func1 were to call append before func2 finished executing, that could overwrite 4, 5, and 6, causing a problem. If func2 is synchronous, this should be impossible, though.

Actually, the root problem is that no Go func would be able to keep (in a new goroutine) or store (globally/on the heap) its args slice after it returns (which would work fine without this optimization).

I'm not too worried about this, but it should be documented to warn users.

hajimehoshi commented 2 years ago

If func1 were to call append before func2 finished executing, that could overwrite 4, 5, and 6, causing a problem. If func2 is synchronous, this should be impossible, though.

Ah, with goroutines? Yeah, that might be in theory possible... Hmm...

Actually, the root problem is that no Go func would be able to keep (in a new goroutine) or store (globally/on the heap) its args slice after it returns (which would work fine without this optimization).

Yeah that's a good point.

I'm not too worried about this, but it should be documented to warn users.

I think it should be fine just to warn users not to corrupt the argument slices.

@neelance What do you think about my 'global stack' proposal? This idea should work for both makeArgs and handleEvent. Fortunately, syscall/js is outside the scope of the Go 1 compatibility policy.

finnbear commented 2 years ago

I think it should be fine just to warn users not to corrupt the argument slices.

I like the warning idea, but that particular warning is not sufficient to avoid issues. Not only can you not mutate the slices beyond their len, but you also cannot store them (on the heap/globally) or use them in a new goroutine. The goroutine restriction, which I believe will be the most common pitfall, is reminiscent of one of the constraints on using an iterator variable like:

items := []string{"first", "second", "third"}

for _, item := range items {
    go func() {
        println(item)
    }()
}

// Output:
// third
// third
// third
hajimehoshi commented 2 years ago

Hm, that's unfortunate... Thank you for pointing this out.

BTW, I think my idea can be applied to makeArgs safely, as the slice is not used on the Go side. Is that correct?

finnbear commented 2 years ago

BTW, I think my idea can be applied to makeArgs safely, as the slice is not used on the Go side. Is that correct?

I think that would work, but consider that...

  1. I don't think that case suffers from re-entrance issues (even if it were re-entrant, the args would have already been copied out of memory on the JS side), so you might even be able to get away with a single shared slice like in my original comment in this issue (see https://github.com/golang/go/issues/39740#issuecomment-766938606).
  2. The slice doesn't escape the stack, so the makeArgs/storeArgs approach does work (at the cost of a tiny bit of stack memory).
  3. makeArgs/storeArgs should easily support multithreaded WASM. A global slice/global stack will need a Mutex.

hajimehoshi commented 2 years ago

I don't think that case suffers from re-entrance issues (even if it was re-entrant, the args would have already been copied out of the memory on the JS side), so you might even be able to get away with a single shared slice like in my original comment in this issue.

If this is re-entrant, the Value slice must be kept on the Go side so it is not GCed until the call finishes, right? Anyway, as long as this is not re-entrant, yes, your 'global slice' idea works.

The slice doesn't escape the stack, so the makeArgs/storeArgs approach does work (at the cost of a tiny bit of stack memory)

Hm? We have found that the slice was not on the stack due to dynamic dispatching, right?

finnbear commented 2 years ago

Hm? We have found that the slice was not on the stack due to dynamic dispatching, right?

Oh sorry, my entire previous comment was in the context of Call, New, Invoke, and makeArgs (no dynamic dispatch) as I thought that was what you were referring to. In the context of handleEvent and (dynamically dispatched) f, none of what I just said is true.

hajimehoshi commented 2 years ago

Oh sorry, my entire previous comment was in the context of Call, New, Invoke, and makeArgs (no dynamic dispatch) as I thought that was what you were referring to. In the context of handleEvent and f, none of what I just said is true.

Sorry for my confusing comment at https://github.com/golang/go/issues/39740#issuecomment-955193617, but I meant that makeArgs/storeArgs didn't work for both Call/New/Invoke and handleEvent. My understanding is that the dynamic dispatching is

    if len(args) <= maxStackArgs {
        // Allocated on the stack
        argVals = make([]Value, len(args), maxStackArgs)
        argRefs = make([]ref, len(args), maxStackArgs)
    } else {
        // Allocated on the heap
        argVals = make([]Value, len(args))
        argRefs = make([]ref, len(args))
    }

and unfortunately the slices are not on the stack in either case. I'll double-check this later.

finnbear commented 2 years ago

the dynamic-dispatching is [if/else]

Which branch is taken is dynamic, but there is no dispatch (calling a function) here. All the code there is known statically and, crucially, inlined into the caller. There is no rule that says a given variable (e.g. argVals) must always point to the stack or always point to the heap. If the Go compiler determines that it is safe to put a certain structure on the stack (e.g. the data for make([]Value, len(args), maxStackArgs)), it is perfectly fine to assign a heap-allocated structure (e.g. the data for make([]Value, len(args))) to the same variable.

Side note: The reason it is safe is that Go can trace the flow of the argVals/argRefs slices through their entire lifetime, and they are only ever passed down the call stack (eventually to JS), as long as makeArgs is inlined. They never have an opportunity to interact with arbitrary user code.

unfortunately the slices are not on the stack in either case

That's not what I'm seeing. It's possible that the Go compiler version matters, though, so I include my version info.

func makeArgs(args []interface{}) (argVals []Value, argRefs []ref) {
    const maxStackArgs = 16
    if len(args) <= maxStackArgs {
        // Allocated on the stack
        argVals = make([]Value, len(args), maxStackArgs)
        argRefs = make([]ref, len(args), maxStackArgs)
    } else {
        // Allocated on the heap
        argVals = make([]Value, len(args))
        argRefs = make([]ref, len(args))
    }
    return
}

func (v Value) Call(m string, args ...interface{}) Value {
    argVals, argRefs := makeArgs(args) // 394
    storeArgs(args, argVals, argRefs)
    res, ok := valueCall(v.ref, m, argRefs)
    ...
}
$ go version
go version go1.16.2 linux/amd64

$ cd go1.16beta1/src/syscall/js  # this just happens to be where I developed the patch
$ GOOS=js GOARCH=wasm go build -gcflags "-m -m"
...
./js.go:394:30: inlining call to makeArgs <---- IMPORTANT
...
./js.go:394:30: make([]Value, len(args), maxStackArgs) does not escape <------ HERE
./js.go:394:30: make([]ref, len(args), maxStackArgs) does not escape
./js.go:394:30: make([]Value, len(args)) escapes to heap
./js.go:394:30: make([]ref, len(args)) escapes to heap
...

and handleEvent

Here, we actually see dynamic dispatch in action (normally, the term dynamic dispatch is used in the context of polymorphism, but I just mean calling a function known only at run time).

f, ok := funcs[id]
...
result := f(this, args)

As far as the Go compiler is concerned, f could take args and store it on the heap and/or in global variables, that outlive the current function's stack frame. Therefore, the data backing args must not be on the stack.

hajimehoshi commented 2 years ago

I just want to leave a comment that my 'global stack' suggestion might not work with multiple goroutines (without a mutex).

hajimehoshi commented 2 years ago

That's not what I'm seeing. It's possible that the Go compiler version matters, though, so I include my version info.

OK I'll double-check this today... Thanks! (Sorry for my terribly late comment)

hajimehoshi commented 2 years ago

@finnbear

.../js.patched.go:412:17: make([]Value, len(args), maxStackArgs) escapes to heap
.../js.patched.go:413:17: make([]ref, len(args), maxStackArgs) escapes to heap

Unfortunately, with Go 1.17.2 (darwin), these values seem to escape.

EDIT: makeArgs is inlined like inlining call to makeArgs

hajimehoshi commented 2 years ago

With -m -m:

.../js.patched.go:417:17: make([]Value, len(args)) escapes to heap:
.../js.patched.go:417:17:   flow: {heap} = &{storage for make([]Value, len(args))}:
.../js.patched.go:417:17:     from make([]Value, len(args)) (non-constant size) at .../js.patched.go:417:17
.../js.patched.go:418:17: make([]ref, len(args)) escapes to heap:
.../js.patched.go:418:17:   flow: {heap} = &{storage for make([]ref, len(args))}:
.../js.patched.go:418:17:     from make([]ref, len(args)) (non-constant size) at .../js.patched.go:418:17
.../js.patched.go:414:17: make([]ref, len(args), maxStackArgs) escapes to heap:
.../js.patched.go:414:17:   flow: argRefs = &{storage for make([]ref, len(args), maxStackArgs)}:
.../js.patched.go:414:17:     from make([]ref, len(args), maxStackArgs) (spill) at .../js.patched.go:414:17
.../js.patched.go:414:17:     from argRefs = make([]ref, len(args), maxStackArgs) (assign) at .../js.patched.go:414:11
.../js.patched.go:413:17: make([]Value, len(args), maxStackArgs) escapes to heap:
.../js.patched.go:413:17:   flow: argVals = &{storage for make([]Value, len(args), maxStackArgs)}:
.../js.patched.go:413:17:     from make([]Value, len(args), maxStackArgs) (spill) at .../js.patched.go:413:17
.../js.patched.go:413:17:     from argVals = make([]Value, len(args), maxStackArgs) (assign) at .../js.patched.go:413:11
finnbear commented 2 years ago

@finnbear

.../js.patched.go:412:17: make([]Value, len(args), maxStackArgs) escapes to heap
.../js.patched.go:413:17: make([]ref, len(args), maxStackArgs) escapes to heap

Unfortunately, with Go 1.17.2 (darwin), these values seem to escape.

EDIT: makeArgs is inlined like inlining call to makeArgs

I'd like to investigate this, but just so we are on the same page, can you ZIP and attach your entire src/syscall/js directory? That way, I can figure out whether the issue is a difference in code or a difference in compiler version. It also makes it possible to compare line numbers with your compiler output (I see the escape reason "spill", which shouldn't be happening).

hajimehoshi commented 2 years ago

I'm using -overlay so I replaced only js.go. I'll send it to you soon.

hajimehoshi commented 2 years ago

@finnbear Here you are: js.go.zip

hajimehoshi commented 2 years ago

By inlining makeArgs by hand, both argVals and argRefs end up on the stack (of course, this also requires //go:noescape on valueCall and so on).

finnbear commented 2 years ago

I downloaded go1.17.2 and inserted your js.go, and I think I can see the misunderstanding.

  1. All the slices do escape the stack frame of the standalone makeArgs (which is fine, since makeArgs is never called on its own)
  2. But, crucially, when makeArgs is (automatically) inlined into Call, New, or Invoke, the cap=16 slices do not escape from Call's, New's, or Invoke's stack frames.

As you can see below, the first two lines of output are what you observed, but the remaining lines tell the full story.

~/sdk/go1.17.2/src/syscall/js$ GOOS=js GOARCH=wasm ~/go/bin/go1.17.2 build -gcflags "-m -m"

...

# line 412:         argVals = make([]Value, len(args), maxStackArgs) in makeArgs
./js.go:412:17: make([]Value, len(args), maxStackArgs) escapes to heap
# line 413:     argRefs = make([]ref, len(args), maxStackArgs) in makeArgs
./js.go:413:17: make([]ref, len(args), maxStackArgs) escapes to heap
# line 416      argVals = make([]Value, len(args)) in makeArgs
./js.go:416:17: make([]Value, len(args)) escapes to heap
# line 417      argRefs = make([]ref, len(args)) in makeArgs
./js.go:417:17: make([]ref, len(args)) escapes to heap

...

# line 434: argVals, argRefs := makeArgs(args) in Call
./js.go:434:30: make([]Value, len(args), maxStackArgs) does not escape
./js.go:434:30: make([]ref, len(args), maxStackArgs) does not escape
./js.go:434:30: make([]Value, len(args)) escapes to heap
./js.go:434:30: make([]ref, len(args)) escapes to heap

...

# line 459:     argVals, argRefs := makeArgs(args) in Invoke
./js.go:459:30: make([]Value, len(args), maxStackArgs) does not escape
./js.go:459:30: make([]ref, len(args), maxStackArgs) does not escape
./js.go:459:30: make([]Value, len(args)) escapes to heap
./js.go:459:30: make([]ref, len(args)) escapes to heap

...

# line 481: argVals, argRefs := makeArgs(args) in New
./js.go:481:30: make([]Value, len(args), maxStackArgs) does not escape
./js.go:481:30: make([]ref, len(args), maxStackArgs) does not escape
./js.go:481:30: make([]Value, len(args)) escapes to heap
./js.go:481:30: make([]ref, len(args)) escapes to heap

By inlining makeArgs by hand, both argVals and argRefs end up on the stack (of course, this also requires //go:noescape on valueCall and so on)

Manually inlining will disambiguate the optimizer output, but it is not necessary, as the automatically-inlined versions (in Call, New, and Invoke) are the only ones that are ever called.

@hajimehoshi

hajimehoshi commented 2 years ago
./js.go:434:30: make([]Value, len(args), maxStackArgs) does not escape
./js.go:434:30: make([]ref, len(args), maxStackArgs) does not escape
./js.go:434:30: make([]Value, len(args)) escapes to heap
./js.go:434:30: make([]ref, len(args)) escapes to heap

I've confirmed this and so on. Thanks!

I think it'd be very nice to have this change in the official syscall/js. Do you have a plan to send a change?

@finnbear