Closed: stuartcarnie closed this pull request 7 years ago
This also addresses issue #43.
The Large benchmark now caps out at 512k.
This gist shows real-world numbers streaming a lot of data between two nodes.
tl;dr
I will add an upper message type constant, so that the if validating the range won't need changing; that is how this issue is typically addressed.
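As a rough sketch of that pattern (assuming the message type constants form a contiguous iota block; typeMax and validMsgType are illustrative names, not identifiers from the yamux source), an upper-bound constant at the end of the block means the range check never has to change when a new message type is added before it:

const (
	typeData uint8 = iota
	typeWindowUpdate
	typeGoAway
	typePing
	typeMax // always last: one past the highest valid message type
)

// validMsgType is a hypothetical helper; adding a new message type before
// typeMax keeps this check correct without editing it.
func validMsgType(mt uint8) bool {
	return mt < typeMax
}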
Here is the original code, where you will notice the call to newobject, because the compiler is capturing s to make handler:
431 . . // Switch on the type
432 . . switch hdr.MsgType() {
433 . . case typeData:
434 3MB 3MB handler = s.handleStreamMessage
. . 1109973: MOVQ AX, 0(SP) session.go:434
3MB 3MB 1109977: CALL runtime.newobject(SB) session.go:434
. . 110997c: MOVQ 0x8(SP), AX session.go:434
. . 1109981: LEAQ 0x11f98(IP), CX session.go:434
. . 1109988: MOVQ CX, 0(AX) session.go:434
. . 110998b: TESTB AL, 0(AX) session.go:434
. . 110998d: MOVL 0x14a4dd(IP), DX session.go:434
. . 1109993: LEAQ 0x8(AX), BX session.go:434
. . 1109997: TESTL DX, DX session.go:434
. . 1109999: JNE 0x1109a05 session.go:434
. . 110999b: MOVQ 0xb8(SP), BX session.go:434
. . 11099a3: MOVQ BX, 0x8(AX) session.go:434
. . 11099a7: MOVQ 0x50(SP), CX session.go:434
. . 1109a0d: MOVQ BX, 0(SP) session.go:434
. . 1109a11: MOVQ 0xb8(SP), AX session.go:434
. . 1109a19: MOVQ AX, 0x8(SP) session.go:434
. . 1109a1e: CALL runtime.writebarrierptr(SB) session.go:434
. . 1109a23: MOVQ 0x80(SP), AX session.go:434
. . 1109a2b: LEAQ 0x11eee(IP), CX session.go:434
. . 1109a32: MOVQ 0xb8(SP), BX session.go:434
. . 1109a3a: JMP 0x11099a7 session.go:434
. . 1109a3f: LEAQ 0x3d0da(IP), AX session.go:434
I think I found the reason it allocates, and your change will still make it allocate:
package main

import (
	"testing"
)

type X struct {
}

func (x *X) switchFunc(a *int) {
}

func BenchmarkSwitch(b *testing.B) {
	x := X{}
	handlers := []func(*X, *int){
		0: (*X).switchFunc,
	}
	for i := 0; i < b.N; i++ {
		a := 1
		x.switchFunc(&a) // Doesn't allocate because it can do proper escape analysis because it knows what switchFunc does.
		h := x.switchFunc
		h(nil) // still doesn't allocate because nil.
		h(&a)  // Allocates because it can't do proper escape analysis for 'a' because it doesn't know what code h executes.
		j := handlers[0]
		j(&x, nil) // Doesn't allocate because nil.
		j(&x, &a)  // Still allocates for the same reason above.
	}
}
So the only way to write this so it does not allocate is to write it out fully:
switch hdr.MsgType() {
case typeData:
	if err := s.handleStreamMessage(hdr); err != nil {
		return err
	}
case typeWindowUpdate:
	if err := s.handleStreamMessage(hdr); err != nil {
		return err
	}
case typeGoAway:
	if err := s.handleGoAway(hdr); err != nil {
		return err
	}
case typePing:
	if err := s.handlePing(hdr); err != nil {
		return err
	}
default:
	return ErrInvalidMsgType
}
@erikdubbelboer it is not allocating because of a value passed to the function; it is allocating because the original logic was creating a closure, capturing s. That is what I demonstrated in the assembly dump. If you look at the assembly for the new code, there is no newobject call:
if err := handlers[mt](s, hdr); err != nil {
0x1109a80 488b35995a1200 MOVQ 0x125a99(IP), SI
0x1109a87 488b3d9a5a1200 MOVQ 0x125a9a(IP), DI
// bounds check of handlers slice
0x1109a8e 0fb6db MOVZX BL, BX
0x1109a91 4839fb CMPQ DI, BX
// and jump to `panicindex` if the CMPQ shows BX is outside the range
0x1109a94 0f83e5020000 JAE 0x1109d7f
// find address of function pointer in handlers array for mt; SI contains address of handlers slice
0x1109a9a 488b14de MOVQ 0(SI)(BX*8), DX
0x1109a9e 488b9c2498000000 MOVQ 0x98(SP), BX
// load hdr slice onto stack
0x1109aa6 48891c24 MOVQ BX, 0(SP)
0x1109aaa 48894c2408 MOVQ CX, 0x8(SP)
0x1109aaf 4889442410 MOVQ AX, 0x10(SP)
0x1109ab4 488b742448 MOVQ 0x48(SP), SI
0x1109ab9 4889742418 MOVQ SI, 0x18(SP)
// load function pointer of Stream handler into DI
0x1109abe 488b3a MOVQ 0(DX), DI
// call handler
0x1109ac1 ffd7 CALL DI
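For reference, here is a small self-contained sketch of the table-driven dispatch shape that this assembly corresponds to; the session and header types, handler names, and message type values below are placeholders for illustration, not the actual yamux code. The key point is that the table holds method expressions, which take the receiver as an explicit first argument, so nothing is captured and no closure object is allocated:

package main

import "fmt"

type session struct{ name string }

type header []byte

func (s *session) handleData(hdr header) error {
	fmt.Println(s.name, "data", len(hdr))
	return nil
}

func (s *session) handlePing(hdr header) error {
	fmt.Println(s.name, "ping", len(hdr))
	return nil
}

// Method expressions: no receiver is captured, so building and indexing this
// table never allocates a closure.
var handlers = []func(*session, header) error{
	0: (*session).handleData,
	1: (*session).handlePing,
}

func dispatch(s *session, mt uint8, hdr header) error {
	if int(mt) >= len(handlers) {
		return fmt.Errorf("invalid message type %d", mt)
	}
	return handlers[mt](s, hdr)
}

func main() {
	s := &session{name: "s1"}
	_ = dispatch(s, 0, header{0x01, 0x02})
}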
Why would it create a closure? That's not how closures work as far as I know. A closure is when the code of the function is defined there as well, so it can use the surrounding scope:
func a() {
	b := 1
	c := func() { // c is now a closure that captures b.
		println(b)
	}
	c() // call it so the example compiles (an unused variable is a compile error)
}
I'm 99% sure that your version still allocates, because the compiler can't know what the code will do with hdr, so it needs to allocate it on the heap. The allocation probably just happens before the dump you posted. The reason it happens inside the case in the original code might be because the default case returns immediately and doesn't use hdr.
Try my code with go test -bench . -benchmem and uncomment the different variations:
package main

import (
	"testing"
)

var g []byte

type X struct {
}

func (x *X) switchFunc(a []byte) {
	//g = a // Uncommenting this will always make it allocate as it always escapes.
}

func BenchmarkSwitch(b *testing.B) {
	x := &X{}
	handlers := []func(*X, []byte){
		0: (*X).switchFunc,
	}
	_ = handlers
	for i := 0; i < b.N; i++ {
		a := make([]byte, 12)
		x.switchFunc(a) // Doesn't allocate because it can do proper escape analysis because it knows what switchFunc does.
		h := x.switchFunc
		h(nil) // still doesn't allocate because nil.
		//h(a) // Allocates because it can't do proper escape analysis for 'a' because it doesn't know what code h executes.
		j := handlers[0]
		j(x, nil) // Doesn't allocate because nil.
		//j(x, a) // Still allocates for the same reason above.
		//handlers[0](x, a) // Still allocates for the same reason above.

		// Doesn't allocate because nil, even though it's a closure. The Go compiler is smart.
		k := func() {
			x.switchFunc(nil)
		}
		k()

		// This allocates because it allocates a closure? Or the closure prevents escape analysis?
		//l := func() {
		//	x.switchFunc(a)
		//}
		//l()
	}
}
It appears to be a problem with escape analysis. The following will allocate memory in the methodValue function. Note the placement of the handler variable, outside the for loop.
package methods

import (
	"fmt"
	"math/rand"
	"testing"
)

type foo struct {
	s string
}

func (f *foo) handle1(v string) bool {
	fmt.Println(v)
	return true
}

func (f *foo) handle2(v string) bool {
	fmt.Println(f.s + v)
	return true
}

func (f *foo) methodValue(i int) {
	var handler func(string) bool
	for {
		switch i {
		case 0:
			handler = f.handle1
		case 1:
			handler = f.handle2
		default:
			panic("bad")
		}
		handler("hello")
		return
	}
}

func TestFooMethodValue(t *testing.T) {
	f := &foo{}
	f.methodValue(rand.Int() & 1)
}
Move the handler variable inside the for loop, and it no longer allocates.
func (f *foo) methodValue(i int) {
	for {
		var handler func(string) bool
		switch i {
		case 0:
			handler = f.handle1
		case 1:
			handler = f.handle2
		default:
			panic("bad")
		}
		handler("hello")
		return
	}
}
Running go build -gcflags '-m' demonstrates that escape analysis is failing when capturing the method value. Note that the explanation of a method value below, quoted from the Go specification, describes the saved copy of the receiver I was referring to (the newobject call in the disassembly).
If the expression x has static type T and M is in the method set of type T, x.M is called a method value. The method value x.M is a function value that is callable with the same arguments as a method call of x.M. The expression x is evaluated and saved during the evaluation of the method value; the saved copy is then used as the receiver in any calls, which may be executed later.
Escape analysis failed to determine that this copy did not escape, hence the newobject call on these lines.
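To restate the quoted behaviour concretely with the foo type from the example above (this is just an illustration of the spec text, in the same package as foo; it adds no new analysis):

func methodValueCopy(f *foo) {
	h := f.handle1 // f is evaluated and a copy is saved here: this is the method value
	h("hello")     // the saved copy of f is the receiver for this (possibly later) call
}

If escape analysis cannot prove that the saved copy stays local, it is heap-allocated, which is the runtime.newobject call seen in the disassembly.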
Here is the output of the escape analysis when the handler variable is outside the for loop, demonstrating that the compiler is creating a copy of the method value.
) go build -gcflags '-m' ./cmd/methods
# github.com/stuartcarnie/play/cmd/methods
... lines removed for brevity
cmd/methods/method.go:22:29: leaking param: f
cmd/methods/method.go:27:15: f.handle1 escapes to heap <- here capturing the method value of foo
cmd/methods/method.go:30:15: f.handle2 escapes to heap <- here capturing the method value of foo
cmd/methods/method.go:42:12: &foo literal escapes to heap
... lines removed for brevity
After I move the handler var into the loop, it no longer escapes:
) go build -gcflags '-m' ./cmd/methods
# github.com/stuartcarnie/play/cmd/methods
... lines removed for brevity
cmd/methods/method.go:22:29: leaking param: f
cmd/methods/method.go:27:15: (*foo).methodValue f.handle1 does not escape
cmd/methods/method.go:30:15: (*foo).methodValue f.handle2 does not escape
cmd/methods/method.go:42:12: &foo literal escapes to heap
... lines removed for brevity
I see you were right; I was looking at the wrong allocation. Your way indeed saves one allocation, but I still think writing out the full switch is better. Your new solution still suffers from the issue that the compiler can't do proper escape analysis on hdr, making it always escape. I haven't looked at the code to see whether other functions make hdr escape anyway. But I think having a couple of lines less code is not worth it when it prevents escape analysis from working properly. (The original code also prevented proper escape analysis on hdr.)
In this case we want hdr allocated on the heap so it is reused for each iteration of the loop, to avoid unnecessary allocations :)
If it's on the stack it can still be reused, since the frame for recvLoop will stay on the stack. But I guess it doesn't matter that much, since it's only one allocation per connection.
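A simplified, self-contained sketch of the reuse being discussed (this is not the actual yamux recvLoop; the reader argument, dispatch callback, and 12-byte header size are assumptions for illustration): the header buffer is allocated once per connection, and every frame read in the loop re-fills the same buffer, whether it ends up on the heap or the stack.

package sketch

import "io"

// recvLoopSketch reads fixed-size headers into a single reusable buffer and
// hands each one to dispatch. hdr is allocated once for the lifetime of the
// loop; wherever it lives, it is reused on every iteration.
func recvLoopSketch(r io.Reader, dispatch func(hdr []byte) error) error {
	hdr := make([]byte, 12) // one allocation per connection
	for {
		if _, err := io.ReadFull(r, hdr); err != nil {
			return err
		}
		if err := dispatch(hdr); err != nil {
			return err
		}
	}
}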
@stuartcarnie Thanks for the patch. I tried it out, with a small change to the large benchmark to send 512MB rather than 100MB. Pasting results below. Overall, this is looking really good. I'll note that I didn't quite see a 15% speedup; it was more around 8%. The memory efficiency gains look really good: it went from allocating 7MB to only needing 0.3MB to send 512MB.
Master branch:
preetha@preetha-work ~/go/src/github.com/hashicorp/yamux (master) $go test -bench=. -benchmem
2017/10/04 17:26:55 [WARN] yamux: failed to send ping reply: connection write timeout
goos: linux
goarch: amd64
pkg: github.com/hashicorp/yamux
BenchmarkPing-4 200000 7455 ns/op 853 B/op 16 allocs/op
BenchmarkAccept-4 100000 11425 ns/op 2133 B/op 33 allocs/op
BenchmarkSendRecv-4 500000 3846 ns/op 289 B/op 6 allocs/op
BenchmarkSendRecvLarge-4 20 107960879 ns/op 7730291 B/op 20526 allocs/op
PASS
ok github.com/hashicorp/yamux 14.524s
Your changes:
preetha@preetha-work ~/go/src/github.com/hashicorp/yamux (perf) $go test -bench=. -benchmem
2017/10/04 17:28:13 [WARN] yamux: failed to send ping reply: connection write timeout
goos: linux
goarch: amd64
pkg: github.com/hashicorp/yamux
BenchmarkPing-4 200000 6416 ns/op 624 B/op 11 allocs/op
BenchmarkAccept-4 200000 9879 ns/op 1313 B/op 18 allocs/op
BenchmarkSendRecv-4 500000 3524 ns/op 80 B/op 2 allocs/op
BenchmarkSendRecvLarge-4 20 99652741 ns/op 337417 B/op 7412 allocs/op
I've got a couple of questions in the code review; other than that, this is looking good.
If you sample recvBuf.Cap() after receiving 100MB, you will see that it is almost 100MB (before the change). After the change, it is closer to the maximum recv window size.
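One hypothetical way to sample that (assuming recvBuf behaves like a bytes.Buffer exposing Cap(), as the call above suggests; the stream type and field names here are placeholders, not the yamux ones): log the capacity after each receive and watch it grow over the lifetime of a stream.

package sketch

import (
	"bytes"
	"log"
)

type stream struct {
	recvBuf bytes.Buffer
}

// logRecvCap is illustrative debug instrumentation only: call it after each
// receive to observe how the buffer's capacity grows.
func (s *stream) logRecvCap() {
	log.Printf("recvBuf cap=%d len=%d", s.recvBuf.Cap(), s.recvBuf.Len())
}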
@stuartcarnie thanks for the explanation. I pushed just the benchmark to master with some debug output. I saw the recvBuf grow up to 16MB when sending 512MB, and indeed with your changes it caps out at 2x the window size, or 0.5MB.
I am closing this PR, as a number of additional commits have unintentionally been pulled in.
Are you going to open a new PR? These changes seem very useful.
@stuartcarnie There is definite interest in these changes -- it should be easy enough to force push this branch and put the other changes into other PRs. Are you planning on following up?
@jefferai @erikdubbelboer @stuartcarnie I made a new PR that includes the relevant changes (with commit credits to @stuartcarnie ) https://github.com/hashicorp/yamux/pull/53
thanks @preetapan
improve memory utilization in receive buffer, fix flow control
Read consumed the entire buffer
Obligatory benchmarks