Fix window update, reduce memory utilization and improve performance

stuartcarnie commented 7 years ago

improve memory utilization in receive buffer, fix flow control

flow control was assuming a Read consumed the entire buffer
introduced fixed receive buffer on streams to reduce memory utilization when receiving large streams of data
use timer pool to reduce allocations
use static handler function pointer to avoid closure allocations for every frame

Obligatory benchmarks

benchmark                    old ns/op     new ns/op     delta
BenchmarkSendRecv-2          3942          3655          -7.28%
BenchmarkSendRecvLarge-2     28408768      23931960      -15.76%

benchmark                    old allocs     new allocs     delta
BenchmarkSendRecv-2          6              2              -66.67%
BenchmarkSendRecvLarge-2     4012           1236           -69.19%

benchmark                    old bytes     new bytes     delta
BenchmarkSendRecv-2          302           83            -72.52%
BenchmarkSendRecvLarge-2     1538987       57536         -96.26%

stuartcarnie commented 7 years ago

This also addresses issue #43

The Large benchmark now caps out at 512k

stuartcarnie commented 7 years ago

This gist shows real-world numbers streaming a lot of data between two nodes

tl;dr

total allocations went from 8.02GB down to 2.87MB or a 2,808 times improvement
in use memory (for large stream) went from 4.01GB down to 815.27KB or a 4,920 times improvement

stuartcarnie commented 7 years ago

I will add a upper message type constant, so that the if won’t need changing when validating the range and how this issue is typically addressed.

stuartcarnie commented 7 years ago

Here is the original code, where you will notice the call to newobject, because the compiler is capturing s to make handler

    431            .          .         // Switch on the type 
    432            .          .         switch hdr.MsgType() { 
    433            .          .         case typeData: 
    434          3MB        3MB             handler = s.handleStreamMessage 
                   .          .  1109973: MOVQ AX, 0(SP)                                      session.go:434
                 3MB        3MB  1109977: CALL runtime.newobject(SB)                          session.go:434
                   .          .  110997c: MOVQ 0x8(SP), AX                                  session.go:434
                   .          .  1109981: LEAQ 0x11f98(IP), CX                              session.go:434
                   .          .  1109988: MOVQ CX, 0(AX)                                      session.go:434
                   .          .  110998b: TESTB AL, 0(AX)                                    session.go:434
                   .          .  110998d: MOVL 0x14a4dd(IP), DX                            session.go:434
                   .          .  1109993: LEAQ 0x8(AX), BX                                  session.go:434
                   .          .  1109997: TESTL DX, DX                                      session.go:434
                   .          .  1109999: JNE 0x1109a05                                    session.go:434
                   .          .  110999b: MOVQ 0xb8(SP), BX                                session.go:434
                   .          .  11099a3: MOVQ BX, 0x8(AX)                                  session.go:434
                   .          .  11099a7: MOVQ 0x50(SP), CX                                session.go:434
                   .          .  1109a0d: MOVQ BX, 0(SP)                                      session.go:434
                   .          .  1109a11: MOVQ 0xb8(SP), AX                                session.go:434
                   .          .  1109a19: MOVQ AX, 0x8(SP)                                  session.go:434
                   .          .  1109a1e: CALL runtime.writebarrierptr(SB)                  session.go:434
                   .          .  1109a23: MOVQ 0x80(SP), AX                                session.go:434
                   .          .  1109a2b: LEAQ 0x11eee(IP), CX                              session.go:434
                   .          .  1109a32: MOVQ 0xb8(SP), BX                                session.go:434
                   .          .  1109a3a: JMP 0x11099a7                                    session.go:434
                   .          .  1109a3f: LEAQ 0x3d0da(IP), AX                              session.go:434

erikdubbelboer commented 7 years ago

I think I found the reason it allocates and your change will still make it allocate:

import (
  "testing"
)

type X struct {
}

func (x *X) switchFunc(a *int) {
}

func BenchmarkSwitch(b *testing.B) {
  x := X{}
  handlers := []func(*X, *int){
    0: (*X).switchFunc,
  }
  for i := 0; i < b.N; i++ {
    a := 1

    x.switchFunc(&a) // Doesn't allocate because it can do proper escape analysis because it knows what switchFunc does.

    h := x.switchFunc
    h(nil) // still doesn't allocate because nil.
    h(&a)  // Allocates because it can't do proper escape analysis for 'a' because it doesn't know what code h executes.

    j := handlers[0]
    j(&x, nil) // Doesn't allocate because nil.
    j(&x, &a)  // Still allocates for the same reason above.
  }
}

So the only way to write this to not make it allocate is to write it out fully:

  switch hdr.MsgType() {
  case typeData:
    if err := s.handleStreamMessage(hdr); err != nil {
      return err
    }
  case typeWindowUpdate:
    if err := s.handleStreamMessage(hdr); err != nil {
      return err
    }
  case typeGoAway:          
    if err := s.handleGoAway(hdr); err != nil {
      return err
    }
  case typePing:                        
    if err := s.handlePing(hdr); err != nil {
      return err
    }
  default:                                          
    return ErrInvalidMsgType                            
  }

stuartcarnie commented 7 years ago

@erikdubbelboer it is not allocating because of a value passed to the function, it is allocating because the original logic was creating a closure, capturing s. That is what I demonstrated in the assembly dump. If you look at the assembly for the new code, there is no newobject call

               if err := handlers[mt](s, hdr); err != nil {
  0x1109a80             488b35995a1200          MOVQ 0x125a99(IP), SI
  0x1109a87             488b3d9a5a1200          MOVQ 0x125a9a(IP), DI
// bounds check of handlers slice
  0x1109a8e             0fb6db                  MOVZX BL, BX
  0x1109a91             4839fb                  CMPQ DI, BX
// and jump to `panicindex` if CMPQ is BX is outside range
  0x1109a94             0f83e5020000            JAE 0x1109d7f
// find address of function pointer in handlers array for mt; SI contains address of handlers slice
  0x1109a9a             488b14de                MOVQ 0(SI)(BX*8), DX
  0x1109a9e             488b9c2498000000        MOVQ 0x98(SP), BX
// load hdr slice onto stack
  0x1109aa6             48891c24                MOVQ BX, 0(SP)
  0x1109aaa             48894c2408              MOVQ CX, 0x8(SP)
  0x1109aaf             4889442410              MOVQ AX, 0x10(SP)
  0x1109ab4             488b742448              MOVQ 0x48(SP), SI
  0x1109ab9             4889742418              MOVQ SI, 0x18(SP)
// load function pointer of Stream handler into DI
  0x1109abe             488b3a                  MOVQ 0(DX), DI
// call handler
  0x1109ac1             ffd7                    CALL DI

erikdubbelboer commented 7 years ago

Why would it create a closure? That's not how closures work as far as I know. A closure would be when the code of the function is define there as well so it can use the scope around it.

func a() {
  b := 1
  c := func() { // c is now a closure that captures b.
    println(b)
  }
}

I'm 99% sure that your version still allocates because it can't know what the code will do with hdr so it needs to allocate it on the stack. The allocation probably just happens before the dump you posted. The reason it happens inside the case in the original code might be because the default case returns immediately and doesn't use hdr.

Try my code with go test -bench . -benchmem and uncomment the different variations:

package main

import (
  "testing"
)

var g []byte

type X struct {
}

func (x *X) switchFunc(a []byte) {
  //g = a // Uncommenting this will always make it allocate as it always escapes.
}

func BenchmarkSwitch(b *testing.B) {
  x := &X{}
  handlers := []func(*X, []byte){
    0: (*X).switchFunc,
  }
  _ = handlers
  for i := 0; i < b.N; i++ {
    a := make([]byte, 12)

    x.switchFunc(a) // Doesn't allocate because it can do proper escape analysis because it knows what switchFunc does.

    h := x.switchFunc
    h(nil) // still doesn't allocate because nil.
    //h(a)   // Allocates because it can't do proper escape analysis for 'a' because it doesn't know what code h executes.

    j := handlers[0]
    j(x, nil) // Doesn't allocate because nil.
    //j(x, a)   // Still allocates for the same reason above.

    //handlers[0](x, a) // Still allocates for the same reason above.

    // Doesn't allocate because nil, even though its a closure. The Go compiler is smart.
    k := func() {
      x.switchFunc(nil)
    }
    k()

    // This allocates because it allocates a closure? Or the closure prevents escape analysis?
    //l := func() {
    //  x.switchFunc(a)
    //}
    //l()
  }
}

stuartcarnie commented 7 years ago

It appears to be a problem with escape analysis. The following will allocate memory in the methodValue function. Note the placement of the handler variable, outside the for loop.

package methods

import (
    "fmt"
    "testing"
    "math/rand"
)

type foo struct {
    s string
}

func (f *foo) handle1(v string) bool {
    fmt.Println(v)
    return true
}

func (f *foo) handle2(v string) bool {
    fmt.Println(f.s + v)
    return true
}

func (f *foo) methodValue(i int) {
    var handler func(string) bool
    for {
        switch i {
        case 0:
            handler = f.handle1

        case 1:
            handler = f.handle2

        default:
            panic("bad")
        }

        handler("hello")
        return
    }
}

func TestFooMethodValue(t *testing.T) {
    f := &foo{}
    f.methodValue(rand.Int() & 1)
}

Move the handler variable inside the for loop, and it no longer allocates.

func (f *foo) methodValue(i int) {
    for {
        var handler func(string) bool
        switch i {
        case 0:
            handler = f.handle1

        case 1:
            handler = f.handle2

        default:
            panic("bad")
        }

        handler("hello")
        return
    }
}

stuartcarnie commented 7 years ago

running go build -gcflags '-m' demonstrates that escape analysis is failing when capturing the Method Value. Note that in the explanation of a method value I have emphasized what I was explaining about a copy of that method value (the newobject call in the disassembly).

If the expression x has static type T and M is in the method set of type T, x.M is called a method value. The method value x.M is a function value that is callable with the same arguments as a method call of x.M. The expression x is evaluated and saved during the evaluation of the method value; the saved copy is then used as the receiver in any calls, which may be executed later.

Escape analysis failed to determine that this copy did not escape, thus the newobject call on these lines.

Here is an output of the escape analysis when the handler value is outside the for loop, demonstrating the compile is creating a copy of the method value.

) go build -gcflags '-m' ./cmd/methods
# github.com/stuartcarnie/play/cmd/methods
... lines removed for brevity
cmd/methods/method.go:22:29: leaking param: f
cmd/methods/method.go:27:15: f.handle1 escapes to heap  <- here capturing the method value of foo
cmd/methods/method.go:30:15: f.handle2 escapes to heap  <- here capturing the method value of foo
cmd/methods/method.go:42:12: &foo literal escapes to heap
... lines removed for brevity

after I move the handler var into the loop, it no longer escapes

) go build -gcflags '-m' ./cmd/methods
# github.com/stuartcarnie/play/cmd/methods
... lines removed for brevity
cmd/methods/method.go:22:29: leaking param: f
cmd/methods/method.go:27:15: (*foo).methodValue f.handle1 does not escape
cmd/methods/method.go:30:15: (*foo).methodValue f.handle2 does not escape
cmd/methods/method.go:42:12: &foo literal escapes to heap
... lines removed for brevity

erikdubbelboer commented 7 years ago

I see you were right, I was looking at the wrong allocation. Your way indeed saves one allocation, but I still think writing out the full switch is better. Your new solution still suffers from the issue where it can't do proper escape analysis on hdr making it always escape. I haven't look at the code to see if other functions make hdr escape anyways. But I think having a couple of lines code less is not worth it when it prevent escape analysis from working properly.

(The original code also prevented proper escape analysis on hdr)

stuartcarnie commented 7 years ago

In this case we want hdr allocated on the heap so it is reused for each iteration of the loop, to avoid unnecessary allocations :)

erikdubbelboer commented 7 years ago

If it's on the stack it can still be reused since the frame for recvLoop will stay on the stack. But I guess it doesn't matter that much since it's only one allocation per connection.

preetapan commented 7 years ago

@stuartcarnie Thanks for the patch. I tried it out, with a small change to the large benchmark to send 512MB rather than 100MB. Pasting results below. Overall, this is looking really good. I'll note that I didn't quite see a 15% speedup, it was more around 8%. Memory usage efficiency gains look really good, it went from allocating 7MB to only needing to allocate 0.3MB, to send 512MB.

Master branch:

preetha@preetha-work ~/go/src/github.com/hashicorp/yamux (master) $go test -bench=. -benchmem
2017/10/04 17:26:55 [WARN] yamux: failed to send ping reply: connection write timeout
goos: linux
goarch: amd64
pkg: github.com/hashicorp/yamux
BenchmarkPing-4               200000          7455 ns/op         853 B/op         16 allocs/op
BenchmarkAccept-4             100000         11425 ns/op        2133 B/op         33 allocs/op
BenchmarkSendRecv-4           500000          3846 ns/op         289 B/op          6 allocs/op
BenchmarkSendRecvLarge-4          20     107960879 ns/op     7730291 B/op      20526 allocs/op
PASS
ok      github.com/hashicorp/yamux  14.524s

Your changes:

preetha@preetha-work ~/go/src/github.com/hashicorp/yamux (perf) $go test -bench=. -benchmem
2017/10/04 17:28:13 [WARN] yamux: failed to send ping reply: connection write timeout
goos: linux
goarch: amd64
pkg: github.com/hashicorp/yamux
BenchmarkPing-4               200000          6416 ns/op         624 B/op         11 allocs/op
BenchmarkAccept-4             200000          9879 ns/op        1313 B/op         18 allocs/op
BenchmarkSendRecv-4           500000          3524 ns/op          80 B/op          2 allocs/op
BenchmarkSendRecvLarge-4          20      99652741 ns/op      337417 B/op       7412 allocs/op

I've got a couple of questions in the code review, other than that this is looking good.

stuartcarnie commented 7 years ago

If you sample the recvBuf.Cap() after receiving 100MB, you will see that it is almost 100MB (before the change). After the change, it is closer to the maximum recv window size.

preetapan commented 7 years ago

@stuartcarnie thanks for the explanation. I pushed just the benchmark to master with some debug output. I saw the recvBuf grow up to 16MB when sending 512MB, and yes indeed with your changes it caps out at 2X the window size or 0.5mb.

stuartcarnie commented 7 years ago

I am closing this PR as a number of additional commits have unintentionally been pulled in

erikdubbelboer commented 7 years ago

Are you going to open a new PR? These changes seem very useful.

jefferai commented 6 years ago

@stuartcarnie There is definite interest in these changes -- it should be easy enough to force push this branch and put the other changes into other PRs. Are you planning on following up?

preetapan commented 6 years ago

@jefferai @erikdubbelboer @stuartcarnie I made a new PR that includes the relevant changes (with commit credits to @stuartcarnie ) https://github.com/hashicorp/yamux/pull/53

stuartcarnie commented 6 years ago

thanks @preetapan

hashicorp / yamux

Fix window update, reduce memory utilization and improve performance #50