pkg / sftp

SFTP support for the go.crypto/ssh package
BSD 2-Clause "Simplified" License

Write length and payload separately for double the throughput #388

Closed: greatroar closed this 3 years ago

greatroar commented 4 years ago

This patch greatly improves the throughput of sendPacket by avoiding a copy of each packet before it is sent, and by avoiding copying the payload of Write and Data packets at all.
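Roughly, the idea is to write the 4-byte length prefix, the marshaled packet header, and the payload as separate writes instead of first concatenating them into one buffer. A minimal sketch of that pattern (illustrative names only, not the actual pkg/sftp internals):

package example

import (
    "encoding/binary"
    "io"
)

// sendPacketAndPayload writes the length prefix, the marshaled packet
// header, and the payload in separate Write calls, so the payload is
// never copied into the packet buffer.
func sendPacketAndPayload(w io.Writer, header, payload []byte) error {
    var length [4]byte
    binary.BigEndian.PutUint32(length[:], uint32(len(header)+len(payload)))
    if _, err := w.Write(length[:]); err != nil {
        return err
    }
    if _, err := w.Write(header); err != nil {
        return err
    }
    if len(payload) == 0 {
        return nil
    }
    _, err := w.Write(payload)
    return err
}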

Benchmarked against OpenSSH using

TMPDIR=/dev/shm go test -run=- -integration -bench='kWrite([0-9]*k|1MiB)' -benchmem

on Linux/amd64. Results:

name         old time/op    new time/op    delta
Write1k-8       312ms ±12%     305ms ± 5%      ~     (p=0.684 n=10+10)
Write16k-8     56.9ms ± 2%    31.9ms ± 4%   -43.92%  (p=0.000 n=10+10)
Write32k-8     43.2ms ± 3%    18.1ms ± 6%   -58.22%  (p=0.000 n=10+10)
Write128k-8    31.1ms ± 2%    13.5ms ± 4%   -56.72%  (p=0.000 n=10+9)
Write512k-8    17.5ms ± 5%    10.1ms ± 4%   -42.30%  (p=0.000 n=10+10)
Write1MiB-8    14.9ms ± 1%     9.3ms ± 4%   -37.72%  (p=0.000 n=8+10)

name         old speed      new speed      delta
Write1k-8    33.7MB/s ±11%  34.4MB/s ± 5%      ~     (p=0.684 n=10+10)
Write16k-8    184MB/s ± 2%   329MB/s ± 4%   +78.39%  (p=0.000 n=10+10)
Write32k-8    243MB/s ± 3%   581MB/s ± 6%  +139.45%  (p=0.000 n=10+10)
Write128k-8   337MB/s ± 2%   778MB/s ± 4%  +131.11%  (p=0.000 n=10+9)
Write512k-8   599MB/s ± 5%  1038MB/s ± 4%   +73.29%  (p=0.000 n=10+10)
Write1MiB-8   702MB/s ± 1%  1128MB/s ± 4%   +60.67%  (p=0.000 n=8+10)

name         old alloc/op   new alloc/op   delta
Write1k-8      61.6MB ± 0%    35.7MB ± 0%   -42.01%  (p=0.000 n=10+10)
Write16k-8     27.9MB ± 0%     2.2MB ± 0%   -91.99%  (p=0.000 n=10+10)
Write32k-8     29.9MB ± 0%     1.1MB ± 0%   -96.26%  (p=0.000 n=10+7)
Write128k-8    29.2MB ± 0%     0.3MB ± 0%   -98.87%  (p=0.000 n=10+10)
Write512k-8    29.0MB ± 0%     0.1MB ± 0%   -99.55%  (p=0.000 n=10+10)
Write1MiB-8    28.9MB ± 0%     0.1MB ± 0%   -99.66%  (p=0.000 n=10+10)

name         old allocs/op  new allocs/op  delta
Write1k-8       92.2k ± 0%     82.0k ± 0%   -11.09%  (p=0.000 n=9+9)
Write16k-8      5.81k ± 0%     5.17k ± 0%   -11.07%  (p=0.000 n=10+10)
Write32k-8      2.93k ± 0%     2.60k ± 0%   -11.04%  (p=0.000 n=10+9)
Write128k-8     2.45k ± 0%     2.12k ± 0%   -13.21%  (p=0.000 n=8+10)
Write512k-8     2.33k ± 0%     2.00k ± 0%   -13.91%  (p=0.000 n=9+9)
Write1MiB-8     2.31k ± 0%     1.99k ± 0%   -14.03%  (p=0.000 n=6+10)

I had to fix the benchmark to get this to work; the 4MiB version still hangs indefinitely.

greatroar commented 4 years ago

The test failure is the one already fixed in #387.

puellanivis commented 4 years ago

I think what I would more ideally like to see is marshaling into a slice that already has space for the length at the front, but left unset. Then, before writing, the length is written into that header and the packet is written as a whole.

The sync.Pool work being done here probably isn’t the correct solution either. Returning a pointer to an array is really weird when the array already has pointer semantics. And in fact, on a 64-bit machine, we end up with just as much space using the pointer to an array as we would using make([]byte, 4).

I think that probably everything here would still continue to work if we break up all of these writes, but I have been bitten before by weird segmentation issues where Writes are not done as a whole. There are likely already buffers pulling the various writes together into a single transmission, since writing a bunch of half-packets would be quite inefficient… (Nagle’s algorithm, etc.)
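To make the suggestion concrete, a minimal sketch of the reserve-the-header approach (illustrative names, not the pkg/sftp API): marshal into a buffer that already sets aside 4 bytes for the length, fill the length in just before sending, and issue a single Write.

package example

import (
    "encoding/binary"
    "io"
)

// marshalWithLength reserves 4 bytes at the front of the buffer for the
// length, appends the packet type and fields, and fills the length in last.
func marshalWithLength(typ byte, fields []byte) []byte {
    b := make([]byte, 4, 4+1+len(fields)) // length placeholder, still unset
    b = append(b, typ)
    b = append(b, fields...)
    binary.BigEndian.PutUint32(b[:4], uint32(len(b)-4))
    return b
}

// sendWhole writes the fully assembled packet in a single call, so it is
// never split across Writes.
func sendWhole(w io.Writer, pkt []byte) error {
    _, err := w.Write(pkt)
    return err
}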

greatroar commented 4 years ago

I think what I would more ideally like to see is marshaling into a slice that already has space for the length at the front, but left unset. Then, before writing, the length is written into that header and the packet is written as a whole.

The code in this PR is not as elegant as it could be, but it's relatively local. I can try to implement a more thorough solution, but (1) it will be a much larger patch, and (2) the packet and the payload really need to be separated to get an O(1) sendPacket rather than the current O(n).

Returning a pointer to an array is really weird when the array already has pointer semantics.

I think you're confusing arrays and slices. A [4]byte is just four bytes, without a slice header, and with value semantics. When you assign one to an interface{}, it gets copied to the heap. Try

// In a standalone _test.go file:
package pool_test

import (
    "sync"
    "testing"
)

func BenchmarkArrayPool(b *testing.B) {
    pool := sync.Pool{New: func() interface{} { return [4]byte{} }}

    for i := 0; i < b.N; i++ {
        a := pool.Get().([4]byte)
        pool.Put(a) // array copied to the heap
    }
}

func BenchmarkPointerPool(b *testing.B) {
    pool := sync.Pool{New: func() interface{} { return new([4]byte) }}

    for i := 0; i < b.N; i++ {
        a := pool.Get().(*[4]byte)
        pool.Put(a)
    }
}

func BenchmarkSlicePool(b *testing.B) {
    pool := sync.Pool{New: func() interface{} { return make([]byte, 4) }}

    for i := 0; i < b.N; i++ {
        a := pool.Get().([]byte)
        pool.Put(a) // slice header copied to the heap
    }
}

On my machine, that gives

BenchmarkArrayPool-8        28204717            42.4 ns/op         4 B/op          1 allocs/op
BenchmarkPointerPool-8      69263575            17.3 ns/op         0 B/op          0 allocs/op
BenchmarkSlicePool-8        19285311            86.2 ns/op        32 B/op          1 allocs/op

I have been bitten before by weird segmentation issues where Writes are not done as a whole.

With this library? Because I don't see how the SFTP protocol could be affected.

puellanivis commented 4 years ago

So, I’ve played around with this a bit, and came up with the commits in https://github.com/pkg/sftp/tree/more-optimization (which also deal with some subtle race conditions), by just allocating with room for the length and marshaling after it. It’s a lot of lines of code changed, but simpler in the end, without a need for a sync.Pool. (Oh, and I found a few incorrect length calculations that could have :woman_shrugging: been causing reallocations? Results unclear; go 1.15 at least seems to be allocating with some buffer room sometimes.)

The other improvements were mostly with an eye towards improving the client.go code as well… but the results of my first go are… hazy about how good they might be. They’re significantly better under write delay, but only “meh” at best without… (more concurrency == more overhead, but also better responsiveness per channel when slow). And does ~20 kB/op really matter, when both maybe-solutions can get about the same throughput, and we’re already cutting out 100s of MB/op? :woman_shrugging:

I’ll have something tomorrow either way, either the code sitting uncommitted on my laptop, or a restart that isn’t overhauling client.go as much.

greatroar commented 4 years ago

I've benchmarked your branch @ 113f5e9b80e37152e3412ee9a2d6fbd8c61e9b37 against master:

name         old time/op    new time/op    delta
Write1k-8       312ms ±12%     311ms ± 6%      ~     (p=0.796 n=10+10)
Write16k-8     56.9ms ± 2%    32.7ms ± 4%   -42.62%  (p=0.000 n=10+10)
Write32k-8     43.2ms ± 3%    17.6ms ± 5%   -59.17%  (p=0.000 n=10+10)
Write128k-8    31.1ms ± 2%    14.7ms ± 8%   -52.77%  (p=0.000 n=10+10)
Write512k-8    17.5ms ± 5%    10.2ms ±10%   -41.66%  (p=0.000 n=10+10)
Write1MiB-8    14.9ms ± 1%     9.2ms ± 3%   -38.19%  (p=0.000 n=8+10)

name         old speed      new speed      delta
Write1k-8    33.7MB/s ±11%  33.8MB/s ± 6%      ~     (p=0.796 n=10+10)
Write16k-8    184MB/s ± 2%   321MB/s ± 4%   +74.35%  (p=0.000 n=10+10)
Write32k-8    243MB/s ± 3%   595MB/s ± 5%  +145.04%  (p=0.000 n=10+10)
Write128k-8   337MB/s ± 2%   715MB/s ± 8%  +112.43%  (p=0.000 n=10+10)
Write512k-8   599MB/s ± 5%  1028MB/s ±10%   +71.65%  (p=0.000 n=10+10)
Write1MiB-8   702MB/s ± 1%  1136MB/s ± 3%   +61.84%  (p=0.000 n=8+10)

name         old alloc/op   new alloc/op   delta
Write1k-8      61.6MB ± 0%    35.7MB ± 0%   -42.02%  (p=0.000 n=10+10)
Write16k-8     27.9MB ± 0%     2.2MB ± 0%   -91.99%  (p=0.000 n=10+10)
Write32k-8     29.9MB ± 0%     1.1MB ± 0%   -96.26%  (p=0.000 n=10+9)
Write128k-8    29.2MB ± 0%     0.3MB ± 0%   -98.87%  (p=0.000 n=10+9)
Write512k-8    29.0MB ± 0%     0.1MB ± 0%   -99.55%  (p=0.000 n=10+10)
Write1MiB-8    28.9MB ± 0%     0.1MB ± 0%   -99.66%  (p=0.000 n=10+10)

name         old allocs/op  new allocs/op  delta
Write1k-8       92.2k ± 0%     82.0k ± 0%   -11.12%  (p=0.000 n=9+10)
Write16k-8      5.81k ± 0%     5.16k ± 0%   -11.23%  (p=0.000 n=10+9)
Write32k-8      2.93k ± 0%     2.60k ± 0%   -11.35%  (p=0.000 n=10+10)
Write128k-8     2.45k ± 0%     2.12k ± 0%   -13.56%  (p=0.002 n=8+10)
Write512k-8     2.33k ± 0%     2.00k ± 0%   -14.30%  (p=0.000 n=9+9)
Write1MiB-8     2.31k ± 0%     1.98k ± 0%   -14.42%  (p=0.000 n=6+9)

And against this PR (new = yours):

name         old time/op    new time/op    delta
Write1k-8       326ms ± 7%     311ms ± 6%   -4.54%  (p=0.043 n=10+10)
Write16k-8     35.5ms ± 7%    32.7ms ± 4%   -7.93%  (p=0.000 n=10+10)
Write32k-8     19.8ms ± 5%    17.6ms ± 5%  -10.89%  (p=0.000 n=10+10)
Write128k-8    14.9ms ± 4%    14.7ms ± 8%     ~     (p=0.971 n=10+10)
Write512k-8    10.4ms ± 4%    10.2ms ±10%     ~     (p=0.912 n=10+10)
Write1MiB-8    9.33ms ± 1%    9.23ms ± 3%     ~     (p=0.182 n=9+10)
Write4MiB-8    8.86ms ± 7%    8.81ms ± 4%     ~     (p=0.481 n=10+10)

name         old speed      new speed      delta
Write1k-8    32.3MB/s ± 7%  33.8MB/s ± 6%   +4.68%  (p=0.045 n=10+10)
Write16k-8    296MB/s ± 7%   321MB/s ± 4%   +8.53%  (p=0.000 n=10+10)
Write32k-8    530MB/s ± 5%   595MB/s ± 5%  +12.20%  (p=0.000 n=10+10)
Write128k-8   702MB/s ± 4%   715MB/s ± 8%     ~     (p=0.971 n=10+10)
Write512k-8  1.01GB/s ± 4%  1.03GB/s ±10%     ~     (p=0.912 n=10+10)
Write1MiB-8  1.12GB/s ± 1%  1.14GB/s ± 3%     ~     (p=0.182 n=9+10)
Write4MiB-8  1.17GB/s ± 5%  1.19GB/s ± 4%     ~     (p=0.243 n=9+10)

name         old alloc/op   new alloc/op   delta
Write1k-8      35.7MB ± 0%    35.7MB ± 0%   -0.01%  (p=0.000 n=9+10)
Write16k-8     2.24MB ± 0%    2.24MB ± 0%   -0.03%  (p=0.000 n=10+10)
Write32k-8     1.12MB ± 0%    1.12MB ± 0%   -0.05%  (p=0.000 n=10+9)
Write128k-8     330kB ± 0%     330kB ± 0%   -0.17%  (p=0.000 n=10+9)
Write512k-8     132kB ± 0%     131kB ± 0%   -0.35%  (p=0.000 n=10+10)
Write1MiB-8    98.6kB ± 0%    98.1kB ± 0%   -0.49%  (p=0.000 n=10+10)
Write4MiB-8    72.1kB ± 0%    71.6kB ± 0%   -0.67%  (p=0.000 n=10+10)

name         old allocs/op  new allocs/op  delta
Write1k-8       82.0k ± 0%     82.0k ± 0%   -0.03%  (p=0.000 n=10+10)
Write16k-8      5.17k ± 0%     5.16k ± 0%   -0.18%  (p=0.000 n=10+9)
Write32k-8      2.60k ± 0%     2.60k ± 0%   -0.35%  (p=0.000 n=9+10)
Write128k-8     2.12k ± 0%     2.12k ± 0%   -0.40%  (p=0.000 n=10+10)
Write512k-8     2.00k ± 0%     2.00k ± 0%   -0.45%  (p=0.000 n=9+9)
Write1MiB-8     1.99k ± 0%     1.98k ± 0%   -0.45%  (p=0.000 n=10+9)
Write4MiB-8     1.97k ± 0%     1.96k ± 0%   -0.46%  (p=0.000 n=10+10)

That's quite a bit better for small writes!

puellanivis commented 4 years ago

I’ve pushed another commit to the branch with the alternate concurrency pattern. Ideally, benchmarking them with the 4MiB-with-delay benchmarks should demonstrate the better concurrency performance.

Perhaps your benchmark tooling can show some better statistical data than I get with my single runs, to get an idea of whether this commit would be better to just back out.

greatroar commented 4 years ago

Here's new=113f5e9b80e37152e3412ee9a2d6fbd8c61e9b37 against old=2b85c58483d85114825b740918f7fed96cf330cf (the delayed benchmarks won't run on master):

name                     old time/op    new time/op    delta
Write4MiBDelay10Msec-8      260ms ± 0%     259ms ± 0%   -0.40%  (p=0.000 n=8+8)
Write4MiBDelay50Msec-8      1.26s ± 0%     1.26s ± 0%   -0.06%  (p=0.000 n=7+9)
Write4MiBDelay150Msec-8     3.76s ± 0%     3.76s ± 0%   -0.05%  (p=0.000 n=8+10)

name                     old speed      new speed      delta
Write4MiBDelay10Msec-8   40.4MB/s ± 0%  40.5MB/s ± 0%   +0.43%  (p=0.000 n=9+8)
Write4MiBDelay50Msec-8   8.32MB/s ± 0%  8.32MB/s ± 0%     ~     (all equal)
Write4MiBDelay150Msec-8  2.79MB/s ± 0%  2.79MB/s ± 0%     ~     (all equal)

name                     old alloc/op   new alloc/op   delta
Write4MiBDelay10Msec-8     42.0MB ± 0%    10.6MB ± 0%  -74.84%  (p=0.000 n=10+9)
Write4MiBDelay50Msec-8     42.0MB ± 0%    10.6MB ± 0%  -74.83%  (p=0.000 n=9+9)
Write4MiBDelay150Msec-8    42.0MB ± 0%    10.6MB ± 0%  -74.83%  (p=0.000 n=10+10)

name                     old allocs/op  new allocs/op  delta
Write4MiBDelay10Msec-8      2.61k ± 0%     2.61k ± 0%   -0.13%  (p=0.000 n=10+8)
Write4MiBDelay50Msec-8      2.62k ± 0%     2.62k ± 0%   -0.09%  (p=0.010 n=9+9)
Write4MiBDelay150Msec-8     2.62k ± 0%     2.62k ± 0%   -0.08%  (p=0.006 n=10+10)

With f80745d84ca72104a8f0635ae1b6d6e1a91c600d applied, this becomes (old is now 113f5e9b80e37152e3412ee9a2d6fbd8c61e9b37):

name                     old time/op    new time/op     delta
Write4MiBDelay10Msec-8      259ms ± 0%       91ms ± 1%   -64.78%  (p=0.000 n=8+9)
Write4MiBDelay50Msec-8      1.26s ± 0%      0.41s ± 0%   -67.34%  (p=0.000 n=9+10)
Write4MiBDelay150Msec-8     3.76s ± 0%      1.21s ± 0%   -67.76%  (p=0.000 n=10+10)

name                     old speed      new speed       delta
Write4MiBDelay10Msec-8   40.5MB/s ± 0%  115.1MB/s ± 1%  +183.93%  (p=0.000 n=8+9)
Write4MiBDelay50Msec-8   8.32MB/s ± 0%  25.49MB/s ± 0%  +206.35%  (p=0.000 n=8+10)
Write4MiBDelay150Msec-8  2.79MB/s ± 0%   8.65MB/s ± 0%  +210.04%  (p=0.000 n=10+7)

name                     old alloc/op   new alloc/op    delta
Write4MiBDelay10Msec-8     10.6MB ± 0%     10.6MB ± 0%    +0.18%  (p=0.000 n=9+9)
Write4MiBDelay50Msec-8     10.6MB ± 0%     10.6MB ± 0%    +0.16%  (p=0.000 n=9+9)
Write4MiBDelay150Msec-8    10.6MB ± 0%     10.6MB ± 0%    +0.19%  (p=0.000 n=10+10)

name                     old allocs/op  new allocs/op   delta
Write4MiBDelay10Msec-8      2.61k ± 0%      3.00k ± 0%   +15.26%  (p=0.000 n=8+9)
Write4MiBDelay50Msec-8      2.62k ± 0%      3.01k ± 1%   +14.96%  (p=0.000 n=9+9)
Write4MiBDelay150Msec-8     2.62k ± 0%      3.02k ± 1%   +15.33%  (p=0.000 n=10+10)

If you want to try this too, here's what I did:

$ go install golang.org/x/perf/cmd/benchstat
$ TMPDIR=/dev/shm go test -run=- -integration -bench='Write4MiBDelay' -benchmem -count=10 | tee -a log0
# similarly for log1, after checking out a different commit
$ benchstat log0 log1

Benchstat can also take more than two logs as input, but then it can't do significance testing.

greatroar commented 4 years ago

Maybe the concurrent WriteAt can be simplified using golang.org/x/sync/errgroup?
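A rough sketch of what that could look like (writeChunkAt, chunkSize, and the semaphore bound here are hypothetical, not the pkg/sftp API):

package example

import "golang.org/x/sync/errgroup"

// concurrentWriteAt splits b into fixed-size chunks and writes them
// concurrently, collecting the first error via an errgroup.
func concurrentWriteAt(writeChunkAt func(b []byte, off int64) error, b []byte, off int64) error {
    const chunkSize = 32 * 1024 // hypothetical chunk size
    var g errgroup.Group
    // Bound the number of in-flight writes with a semaphore channel.
    sem := make(chan struct{}, 64)
    for start := 0; start < len(b); start += chunkSize {
        end := start + chunkSize
        if end > len(b) {
            end = len(b)
        }
        chunk, chunkOff := b[start:end], off+int64(start)
        sem <- struct{}{}
        g.Go(func() error {
            defer func() { <-sem }()
            return writeChunkAt(chunk, chunkOff)
        })
    }
    return g.Wait() // first non-nil error, after all writes finish
}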

puellanivis commented 4 years ago

Thanks. I’m already eyeing the WriteAt and ReadFrom. But like I said, while the delayed writes and reads are way faster, the non-delayed ones are where the question is whether it’s enough of a wash to justify shifting so that long pipes are faster. I mean, as well, benchmarks are only rough gauges of actual performance… the dropped memory allocation is definitely a big win, with multiple factors of 10 being dropped, but the throughput at this stage is… hazy.

Thanks for the link to run the benchmarks myself, though. I might check it out a bit more over the weekend or the coming week. Get a feel for what the right direction to move forward is. (Even if that’s throwing out the concurrency model change I just wrote. :laughing: )

puellanivis commented 3 years ago

Ideas were wrapped into https://github.com/pkg/sftp/pull/397. Thanks @greatroar for the contribution! I’m sure we will all see much better performance as a result. (Even if it took me a while to get around to getting it all done.)