Closed greatroar closed 3 years ago
The test failure is the one already fixed in #387.
I think what I would more ideally like to see, is things to marshal into a slice that has space for the length already in it, but just unset. Then before writing it, the length is then written into that header, and then written as a whole.
The sync.Pool work being done here probably isn’t the correct solution either. Returning a pointer to an array is really weird when the array already has pointer semantics. And in fact, on a 64-bit machine, we end up with just as much space using the pointer to an array as we would using make([]byte, 4):
```
unsafe.Sizeof(uintptr) /* pointer to */ + unsafe.Sizeof(uintptr) + unsafe.Sizeof(int) /* array header */ + 4*unsafe.Sizeof(byte) = 28
unsafe.Sizeof(reflect.SliceHeader) + 4*unsafe.Sizeof(byte) = 28
```
I think that probably everything here would still continue to work if we break up all of these writes, but I have been bitten before by weird segmentation issues where Writes are not being done as a whole together. There’s likely already a buffer pulling the various writes together into a single transmission, since writing a bunch of half-packets would be quite inefficient… (Nagle’s Algorithm, etc.)
> I think what I would more ideally like to see, is things to marshal into a slice that has space for the length already in it, but just unset. Then before writing it, the length is then written into that header, and then written as a whole.
The code in this PR is not as elegant as it could be, but it's relatively local. I can try to implement a more thorough solution, but it will (1) be a much larger patch, and (2) the packet and the payload really need to be separated to get an O(1) sendPacket rather than the current O(n).
> Returning a pointer to an array is really weird when the array already has pointer semantics.
I think you're confusing arrays and slices. A [4]byte is just four bytes, without a slice header, and with value semantics. When you assign one to an interface{}, it gets copied to the heap. Try
```go
func BenchmarkArrayPool(b *testing.B) {
	pool := sync.Pool{New: func() interface{} { return [4]byte{} }}
	for i := 0; i < b.N; i++ {
		a := pool.Get().([4]byte)
		pool.Put(a) // array copied to the heap
	}
}

func BenchmarkPointerPool(b *testing.B) {
	pool := sync.Pool{New: func() interface{} { return new([4]byte) }}
	for i := 0; i < b.N; i++ {
		a := pool.Get().(*[4]byte)
		pool.Put(a)
	}
}

func BenchmarkSlicePool(b *testing.B) {
	pool := sync.Pool{New: func() interface{} { return make([]byte, 4) }}
	for i := 0; i < b.N; i++ {
		a := pool.Get().([]byte)
		pool.Put(a) // slice header copied to the heap
	}
}
```
On my machine, that gives
```
BenchmarkArrayPool-8     28204717    42.4 ns/op    4 B/op    1 allocs/op
BenchmarkPointerPool-8   69263575    17.3 ns/op    0 B/op    0 allocs/op
BenchmarkSlicePool-8     19285311    86.2 ns/op   32 B/op    1 allocs/op
```
> I have been bitten before by weird segmentation issues where Writes are not being done as a whole together.
With this library? Because I don't see how the SFTP protocol could be affected.
So, I’ve played around with this a bit, and came up with the commits in https://github.com/pkg/sftp/tree/more-optimization (which also deal with some subtle race conditions). By just allocating with room for the length and starting after it, it’s a lot of lines-of-code change, but simpler in the end, without a need for a sync.Pool. (Oh, and I found a few incorrect length calculations that could have :woman_shrugging: been causing reallocations? Results unclear; go 1.15 at least seems to allocate with some buffer room sometimes.)
The other improvements were mostly with an eye towards improving the client.go code as well… but the results of my first go are… hazy about how good they might be. They’re significantly better under write delay, but only “meh” at best without… (more concurrency == more overhead, but also better responsiveness per channel when slow). And does ~20 kB/op really matter, when both maybe-solutions can get about the same throughput, and we’re already cutting out 100s of MB/op? :woman_shrugging:
I’ll have something tomorrow either way: either the code sitting uncommitted on my laptop, or a restart that isn’t overhauling client.go as much.
I've benchmarked your branch @ 113f5e9b80e37152e3412ee9a2d6fbd8c61e9b37 against master:
```
name old time/op new time/op delta
Write1k-8 312ms ±12% 311ms ± 6% ~ (p=0.796 n=10+10)
Write16k-8 56.9ms ± 2% 32.7ms ± 4% -42.62% (p=0.000 n=10+10)
Write32k-8 43.2ms ± 3% 17.6ms ± 5% -59.17% (p=0.000 n=10+10)
Write128k-8 31.1ms ± 2% 14.7ms ± 8% -52.77% (p=0.000 n=10+10)
Write512k-8 17.5ms ± 5% 10.2ms ±10% -41.66% (p=0.000 n=10+10)
Write1MiB-8 14.9ms ± 1% 9.2ms ± 3% -38.19% (p=0.000 n=8+10)
name old speed new speed delta
Write1k-8 33.7MB/s ±11% 33.8MB/s ± 6% ~ (p=0.796 n=10+10)
Write16k-8 184MB/s ± 2% 321MB/s ± 4% +74.35% (p=0.000 n=10+10)
Write32k-8 243MB/s ± 3% 595MB/s ± 5% +145.04% (p=0.000 n=10+10)
Write128k-8 337MB/s ± 2% 715MB/s ± 8% +112.43% (p=0.000 n=10+10)
Write512k-8 599MB/s ± 5% 1028MB/s ±10% +71.65% (p=0.000 n=10+10)
Write1MiB-8 702MB/s ± 1% 1136MB/s ± 3% +61.84% (p=0.000 n=8+10)
name old alloc/op new alloc/op delta
Write1k-8 61.6MB ± 0% 35.7MB ± 0% -42.02% (p=0.000 n=10+10)
Write16k-8 27.9MB ± 0% 2.2MB ± 0% -91.99% (p=0.000 n=10+10)
Write32k-8 29.9MB ± 0% 1.1MB ± 0% -96.26% (p=0.000 n=10+9)
Write128k-8 29.2MB ± 0% 0.3MB ± 0% -98.87% (p=0.000 n=10+9)
Write512k-8 29.0MB ± 0% 0.1MB ± 0% -99.55% (p=0.000 n=10+10)
Write1MiB-8 28.9MB ± 0% 0.1MB ± 0% -99.66% (p=0.000 n=10+10)
name old allocs/op new allocs/op delta
Write1k-8 92.2k ± 0% 82.0k ± 0% -11.12% (p=0.000 n=9+10)
Write16k-8 5.81k ± 0% 5.16k ± 0% -11.23% (p=0.000 n=10+9)
Write32k-8 2.93k ± 0% 2.60k ± 0% -11.35% (p=0.000 n=10+10)
Write128k-8 2.45k ± 0% 2.12k ± 0% -13.56% (p=0.002 n=8+10)
Write512k-8 2.33k ± 0% 2.00k ± 0% -14.30% (p=0.000 n=9+9)
Write1MiB-8 2.31k ± 0% 1.98k ± 0% -14.42% (p=0.000 n=6+9)
```
And against this PR (new = yours):
```
name old time/op new time/op delta
Write1k-8 326ms ± 7% 311ms ± 6% -4.54% (p=0.043 n=10+10)
Write16k-8 35.5ms ± 7% 32.7ms ± 4% -7.93% (p=0.000 n=10+10)
Write32k-8 19.8ms ± 5% 17.6ms ± 5% -10.89% (p=0.000 n=10+10)
Write128k-8 14.9ms ± 4% 14.7ms ± 8% ~ (p=0.971 n=10+10)
Write512k-8 10.4ms ± 4% 10.2ms ±10% ~ (p=0.912 n=10+10)
Write1MiB-8 9.33ms ± 1% 9.23ms ± 3% ~ (p=0.182 n=9+10)
Write4MiB-8 8.86ms ± 7% 8.81ms ± 4% ~ (p=0.481 n=10+10)
name old speed new speed delta
Write1k-8 32.3MB/s ± 7% 33.8MB/s ± 6% +4.68% (p=0.045 n=10+10)
Write16k-8 296MB/s ± 7% 321MB/s ± 4% +8.53% (p=0.000 n=10+10)
Write32k-8 530MB/s ± 5% 595MB/s ± 5% +12.20% (p=0.000 n=10+10)
Write128k-8 702MB/s ± 4% 715MB/s ± 8% ~ (p=0.971 n=10+10)
Write512k-8 1.01GB/s ± 4% 1.03GB/s ±10% ~ (p=0.912 n=10+10)
Write1MiB-8 1.12GB/s ± 1% 1.14GB/s ± 3% ~ (p=0.182 n=9+10)
Write4MiB-8 1.17GB/s ± 5% 1.19GB/s ± 4% ~ (p=0.243 n=9+10)
name old alloc/op new alloc/op delta
Write1k-8 35.7MB ± 0% 35.7MB ± 0% -0.01% (p=0.000 n=9+10)
Write16k-8 2.24MB ± 0% 2.24MB ± 0% -0.03% (p=0.000 n=10+10)
Write32k-8 1.12MB ± 0% 1.12MB ± 0% -0.05% (p=0.000 n=10+9)
Write128k-8 330kB ± 0% 330kB ± 0% -0.17% (p=0.000 n=10+9)
Write512k-8 132kB ± 0% 131kB ± 0% -0.35% (p=0.000 n=10+10)
Write1MiB-8 98.6kB ± 0% 98.1kB ± 0% -0.49% (p=0.000 n=10+10)
Write4MiB-8 72.1kB ± 0% 71.6kB ± 0% -0.67% (p=0.000 n=10+10)
name old allocs/op new allocs/op delta
Write1k-8 82.0k ± 0% 82.0k ± 0% -0.03% (p=0.000 n=10+10)
Write16k-8 5.17k ± 0% 5.16k ± 0% -0.18% (p=0.000 n=10+9)
Write32k-8 2.60k ± 0% 2.60k ± 0% -0.35% (p=0.000 n=9+10)
Write128k-8 2.12k ± 0% 2.12k ± 0% -0.40% (p=0.000 n=10+10)
Write512k-8 2.00k ± 0% 2.00k ± 0% -0.45% (p=0.000 n=9+9)
Write1MiB-8 1.99k ± 0% 1.98k ± 0% -0.45% (p=0.000 n=10+9)
Write4MiB-8 1.97k ± 0% 1.96k ± 0% -0.46% (p=0.000 n=10+10)
```
That's quite a bit better for small writes!
I’ve pushed another commit to the branch with the alternate concurrency pattern. Ideally benchmarking them with the 4MiB with Delays should demonstrate the better concurrency performance.
Perhaps your benchmark tool output can show some better statistical data than I get with my single runs, to get an idea of whether this commit would be better to just back out.
Here's new=113f5e9b80e37152e3412ee9a2d6fbd8c61e9b37 against old=2b85c58483d85114825b740918f7fed96cf330cf (the delayed benchmarks won't run on master):
```
name old time/op new time/op delta
Write4MiBDelay10Msec-8 260ms ± 0% 259ms ± 0% -0.40% (p=0.000 n=8+8)
Write4MiBDelay50Msec-8 1.26s ± 0% 1.26s ± 0% -0.06% (p=0.000 n=7+9)
Write4MiBDelay150Msec-8 3.76s ± 0% 3.76s ± 0% -0.05% (p=0.000 n=8+10)
name old speed new speed delta
Write4MiBDelay10Msec-8 40.4MB/s ± 0% 40.5MB/s ± 0% +0.43% (p=0.000 n=9+8)
Write4MiBDelay50Msec-8 8.32MB/s ± 0% 8.32MB/s ± 0% ~ (all equal)
Write4MiBDelay150Msec-8 2.79MB/s ± 0% 2.79MB/s ± 0% ~ (all equal)
name old alloc/op new alloc/op delta
Write4MiBDelay10Msec-8 42.0MB ± 0% 10.6MB ± 0% -74.84% (p=0.000 n=10+9)
Write4MiBDelay50Msec-8 42.0MB ± 0% 10.6MB ± 0% -74.83% (p=0.000 n=9+9)
Write4MiBDelay150Msec-8 42.0MB ± 0% 10.6MB ± 0% -74.83% (p=0.000 n=10+10)
name old allocs/op new allocs/op delta
Write4MiBDelay10Msec-8 2.61k ± 0% 2.61k ± 0% -0.13% (p=0.000 n=10+8)
Write4MiBDelay50Msec-8 2.62k ± 0% 2.62k ± 0% -0.09% (p=0.010 n=9+9)
Write4MiBDelay150Msec-8 2.62k ± 0% 2.62k ± 0% -0.08% (p=0.006 n=10+10)
```
With f80745d84ca72104a8f0635ae1b6d6e1a91c600d applied, this becomes (old is now 113f5e9b80e37152e3412ee9a2d6fbd8c61e9b37):
```
name old time/op new time/op delta
Write4MiBDelay10Msec-8 259ms ± 0% 91ms ± 1% -64.78% (p=0.000 n=8+9)
Write4MiBDelay50Msec-8 1.26s ± 0% 0.41s ± 0% -67.34% (p=0.000 n=9+10)
Write4MiBDelay150Msec-8 3.76s ± 0% 1.21s ± 0% -67.76% (p=0.000 n=10+10)
name old speed new speed delta
Write4MiBDelay10Msec-8 40.5MB/s ± 0% 115.1MB/s ± 1% +183.93% (p=0.000 n=8+9)
Write4MiBDelay50Msec-8 8.32MB/s ± 0% 25.49MB/s ± 0% +206.35% (p=0.000 n=8+10)
Write4MiBDelay150Msec-8 2.79MB/s ± 0% 8.65MB/s ± 0% +210.04% (p=0.000 n=10+7)
name old alloc/op new alloc/op delta
Write4MiBDelay10Msec-8 10.6MB ± 0% 10.6MB ± 0% +0.18% (p=0.000 n=9+9)
Write4MiBDelay50Msec-8 10.6MB ± 0% 10.6MB ± 0% +0.16% (p=0.000 n=9+9)
Write4MiBDelay150Msec-8 10.6MB ± 0% 10.6MB ± 0% +0.19% (p=0.000 n=10+10)
name old allocs/op new allocs/op delta
Write4MiBDelay10Msec-8 2.61k ± 0% 3.00k ± 0% +15.26% (p=0.000 n=8+9)
Write4MiBDelay50Msec-8 2.62k ± 0% 3.01k ± 1% +14.96% (p=0.000 n=9+9)
Write4MiBDelay150Msec-8 2.62k ± 0% 3.02k ± 1% +15.33% (p=0.000 n=10+10)
```
If you want to try this too, here's what I did:
```
$ go install golang.org/x/perf/cmd/benchstat
$ TMPDIR=/dev/shm go test -run=- -integration -bench='Write4MiBDelay' -benchmem -count=10 | tee -a log0
# similarly for log1, after checking out a different commit
$ benchstat log0 log1
```
Benchstat can also take more than two logs as input, but then it can't do significance testing.
Maybe the concurrent WriteAt can be simplified using golang.org/x/sync/errgroup?
Thanks. I’m already eyeing the WriteAt and ReadFrom. But like I said, while the delayed writes and reads are way faster, the non-delayed ones are where the question is whether it’s close enough to a wash to justify shifting so that long pipes are faster. I mean, as well, benchmarks are only rough gauges of actual performance… the dropped memory allocation is definitely a big win, what with multiple factors of 10 being dropped, but the throughput at this stage is… hazy.
Thanks for the link to run the benchmarks myself, though. I might check it out a bit more over the weekend or the coming week. Get a feel for what the right direction to move forward is. (Even if that’s throwing out the concurrency model change I just wrote. :laughing: )
Ideas were wrapped into https://github.com/pkg/sftp/pull/397. Thanks @greatroar for the contribution! I’m sure we will all see much better performance as a result. (Even if it took me a while to get around to getting it all done.)
This patch greatly improves the throughput of sendPacket by avoiding a copy of each packet before it is sent, and by avoiding copying the payload of Write and Data packets at all.
Benchmarked against OpenSSH using
on Linux/amd64. Results:
I had to fix the benchmark to get this to work; the 4MiB version still hangs indefinitely.