pkg/sftp

SFTP support for the go.crypto/ssh package
BSD 2-Clause "Simplified" License

Pool writeWork packets in parallel WriteTo #415

Closed greatroar closed 3 years ago

greatroar commented 3 years ago

This solves the common problem that storing []byte values in a sync.Pool causes many tiny allocations (and GC pressure): instead, the writeWork structures themselves are pooled. Benchmark results on Linux/amd64:

name                       old time/op    new time/op    delta
WriteTo1k-8                  12.7ms ± 2%    12.6ms ± 2%     ~     (p=0.436 n=10+10)
WriteTo16k-8                 12.6ms ± 3%    12.6ms ± 2%     ~     (p=0.971 n=10+10)
WriteTo32k-8                 12.6ms ± 3%    12.7ms ± 4%     ~     (p=0.912 n=10+10)
WriteTo128k-8                12.5ms ± 3%    12.5ms ± 2%     ~     (p=0.796 n=10+10)
WriteTo512k-8                12.6ms ± 2%    12.5ms ± 3%     ~     (p=0.063 n=10+10)
WriteTo1MiB-8                12.6ms ± 3%    12.6ms ± 2%     ~     (p=0.853 n=10+10)
WriteTo4MiB-8                12.6ms ± 4%    12.6ms ± 4%     ~     (p=0.912 n=10+10)
WriteTo4MiBDelay10Msec-8      109ms ± 1%     108ms ± 0%     ~     (p=0.065 n=10+9)
WriteTo4MiBDelay50Msec-8      508ms ± 0%     510ms ± 0%   +0.21%  (p=0.000 n=9+10)
WriteTo4MiBDelay150Msec-8     1.51s ± 0%     1.51s ± 0%     ~     (p=1.000 n=9+10)

name                       old speed      new speed      delta
WriteTo1k-8                 828MB/s ± 2%   831MB/s ± 2%     ~     (p=0.436 n=10+10)
WriteTo16k-8                833MB/s ± 3%   833MB/s ± 2%     ~     (p=0.971 n=10+10)
WriteTo32k-8                828MB/s ± 2%   829MB/s ± 3%     ~     (p=0.604 n=9+10)
WriteTo128k-8               839MB/s ± 3%   838MB/s ± 2%     ~     (p=0.796 n=10+10)
WriteTo512k-8               831MB/s ± 2%   842MB/s ± 3%     ~     (p=0.063 n=10+10)
WriteTo1MiB-8               832MB/s ± 3%   833MB/s ± 2%     ~     (p=0.853 n=10+10)
WriteTo4MiB-8               830MB/s ± 4%   834MB/s ± 4%     ~     (p=0.912 n=10+10)
WriteTo4MiBDelay10Msec-8   96.5MB/s ± 1%  96.7MB/s ± 0%     ~     (p=0.061 n=10+9)
WriteTo4MiBDelay50Msec-8   20.6MB/s ± 0%  20.6MB/s ± 0%   -0.21%  (p=0.000 n=9+10)
WriteTo4MiBDelay150Msec-8  6.94MB/s ± 0%  6.94MB/s ± 0%     ~     (all equal)

name                       old alloc/op   new alloc/op   delta
WriteTo1k-8                  15.8MB ± 0%    15.8MB ± 0%     ~     (p=0.853 n=10+10)
WriteTo16k-8                 15.9MB ± 0%    15.8MB ± 0%     ~     (p=0.083 n=10+8)
WriteTo32k-8                 15.8MB ± 0%    15.8MB ± 0%     ~     (p=0.400 n=10+9)
WriteTo128k-8                15.8MB ± 0%    15.8MB ± 0%     ~     (p=1.000 n=10+10)
WriteTo512k-8                15.8MB ± 0%    15.8MB ± 0%     ~     (p=0.673 n=8+9)
WriteTo1MiB-8                15.8MB ± 0%    15.8MB ± 0%     ~     (p=0.912 n=10+10)
WriteTo4MiB-8                15.9MB ± 0%    15.9MB ± 0%     ~     (p=0.447 n=9+10)
WriteTo4MiBDelay10Msec-8     18.8MB ± 0%    18.8MB ± 0%     ~     (p=0.393 n=10+10)
WriteTo4MiBDelay50Msec-8     32.1MB ± 0%    32.1MB ± 0%     ~     (p=0.393 n=10+10)
WriteTo4MiBDelay150Msec-8    48.8MB ± 0%    48.8MB ± 0%     ~     (p=1.000 n=7+10)

name                       old allocs/op  new allocs/op  delta
WriteTo1k-8                   2.71k ± 0%     2.39k ± 0%  -11.81%  (p=0.000 n=10+10)
WriteTo16k-8                  2.71k ± 0%     2.39k ± 0%  -11.75%  (p=0.000 n=10+9)
WriteTo32k-8                  2.71k ± 0%     2.39k ± 0%  -11.79%  (p=0.000 n=10+10)
WriteTo128k-8                 2.71k ± 0%     2.39k ± 0%  -11.81%  (p=0.000 n=10+10)
WriteTo512k-8                 2.71k ± 0%     2.39k ± 0%  -11.83%  (p=0.000 n=10+10)
WriteTo1MiB-8                 2.71k ± 0%     2.39k ± 0%  -11.80%  (p=0.000 n=10+10)
WriteTo4MiB-8                 2.71k ± 0%     2.39k ± 0%  -11.79%  (p=0.000 n=9+9)
WriteTo4MiBDelay10Msec-8      3.11k ± 0%     2.78k ± 0%  -10.45%  (p=0.000 n=9+10)
WriteTo4MiBDelay50Msec-8      3.13k ± 1%     2.81k ± 2%  -10.26%  (p=0.000 n=9+10)
WriteTo4MiBDelay150Msec-8     3.14k ± 0%     2.87k ± 1%   -8.85%  (p=0.000 n=8+10)

puellanivis commented 3 years ago

I was thinking about how we could solve the underlying issue: a sync.Pool of []byte does not perform well. Pooling the writeWork is a good idea, although this particular implementation is a little clunky (to be fair, I also could not find a better way to do it).

It occurred to me that we’re storing a very limited number of buffers, and they’re all a consistent size. Knowing this, we can save even more allocations and time by implementing a static pool ourselves. This drop-in replacement for sync.Pool meant I could fix not just this WriteTo implementation but the ReadFrom implementation as well, which looked even clunkier once we wanted to store more than just the []byte.

greatroar commented 3 years ago

Closing in favor of #417.