rapier1 / hpn-ssh

HPN-SSH based on OpenSSH
https://psc.edu/hpn-ssh-home

Results of paper "SSH Performance" #26

Closed nh2 closed 3 years ago

nh2 commented 3 years ago

I read http://allanjude.com/bsd/AsiaBSDCon2017_-_SSH_Performance.pdf and it lists some issues and suggested improvements to HPN.

Is it documented anywhere which of those were already merged / what their status is?

CC @allanjude

allanjude commented 3 years ago

I think the main thing is just the interactive check, which should go directly in upstream.

I have rebased it here:

https://github.com/allanjude/openssh-portable/tree/openssh_interactive_window

I have also rebased (but not even tried to compile test yet) the original work from 2017 here:

https://github.com/allanjude/openssh-portable/tree/bsdcan2017_rebase but have not had time to clean it up properly.
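For anyone who hasn't read the paper, the interactive check is roughly this idea: keep the small default channel window for interactive sessions (a tty, no subsystem) and only advertise a large window for bulk transfers. A minimal sketch, with made-up names and sizes rather than the identifiers used in those branches:

```c
/*
 * Sketch only: choose a small initial channel window for interactive
 * sessions and a large one for bulk transfers. The names and sizes are
 * illustrative, not the actual OpenSSH/HPN-SSH identifiers.
 */
#include <stdint.h>

#define WINDOW_INTERACTIVE   (64 * 1024)        /* small window for shells */
#define WINDOW_BULK          (2 * 1024 * 1024)  /* large window for transfers */

static uint32_t
choose_initial_window(int have_tty, int is_subsystem)
{
	if (have_tty && !is_subsystem)
		return WINDOW_INTERACTIVE;
	return WINDOW_BULK;
}
```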

rapier1 commented 3 years ago

I haven't read this paper, so thank you for pointing it out. As an aside, we received some new funding and are planning a series of improvements to HPN-SSH that should position it for a 10Gb world. Hopefully. Ideas are easy but engineering is hard :) The start of this has been delayed because of other demands on my time, but we hope to have some preliminary results by the end of the year.

You can get more information from: https://www.psc.edu/hpn-ssh/community-guide

allanjude commented 3 years ago

The paper might be a good start. It was tested using 40G NICs in place of 10G, since most of the limits in my testing were in the 7-15 Gbps range.

I think unifying the buffer sizes, and making them 32k instead of 16k, made the biggest difference in performance.
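In case it's useful to spell out, that change amounts to something like the following: the I/O buffer size and the channel packet size become the same value, both raised from 16 KiB to 32 KiB, so data isn't re-chunked between layers. The macro names here are placeholders, not the actual identifiers:

```c
/* Illustrative only: unify the I/O buffer and channel packet sizes and
 * raise both from 16 KiB to 32 KiB so the layers stay in lockstep. */
#define SSH_IOBUFSZ          (32 * 1024)       /* was 16 * 1024 */
#define CHAN_PACKET_DEFAULT  SSH_IOBUFSZ       /* same size at every layer */
#define CHAN_WINDOW_DEFAULT  (64 * CHAN_PACKET_DEFAULT)
```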

I had dtrace flamegraphs showing that most of the time was being spent in memcpy() rather than on other things, so that helped.

rapier1 commented 3 years ago

@allanjude

I just rolled in some of the changes you had on your git repo. This includes the buffer changes and the NONE MAC. On my system it's clocking in at 150% faster than chacha20 and 30% faster than hpn-ssh with the none cipher. These are preliminary numbers, but that's a notable performance boost (1600Mbps faster is nothing to sneeze at). I'm not sure about the options logic (as you can have a null MAC with a legit cipher) but that can be resolved.
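For the options logic, the fix is probably just a sanity check along these lines, refusing a "none" MAC unless the NONE cipher switch is actually in effect. Function and argument names here are hypothetical, not the actual HPN-SSH options code:

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Sketch: only allow the "none" MAC when the NONE cipher switch is on.
 * Names are hypothetical, not the real HPN-SSH option handling. */
static void
check_none_mac(const char *cipher, const char *mac, int none_switch_enabled)
{
	if (strcmp(mac, "none") == 0 &&
	    (!none_switch_enabled || strcmp(cipher, "none") != 0)) {
		fprintf(stderr, "the none MAC requires the NONE cipher switch\n");
		exit(1);
	}
}
```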

I have a new branch on my github called aj-extensions if you want to take a look.

I need to do some more testing to compare the impact of the buffer changes but I wanted to tell you what I'm seeing with your work.

Chris

rapier1 commented 3 years ago

@rapier1 As an aside to myself - if we are doing NONE we can probably skip the rekeying after max_packets. It literally doesn't make any sense. Shouldn't make a big difference in throughput but it's a useless operation.
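The reasoning: rekeying exists to limit how much data is sent under a single key, and with the NONE cipher there is no key to protect, so the limit can be pushed out arbitrarily. Something like this, with illustrative names only:

```c
#include <stdint.h>
#include <string.h>

/* Sketch: with the NONE cipher there is no key to wear out, so push the
 * rekey data limit as far out as practical. Names are illustrative. */
static uint64_t
rekey_data_limit(const char *cipher, uint64_t default_limit)
{
	if (strcmp(cipher, "none") == 0)
		return UINT64_MAX;
	return default_limit;
}
```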

rapier1 commented 3 years ago

@allanjude

I've done some more extensive testing on the changes you have proposed. I'm seeing some issues with using the 32k uniform buffers with a standard OpenSSH at higher bitrates, specifically when I am using the None cipher on its own or with the None MAC. I'm seeing a burst of really good throughput and then it bottoms out for seconds. I'm not entirely sure what's causing this, but my assumption is that the mismatch between the incoming datagrams and the receiver's buffers is causing it to drop packets all over the floor. Is there any way you could try to confirm this for me?

The buffer normalization really does help but if this is a real issue (and not just some madness on the part of my setup) I'm going to need to make this a negotiated size rather than a default.
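If it does turn out to be real, one plausible negotiation is for each side to advertise the largest chunk it is willing to receive and for the sender to use the smaller of the two, falling back to the old 16 KiB default for peers that never advertise. A rough sketch, not the actual negotiation code:

```c
#include <stdint.h>

#define BUF_16K (16 * 1024)

/* Sketch: negotiate the buffer size instead of assuming 32 KiB on both
 * ends. A peer that never advertises gets the old 16 KiB default. */
static uint32_t
negotiate_bufsize(uint32_t ours, uint32_t peers)
{
	if (peers == 0)
		return BUF_16K;
	return ours < peers ? ours : peers;
}
```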

The NoneMac is a win though. I'm working on rolling that into a new 8.4 release. I'm going to have to ensure that it can only be used in the context of the NoneSwitch though. However, for testing purposes it really helps me identify the overhead that the MAC imposes. One of the goals is to push the MAC process onto a different pipeline, so this will let me know if that's actually helping.

rapier1 commented 3 years ago

@rapier1,

More notes to myself. It's hard to actually disable rekeying entirely for the None cipher but I did decrease the frequency substantially. It's not making a huge difference in throughput (really within measurement error) but I'm rolling that out.

g3ntry commented 3 years ago

Which congestion control is in place for the tests? That could have this type of impact. I would recommend BBR over Cubic for all cases.

Regards, Tim


rapier1 commented 3 years ago

@g3ntry My default is BBR for the test bed I have running. I've also tried this with HTCP and Cubic just for the sake of completeness. What I'm seeing looks like some sort of pausing during the first 2 to 4 GB of data transferred. During longer transfers this pause averages out. However, during shorter transfers the impact of this is pretty clear.

For example, during a 100GB transfer I'm averaging 640MB/s. However, for a 3GB transfer the speeds range from 250MB/s to 500MB/s. I'm only seeing this when I'm sending to a buffer optimized sshd from a non-optimized client. This very well could be an issue with my setup. I'm going to be conducting more tests using some different hosts including some that I know will be resource constrained. Hopefully I'll find out the problem is all on my end.
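As an aside for anyone reproducing this: on Linux the congestion control can be pinned per socket with the standard TCP_CONGESTION option rather than changing the system-wide sysctl, which makes flipping between BBR, HTCP, and Cubic easy during testing. A minimal sketch (error handling omitted; the named algorithm must already be available in the kernel):

```c
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <string.h>
#include <sys/socket.h>

/* Select a congestion control algorithm (e.g. "bbr", "htcp", "cubic")
 * for one TCP socket instead of changing the global default. */
static int
set_congestion_control(int fd, const char *algo)
{
	return setsockopt(fd, IPPROTO_TCP, TCP_CONGESTION,
	    algo, strlen(algo));
}
```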

rapier1 commented 3 years ago

These are the results from a matrix of tests between different versions of hpnssh with the suggestions from @allanjude.

bufnone = buffer normalization with none MAC
buftest = buffer normalization
hpnssh = base hpnssh with default cipher
hpnsshnone = base hpnssh with none cipher
nonemac = none MAC (no buffer changes)
ssh = stock ssh

All values are in MB/s. Any entry with an 'x' indicates an incompatible set of options. All values are the average of 40 iterations of a 15GiB 'dd if=/dev/zero' piped to /dev/null via ssh. I'll be doing more statistical analysis soon (std deviation, mode, median, p value, etc).

Source is an Intel Xeon CPU X5675 @ 3.07GHz (6 cores, 12 threads). Sink is an Intel Core i7-2600K CPU @ 3.40GHz (4 cores, 8 threads). The test network is 10Gb DAC through a MikroTik 10Gb switch. 0.208ms avg RTT, 254KB BDP.

I will be rerunning the tests reversing the source and sink. Later tests will also include increasing the RTT and using a 6 core ARM system (max throughput of ~6Gbps).

            bufnone  buftest  hpnssh   hpnsshnone  nonemac  ssh
bufnone     890.225  x        x        x           859.75   x
buftest     648.05   646.85   630.875  624.675     624.55   x
hpnssh      338      337.225  330.275  331.875     330.2    233.025
hpnsshnone  617.875  614.2    599.125  600.7       592.65   x
nonemac     842.825  x        x        x           789.85   x
ssh         334.75   336.6    328.9    329.025     327.65   230.3

So the results look good at this point. Obviously the none MAC makes a big difference. The buffer normalization is also making a difference, around 5% in this test bed. Assuming the other tests don't show any major issues, I'll be incorporating them into 8.4 sometime next week.