rapier1 / hpn-ssh

HPN-SSH based on OpenSSH
https://psc.edu/hpn-ssh-home

None cipher not being used for socket-based tunnels? #80

Closed: justinclift closed this 2 months ago

justinclift commented 2 months ago

In my experiments using hpn-ssh with Proxmox, I have it successfully using the None cipher for standard ssh traffic between hosts. While that's happening, the hpn-ssh process uses about 40-50% of a CPU core (judging by htop).

Something interesting is showing up, though. When Proxmox migrates a virtual machine from one cluster node to another, it does so through ssh tunnels that use Unix domain sockets: first it transfers the disk snapshot data through one tunnel, then it copies the VM memory through another.

The command it uses (before my script changes it to hpn-ssh with the none cipher):

# /usr/bin/ssh -e none -o BatchMode=yes root@SERVER2 \
    -o ExitOnForwardFailure=yes \
    -L /run/qemu-server/100_nbd.migrate:/run/qemu-server/100_nbd.migrate \
    -L /run/qemu-server/100.migrate:/run/qemu-server/100.migrate \
     /usr/sbin/qm mtunnel

The two lines starting with -L create two ssh tunnels that use Unix domain sockets as their endpoints.

The interesting thing is that while the virtual machine migration is running, the hpn-ssh process jumps to 100% of a CPU core on both ends.

That's the same behaviour I saw before enabling the None cipher, so I'm wondering whether hpn-ssh isn't applying the None cipher to tunnels, or something along those lines?

To be clear, in the above instance the actual command being run is:

# hpnssh -oNoneEnabled=yes -oNoneSwitch=yes -e none -o BatchMode=yes root@SERVER2 \
    -o ExitOnForwardFailure=yes \
    -L /run/qemu-server/100_nbd.migrate:/run/qemu-server/100_nbd.migrate \
    -L /run/qemu-server/100.migrate:/run/qemu-server/100.migrate \
     /usr/sbin/qm mtunnel

Any ideas? :smile:

justinclift commented 2 months ago

By the way, to find the actual commands being run and to make sure hpn-ssh uses the None cipher, I replaced /usr/bin/ssh with this simple script:

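#!/usr/bin/env sh

# Log every invocation so the real command lines can be inspected later
echo "ssh called using: $0 $*" >> /root/ssh_calls.log

# Hand off to hpn-ssh with the None cipher switch enabled; "$@" keeps each argument intact
exec hpnssh -oNoneEnabled=yes -oNoneSwitch=yes "$@"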

That writes the actual commands being run (to /root/ssh_calls.log) so I can investigate things like the above.
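Once a few calls have been logged, the -L forwards can be pulled out of each line to see exactly which tunnels were requested. This is a minimal POSIX sh sketch, assuming the one-line-per-call log format written by the script above. `list_forwards` is a hypothetical helper, and it only handles the separate `-L spec` and joined `-Lspec` forms, not forwards set through `-o LocalForward=...` or an ssh config file.

```sh
#!/bin/sh
# list_forwards: print every -L forward spec found in one logged ssh command line.
# Hypothetical helper; assumes arguments in the log are separated by whitespace
# and that no single argument contains spaces.
list_forwards() {
    set -f                      # stop words like /run/*.migrate from being globbed
    prev=""
    for word in $1; do          # deliberate word splitting on whitespace
        case "$word" in
            -L) ;;              # the spec follows as the next word
            -L*) printf '%s\n' "${word#-L}" ;;   # joined form: -Lspec
            *) [ "$prev" = "-L" ] && printf '%s\n' "$word" ;;
        esac
        prev=$word
    done
    set +f                      # restore globbing
}
```

For example, `while IFS= read -r line; do list_forwards "$line"; done < /root/ssh_calls.log` prints one forward spec per line for every logged call.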

justinclift commented 2 months ago

As an aside, it turns out Proxmox already has an insecure migration mode, which can be enabled in one of its config files (/etc/pve/datacenter.cfg):

migration: type=insecure,network=1.2.3.0/24

With that enabled, transfers go much faster:

2024-04-20 13:37:47 migration active, transferred 88.8 GiB of 120.0 GiB VM-state, 1.8 GiB/s
2024-04-20 13:37:48 migration active, transferred 90.4 GiB of 120.0 GiB VM-state, 1.9 GiB/s
2024-04-20 13:37:49 migration active, transferred 91.9 GiB of 120.0 GiB VM-state, 2.0 GiB/s
2024-04-20 13:37:50 migration active, transferred 93.4 GiB of 120.0 GiB VM-state, 1.9 GiB/s
2024-04-20 13:37:51 migration active, transferred 95.0 GiB of 120.0 GiB VM-state, 2.0 GiB/s

Meanwhile, the kvm process on both the source and destination hosts uses 100% of a single CPU core. It's still not using all of the available network bandwidth, but it's a decent improvement.

Probably the only practical approach for improving upon that in a meaningful way would be to implement parallel streams for the transfer.

That kind of thing would be useful for secure (ssh) migrations as well, so it doesn't matter if a single tunnel is capped at (say) ~500MB/s. Just (heh) use more tunnels.
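As a rough illustration of the parallel-streams idea, the transfer would be cut into N byte ranges, with each range sent over its own tunnel. The sketch below only computes the ranges; `split_ranges` is a hypothetical helper, not something Proxmox or hpn-ssh provides, and the per-stream transport is left out entirely.

```sh
#!/bin/sh
# split_ranges TOTAL N: print "offset length" for up to N near-equal byte ranges
# covering TOTAL bytes. Hypothetical helper; N must be >= 1.
split_ranges() {
    total=$1
    n=$2
    chunk=$(( (total + n - 1) / n ))    # ceiling division so N ranges cover everything
    i=0
    while [ "$i" -lt "$n" ]; do
        off=$(( i * chunk ))
        [ "$off" -ge "$total" ] && break   # fewer ranges are needed when TOTAL is small
        len=$chunk
        if [ $(( off + len )) -gt "$total" ]; then
            len=$(( total - off ))         # the last range takes the remainder
        fi
        printf '%s %s\n' "$off" "$len"
        i=$(( i + 1 ))
    done
}
```

For the 120 GiB of VM state in the log above, `split_ranges $((120 * 1024 * 1024 * 1024)) 4` gives four 30 GiB ranges. Whether four tunnels at (say) ~500MB/s each would actually add up to ~2GB/s depends on CPU and network limits on both hosts.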

That's probably a large can of worms to open, though, and I'm not bothered enough to try implementing it myself in Proxmox. It seems to be written in Perl, which isn't my thing. :wink:

rapier1 commented 2 months ago

Hey, sorry about the delay in responding. The None cipher should only engage when no TTY is allocated. If you try to use the None cipher when a TTY is allocated, you should see the warning "NONE cipher switch disabled when a TTY is allocated". That warning goes to STDERR, so you might not see it if the tunnels suppress or redirect STDERR.
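One way to check which path was taken is to capture stderr from a manual run and look for the messages that `ssh_userauth2()` in `sshconnect2.c` prints (visible in the patch below). This is a minimal sketch: `none_status` and the capture file are hypothetical, and the strings are copied from that code, so they may differ between hpn-ssh versions.

```sh
#!/bin/sh
# none_status FILE: report whether captured hpn-ssh stderr shows the None
# cipher switch engaging or being refused because a TTY was allocated.
# Hypothetical helper; the grep strings mirror fprintf() messages in sshconnect2.c.
none_status() {
    if grep -q 'NONE cipher switch disabled when a TTY is allocated' "$1"; then
        echo "refused: tty allocated"
    elif grep -q 'WARNING: ENABLED NONE CIPHER' "$1"; then
        echo "enabled"
    else
        echo "no none-switch message found"
    fi
}
```

For example, run the tunnel command by hand with `2> /tmp/hpnssh.err` appended, then call `none_status /tmp/hpnssh.err`.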

We disallow the None cipher when a TTY is allocated because, even if the session isn't interactive, there are situations where you can still interact manually with the session. We really try to restrict the None cipher to bulk data transfers. If you want to override this, you can apply the following patch, which removes all of the guardrails:

diff --git a/sshconnect2.c b/sshconnect2.c
index 4ee23410c..cdf377ad4 100644
--- a/sshconnect2.c
+++ b/sshconnect2.c
@@ -496,30 +496,24 @@ ssh_userauth2(struct ssh *ssh, const char *local_user,
        char *myproposal[PROPOSAL_MAX];
        char *s = NULL;
        const char *none_cipher = "none";
-       if (!tty_flag) { /* no null on tty sessions */
-           debug("Requesting none rekeying...");
-           kex_proposal_populate_entries(ssh, myproposal, s, none_cipher,
-                             options.macs,
-                             compression_alg_list(options.compression),
-                             options.hostkeyalgorithms);
-           fprintf(stderr, "WARNING: ENABLED NONE CIPHER!!!\n");
-
-           /* NONEMAC can only be used in context of the NONE CIPHER */
-           if (options.nonemac_enabled == 1) {
-               const char *none_mac = "none";
-               kex_proposal_populate_entries(ssh, myproposal, s, none_cipher,
-                                 none_mac,
-                                 compression_alg_list(options.compression),
-                                 options.hostkeyalgorithms);
+       debug("Requesting none rekeying...");
+       kex_proposal_populate_entries(ssh, myproposal, s, none_cipher,
+                         options.macs,
+                         compression_alg_list(options.compression),
+                         options.hostkeyalgorithms);
+       fprintf(stderr, "WARNING: ENABLED NONE CIPHER!!!\n");
+
+       /* NONEMAC can only be used in context of the NONE CIPHER */
+       if (options.nonemac_enabled == 1) {
+         const char *none_mac = "none";
+         kex_proposal_populate_entries(ssh, myproposal, s, none_cipher,
+                       none_mac,
+                       compression_alg_list(options.compression),
+                       options.hostkeyalgorithms);
                fprintf(stderr, "WARNING: ENABLED NONE MAC\n");
-           }
-           kex_prop2buf(ssh->kex->my, myproposal);
-           packet_request_rekeying();
-       } else {
-           /* requested NONE cipher when in a tty */
-           debug("Cannot switch to NONE cipher with tty allocated");
-           fprintf(stderr, "NONE cipher switch disabled when a TTY is allocated\n");
        }
+       kex_prop2buf(ssh->kex->my, myproposal);
+       packet_request_rekeying();
    }

    if (ssh_packet_connection_is_on_socket(ssh)) {

With that change, the None cipher switches work in any scenario. I don't recommend it for anything other than poking around and your own development. If you can find a better test than looking for an active TTY, let me know and I'll consider it.

justinclift commented 2 months ago

Thanks heaps @rapier1. :smile:

As I'm just doing this in a local test lab for the moment (i.e., security isn't important), I'm using Proxmox's insecure setting instead. That uses direct TCP transfers for most of the traffic, avoiding the need for ssh.

When it comes time to deploy Proxmox in production, though, that's probably when I'll need to look at the hpn-ssh side of things again. For that environment, the None cipher pieces aren't really suitable. :smile: