rapier1 / hpn-ssh

HPN-SSH based on OpenSSH
https://psc.edu/hpn-ssh-home

pthreads segfault on RHEL 8.5 #36

Closed: somewhere-or-other closed this issue 1 year ago

somewhere-or-other commented 2 years ago

I'm building on a RHEL 8.5 image, and keep running into segfaults in the child process after a connection is made and authenticated. I'm not sure if the problem is yours, or something having changed with pthreads, etc. I thought I'd post about it here, and see what happens. If I'm doing something wrong, I'm happy to take feedback.

I've encountered this problem with the master branch (as of commit ebf1feed184a2388c4af376a19ec668090dcd187). Basically, when I launch the sshd daemon (/usr/local/openssh-hpn/master/sbin/sshd -ddd -p 2200 -f /etc/ssh/sshd_config, in this case), it runs and waits for the connection. When I connect from another host, it gets all the way through the authentication, and then the child process that it fork()ed off, segfaults (backtrace below), and the connection closes.

For reference, this is on RHEL 8.5, with GCC 8.5.0, glibc-2.28-164.el8. I manually ran the configure/make/make install, with the following syntax on the configure line:

./configure --prefix=/usr/local/openssh-hpn/master --sysconfdir=/etc/ssh/ --with-default-path=/usr/local/bin:/bin:/usr/bin --with-superuser-path=/usr/local/sbin:/usr/local/bin:/sbin:/bin:/usr/sbin:/usr/bin --with-md5-passwords --with-pam --with-privsep-path=/var/empty/sshd --with-libedit --with-xauth=/usr/bin/xauth --disable-strip

When I use gdb and the core file generated to get a backtrace, here's what I find:

(gdb) bt
#0  __pthread_cancel (th=0) at pthread_cancel.c:33
#1  0x0000561e20178d77 in stop_and_join_pregen_threads (c=c@entry=0x7f77a8ae3010) at cipher-ctr-mt.c:221
#2  0x0000561e20178e8e in ssh_aes_ctr_cleanup (ctx=0x561e21baf280) at cipher-ctr-mt.c:638
#3  0x00007f77b0cee534 in EVP_CIPHER_CTX_reset () from /lib64/libcrypto.so.1.1
#4  0x00007f77b0cee64d in EVP_CIPHER_CTX_free () from /lib64/libcrypto.so.1.1
#5  0x0000561e20178767 in cipher_init (ccp=ccp@entry=0x561e21b91858, cipher=0x561e2040b400 <ciphers+160>, 
    key=0x561e21b86b70 "\301\367\">e\255\273\235\353Q\363b@{,\314\020\314\303\020\365\231\357\324\364\351\036P\274\215\n}", keylen=16, 
    iv=0x561e21bc3e30 "\302\002F\330.>", ivlen=<optimized out>, do_encrypt=1) at cipher.c:357
#6  0x0000561e2017ffb8 in ssh_set_newkeys (ssh=ssh@entry=0x561e21b96540, mode=mode@entry=1) at packet.c:914
#7  0x0000561e201808ef in ssh_packet_send2_wrapped (ssh=ssh@entry=0x561e21b96540) at packet.c:1252
#8  0x0000561e20180988 in ssh_packet_send2 (ssh=0x561e21b96540) at packet.c:1319
#9  0x0000561e2018213b in sshpkt_send (ssh=ssh@entry=0x561e21b96540) at packet.c:2741
#10 0x0000561e20197970 in kex_send_newkeys (ssh=ssh@entry=0x561e21b96540) at kex.c:460
#11 0x0000561e2019ad0c in input_kex_gen_init (type=<optimized out>, seq=<optimized out>, ssh=0x561e21b96540) at kexgen.c:337
#12 0x0000561e2018928a in ssh_dispatch_run (ssh=ssh@entry=0x561e21b96540, mode=mode@entry=1, done=done@entry=0x0) at dispatch.c:113
#13 0x0000561e20189359 in ssh_dispatch_run_fatal (ssh=ssh@entry=0x561e21b96540, mode=mode@entry=1, done=done@entry=0x0) at dispatch.c:133
#14 0x0000561e20136d1f in process_buffered_input_packets (ssh=0x561e21b96540) at serverloop.c:365
#15 server_loop2 (ssh=ssh@entry=0x561e21b96540, authctxt=authctxt@entry=0x561e21b98090) at serverloop.c:365
#16 0x0000561e2014106f in do_authenticated2 (authctxt=0x561e21b98090, ssh=0x561e21b96540) at session.c:2642
#17 do_authenticated (ssh=0x561e21b96540, authctxt=0x561e21b98090) at session.c:365
#18 0x0000561e20127ac1 in main (ac=<optimized out>, av=<optimized out>) at sshd.c:2343
(gdb)

If there are further debugging steps I can take to help isolate this problem, please let me know. I may be more of a sysadmin than a developer, but I'll do my best to follow instructions.

Lloyd

rapier1 commented 2 years ago

Lloyd,

I'm grabbing a copy of RHEL 8.5 now. Once I get it set up I'll try to recreate the problem. This is happening somewhere in the multithreaded aes-ctr cipher which is annoying as I've done a lot of work on that lately. As I get more information I'll update you.
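
For readers following the backtrace: frame #0 shows pthread_cancel being called with th=0 from stop_and_join_pregen_threads, which suggests the cleanup ran on a context whose threads had never been started. The sketch below is not the code in cipher-ctr-mt.c; it is a minimal illustration, with hypothetical names (struct worker, start_workers, stop_and_join_workers), of recording whether each thread was actually created so that cleanup never cancels an unset handle.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>
#include <unistd.h>

/* Hypothetical worker slot; 'started' records whether pthread_create succeeded. */
struct worker {
    pthread_t tid;
    bool started;
};

/* Placeholder thread body: sleep() is a cancellation point, so
 * pthread_cancel() can take effect while the thread is idle. */
static void *worker_main(void *arg)
{
    (void)arg;
    for (;;)
        sleep(1);
    return NULL;
}

/* Try to start n workers; returns how many were actually created. */
static size_t start_workers(struct worker *w, size_t n)
{
    size_t started = 0;

    assert(w != NULL);
    for (size_t i = 0; i < n; i++) {
        w[i].started = (pthread_create(&w[i].tid, NULL, worker_main, NULL) == 0);
        if (w[i].started)
            started++;
    }
    return started;
}

/* Cancel and join only the workers that were started. Calling this on a
 * zero-initialized array is a no-op instead of a pthread_cancel(0).
 * Returns how many threads were joined. */
static size_t stop_and_join_workers(struct worker *w, size_t n)
{
    size_t joined = 0;

    assert(w != NULL);
    for (size_t i = 0; i < n; i++) {
        if (!w[i].started)
            continue;
        pthread_cancel(w[i].tid);
        pthread_join(w[i].tid, NULL);
        w[i].started = false;
        joined++;
    }
    return joined;
}
```

With this pattern, a cleanup callback that fires before initialization (for example on a freshly allocated, zeroed context) simply finds nothing to stop. Build with -pthread on older glibc versions.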

Chris

rapier1 commented 2 years ago

Lloyd,

I just got RHEL 8.5 running on a VM. This is fresh out of the box with only the updates applied and the necessary libraries (getting libedit-devel was annoying though). I built it with the configuration you gave me. The only thing I did differently from you was to run autoconf before ./configure.

I wasn't able to recreate the problem. I tried a few different configurations, settings, and ciphers and everything was working as expected. Did you make any other changes?

somewhere-or-other commented 2 years ago

This is an NFS-rooted image for deployment on a large HPC cluster. There have been several things that I've had to customize, but I can't think of anything in particular that would affect this. Would it make sense to compare version numbers of specific packages? I'm not sure which would be the most relevant, but I'm happy to try that.

I did do a bunch of aclocal/autoconf/automake/etc as well. Sorry I didn't document that. I guess I assumed it went without saying.

I'm re-cloning again from scratch, to see if there's anything I accidentally did in the repository that might've had an effect. I tried building based on at least 2 other git tags before using the master branch, so it's possible there was something residual. I'll get back here shortly with the result.

somewhere-or-other commented 2 years ago

Hmm. Unfortunately I'm getting the same result, after using this newly-cloned copy of the repository:

git clone https://github.com/rapier1/openssh-portable.git openssh-hpn-2
cd openssh-hpn-2/
aclocal
autoheader
autoconf
./configure --prefix=/usr/local/openssh-hpn/master --sysconfdir=/etc/ssh/ --with-default-path=/usr/local/bin:/bin:/usr/bin --with-superuser-path=/usr/local/sbin:/usr/local/bin:/sbin:/bin:/usr/sbin:/usr/bin --with-md5-passwords --with-pam --with-privsep-path=/var/empty/sshd --with-libedit --with-xauth=/usr/bin/xauth --disable-strip
make
make install

It's a long shot, but could it be affected by FIPS mode? I do have fips=1 on my kernel command-line. I wouldn't think it's relevant, now that OpenSSH uses OpenSSL for its crypto, but it's possible. I'd think we'd have other symptoms (e.g. kernel errors complaining about invalid/unapproved algorithms), if that were the problem.

There could certainly be others, but here are the versions of all the packages that provide any of the paths in the output of ldd. Can you think of anything else that might be worth comparing?

# for i in `ldd /usr/local/openssh-hpn/master/sbin/sshd | awk '{print $3}'`; do rpm -q --whatprovides "$i"; done | sort -u
audit-libs-3.0-0.17.20191104git1c2f876.el8.x86_64
glibc-2.28-164.el8.x86_64
libcap-ng-0.7.11-1.el8.x86_64
libxcrypt-4.1.1-6.el8.x86_64
openssl-libs-1.1.1k-5.el8_5.x86_64
pam-1.3.1-15.el8.x86_64
zlib-1.2.11-17.el8.x86_64
#

somewhere-or-other commented 2 years ago

After I rebooted the node without the fips=1 I no longer see the problem occurring. I'm able to log in normally.
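
For anyone wanting to confirm which state a node is in before testing: on Linux the kernel reports its FIPS state in /proc/sys/crypto/fips_enabled, which reads 1 when the kernel was booted in FIPS mode. The sketch below is a small, assumed helper for reading that flag (the function names fips_flag_from_file and kernel_fips_enabled are made up for illustration and are not part of HPN-SSH). It reflects only the kernel flag, not whether OpenSSL itself is operating in FIPS mode.

```c
#include <assert.h>
#include <stdio.h>

/* Read a one-character flag file.
 * Returns 1 if the file starts with '1', 0 if it starts with anything
 * else, and -1 if the file is missing, unreadable, or empty. */
static int fips_flag_from_file(const char *path)
{
    FILE *f;
    int c;

    assert(path != NULL);
    f = fopen(path, "r");
    if (f == NULL)
        return -1;
    c = fgetc(f);
    fclose(f);
    if (c == EOF)
        return -1;
    return c == '1' ? 1 : 0;
}

/* Kernel FIPS flag on Linux; -1 means the flag file is not available. */
static int kernel_fips_enabled(void)
{
    return fips_flag_from_file("/proc/sys/crypto/fips_enabled");
}
```

Printing the result of kernel_fips_enabled() alongside the sshd debug output makes it easier to tell at a glance whether a given crash report came from a FIPS-enabled boot.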

I'm going to keep testing, and see if I can figure out anything further about what's going on. For reference, this page (https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux/8/html/security_hardening/assembly_installing-a-rhel-8-system-with-fips-mode-enabled_security-hardening) is RH's official documentation about how to enable FIPS mode, in case you want to verify my findings.

I know that with RHEL7, which shipped OpenSSH 7.4p1, OpenSSH was included in the list of packages that had to be certified for FIPS mode compliance, but with RHEL8, which shipped OpenSSH 8.0p1, it was no longer included (https://www.redhat.com/en/blog/how-rhel-8-designed-fips-140-2-requirements). I had heard that OpenSSH had started using OpenSSL libs exclusively for its crypto setup, which would explain the change between RHEL7 and RHEL8. I had assumed that would still be true with your HPN-modified code, as long as it was based on something >= OpenSSH v 8.0, but perhaps that isn't a correct assumption.

I'm not suggesting that you necessarily need to fix this, or anything. Just trying to understand the situation, and what the limitations are. Deciding to explicitly not support FIPS mode is a totally understandable response.

Lloyd

rapier1 commented 2 years ago

I'll take a look at that. I haven't even considered what is going on with FIPS so it could be a problem. I'm not opposed to supporting FIPS but I'll need to learn more about it.

The problem could be that the multithreaded aes-ctr mode does use OpenSSL to generate the keystream but XORing the data happens outside of OpenSSL. So it could be an issue there or it could be an issue with how I'm handling the threads.
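
To make the split described above concrete: in CTR mode the cipher only produces a keystream, and encryption or decryption is a byte-wise XOR of that keystream with the data. The sketch below shows just that XOR step in isolation. It is an illustration under assumptions, not the HPN-SSH implementation: the function name xor_with_keystream is hypothetical, and the keystream is assumed to have been generated elsewhere (by OpenSSL in the real code).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* XOR len bytes of src with a pregenerated keystream into dst.
 * Applying the same keystream twice restores the original data,
 * so the same routine serves for encryption and decryption. */
static void xor_with_keystream(uint8_t *dst, const uint8_t *src,
                               const uint8_t *keystream, size_t len)
{
    assert(dst != NULL && src != NULL && keystream != NULL);
    for (size_t i = 0; i < len; i++)
        dst[i] = src[i] ^ keystream[i];
}
```

Because this step never calls into OpenSSL, it is the part of the data path that the library's own FIPS handling does not cover.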

I'm probably not going to get a chance to look at this in the next couple of days but I do want to figure out what is going on. So please let me know if you find out anything else. I'll also be keeping this ticket open until I either make a fix or explicitly decide against it.

Chris

rapier1 commented 2 years ago

By the way, I did confirm that it is because of FIPS and something to do with how the multithreaded cipher is interacting with it. In the meantime, you can disable the multithreaded version by using -oDisableMTAES=yes when starting the server (or by setting it in sshd_config). You'll also need to disable it in the client: it's the same option, but you'd add it to the system ssh_config file.

I'm curious as to what's happening here and I will work on it in the next few days. I need to finish up some packaging for Ubuntu and Fedora first.

Chris

somewhere-or-other commented 2 years ago

Chris,

Thank you. I can confirm with FIPS mode on, launching using the syntax below, that I can connect successfully with a non-HPN client, which I was not able to do before.

/usr/local/openssh-hpn/master/sbin/sshd -oDisableMTAES=yes -ddd -p 2200 -f /etc/ssh/sshd_config

That will probably be an acceptable workaround for my purposes for now, though I am also curious what happens with your further investigations. But I totally understand about the uncertain timeline.

Lloyd

klardotsh commented 4 months ago

I am getting what appears to be this same issue on RHEL8 (OpenSSL 1.1.1k, no ability to pull OpenSSL3) FIPS boxes using the latest HPNSSH 18.4.1. Here's my findings in general:

At this point, I'm a bit lost where to continue looking or how to resolve this one (as are various teammates who've been helping debug this), and so I'd like to reopen this issue thread for some advice/pointers, and to try to help contribute to a fix if I can. Thanks! Below is a stack trace from GDB if it helps.

Program received signal SIGSEGV, Segmentation fault.
0x00007ffff710db54 in pthread_cancel () from /lib64/libpthread.so.0

(gdb) bt
#0  0x00007ffff710db54 in pthread_cancel () from /lib64/libpthread.so.0
#1  0x00005555555a73d5 in stop_and_join_pregen_threads ()
#2  0x00005555555a77de in ssh_aes_ctr_cleanup ()
#3  0x00007ffff6b2d534 in EVP_CIPHER_CTX_reset () from /lib64/libcrypto.so.1.1
#4  0x00007ffff6b2d64d in EVP_CIPHER_CTX_free () from /lib64/libcrypto.so.1.1
#5  0x00005555555a6dba in cipher_init ()
#6  0x00005555555ae9b3 in ssh_set_newkeys ()
#7  0x00005555555b0d89 in ssh_packet_send2_wrapped ()
#8  0x00005555555b0e78 in ssh_packet_send2 ()
#9  0x00005555555d0a00 in kex_send_newkeys ()
#10 0x00005555555d4692 in input_kex_gen_reply ()
#11 0x00005555555b763a in ssh_dispatch_run ()
#12 0x00005555555b7709 in ssh_dispatch_run_fatal ()
#13 0x00005555555776a2 in client_loop ()
#14 0x000055555556460b in main ()
(gdb)

And for version info:

OpenSSH_9.7p1-hpn18.4.1, OpenSSL 1.1.1k  FIPS 25 Mar 2021

rapier1 commented 4 months ago

So I forgot about FIPS puking on the multithreaded AES. I've accepted your PR and it's moving through the process of making it into master. Have you seen any problems with the default chacha20 cipher? Just curious as that's threaded as well.