nodejs / node

Node.js JavaScript runtime ✨🐢🚀✨
https://nodejs.org
Other
106.95k stars 29.18k forks source link

ALPN callback function sometimes leads to segfault in node.js >= 18.13.0 #47207

Closed gugu closed 1 year ago

gugu commented 1 year ago

Version

18.15.0

Platform

Linux 2a53a1799e0b 5.15.0-67-generic #74-Ubuntu SMP Wed Feb 22 14:14:39 UTC 2023 x86_64 GNU/Linux

Subsystem

tls

What steps will reproduce the bug?

I see this from logs of the my node.js server. I did not find a way to reproduce yet, and need some help with that. I've attached stacktrace, but did not yet found ALPN header value, which causes this error. According to my investigations, function SelectALPNCallback calls SSL_select_next_proto with NULL instead of correct pointer. Looks like some maliciously crafted ALPN header can lead to such error

How often does it reproduce? Is there a required condition?

It is a rare case I capture from logs (around 1 req/million). I can add some code to get more information about the bug, but don't know what to do

What is the expected behavior? Why is that the expected behavior?

Do not produce segfault

What do you see instead?

Segfoult with stacktrace:

PID 9 received SIGSEGV for address: 0x0
/srv/shorturl_redirector/node_modules/segfault-handler/build/Release/segfault-handler.node(+0x3236)[0x7f86dc0d0236]
/lib/x86_64-linux-gnu/libpthread.so.0(+0x13140)[0x7f86dedcf140]
node(SSL_select_next_proto+0x4c)[0x17f0bfc]
node[0xd243e0]
node(tls_handle_alpn+0x53)[0x1834f83]
node(tls_parse_all_extensions+0x143)[0x18176f3]
node(tls_post_process_client_hello+0x70)[0x1835200]
node[0x1822b74]
node(ssl3_read_bytes+0x320)[0x1811050]
node(ssl3_read+0x60)[0x17e0c40]
node(SSL_read+0x87)[0x17ee3f7]
node(_ZN4node6crypto7TLSWrap8ClearOutEv+0x77)[0xd2c8f7]
node(_ZN4node6crypto7TLSWrap12OnStreamReadElRK8uv_buf_t+0xf8)[0xd2d5f8]
node(_ZN4node15LibuvStreamWrap8OnUvReadElPK8uv_buf_t+0x89)[0xc6f959]
node[0xc6fd68]
node[0x1676f67]
node[0x1677790]
node[0x167d534]
node(uv_run+0x14e)[0x166b95e]
node(_ZN4node13SpinEventLoopEPNS_11EnvironmentE+0x14d)[0xabda2d]
node(_ZN4node16NodeMainInstance3RunEv+0xf4)[0xbc1874]
node(_ZN4node22LoadSnapshotDataAndRunEPPKNS_12SnapshotDataEPKNS_20InitializationResultE+0xb4)[0xb36434]
node(_ZN4node5StartEiPPc+0x2df)[0xb3a02f]
/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xea)[0x7f86dec0ad0a]
node(_start+0x2e)[0xaba37e]
Segmentation fault (core dumped)

Additional information

No response

gugu commented 1 year ago

And another one few second ago:

PID 8 received SIGSEGV for address: 0x400000001081
/srv/shorturl_redirector/node_modules/segfault-handler/build/Release/segfault-handler.node(+0x3236)[0x7fbc68045236]
/lib/x86_64-linux-gnu/libpthread.so.0(+0x13140)[0x7fbc6ad44140]
node(SSL_select_next_proto+0x4c)[0x17f0bfc]
node[0xd243e0]
node(tls_handle_alpn+0x53)[0x1834f83]
node(tls_parse_all_extensions+0x143)[0x18176f3]
node(tls_post_process_client_hello+0x70)[0x1835200]
node[0x1822b74]
node(ssl3_read_bytes+0x320)[0x1811050]
node(ssl3_read+0x60)[0x17e0c40]
node(SSL_read+0x87)[0x17ee3f7]
node(_ZN4node6crypto7TLSWrap8ClearOutEv+0x77)[0xd2c8f7]
node(_ZN4node6crypto7TLSWrap12OnStreamReadElRK8uv_buf_t+0xf8)[0xd2d5f8]
node(_ZN4node15LibuvStreamWrap8OnUvReadElPK8uv_buf_t+0x89)[0xc6f959]
node[0xc6fd68]
node[0x1676f67]
node[0x1677790]
node[0x167d534]
node(uv_run+0x14e)[0x166b95e]
node(_ZN4node13SpinEventLoopEPNS_11EnvironmentE+0x14d)[0xabda2d]
node(_ZN4node16NodeMainInstance3RunEv+0xf4)[0xbc1874]
node(_ZN4node22LoadSnapshotDataAndRunEPPKNS_12SnapshotDataEPKNS_20InitializationResultE+0xb4)[0xb36434]
node(_ZN4node5StartEiPPc+0x2df)[0xb3a02f]
/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xea)[0x7fbc6ab7fd0a]
node(_start+0x2e)[0xaba37e]
Segmentation fault (core dumped)
gugu commented 1 year ago

It's 2AM and I finally extracted core dump from kubernetes. Does not tell much because I need debug symbols for node 18.15.0

(gdb) bt
#0  0x00000000017f0bfc in SSL_select_next_proto ()                                                                                                                                           
#1  0x0000000000d243e0 in node::crypto::(anonymous namespace)::SelectALPNCallback(ssl_st*, unsigned char const**, unsigned char*, unsigned char const*, unsigned int, void*) ()              
#2  0x0000000001834f83 in tls_handle_alpn ()                                                                                                                                                 
#3  0x00000000018176f3 in tls_parse_all_extensions ()                                                                                                                                        
#4  0x0000000001835200 in tls_post_process_client_hello ()                                                                                                                                   
#5  0x0000000001822b74 in state_machine.part ()
#6  0x0000000001811050 in ssl3_read_bytes ()
#7  0x00000000017e0c40 in ssl3_read ()
#8  0x00000000017ee3f7 in SSL_read ()
#9  0x0000000000d2c8f7 in node::crypto::TLSWrap::ClearOut() ()
#10 0x0000000000d2d5f8 in node::crypto::TLSWrap::OnStreamRead(long, uv_buf_t const&) ()
#11 0x0000000000c6f959 in node::LibuvStreamWrap::OnUvRead(long, uv_buf_t const*) ()
#12 0x0000000000c6fd68 in node::LibuvStreamWrap::ReadStart()::{lambda(uv_stream_s*, long, uv_buf_t const*)#2}::_FUN(uv_stream_s*, long, uv_buf_t const*) ()
#13 0x0000000001676f67 in uv__read (stream=stream@entry=0x7f4994e92e90) at ../deps/uv/src/unix/stream.c:1201
#14 0x0000000001677790 in uv__stream_io (loop=<optimized out>, w=0x7f4994e92f18, events=1) at ../deps/uv/src/unix/stream.c:1270
#15 0x000000000167d534 in uv__io_poll (loop=loop@entry=0x526dfc0 <default_loop_struct>, timeout=<optimized out>) at ../deps/uv/src/unix/epoll.c:374
#16 0x000000000166b95e in uv_run (loop=0x526dfc0 <default_loop_struct>, mode=UV_RUN_DEFAULT) at ../deps/uv/src/unix/core.c:406
#17 0x0000000000abda2d in node::SpinEventLoop(node::Environment*) ()
#18 0x0000000000bc1874 in node::NodeMainInstance::Run() ()
#19 0x0000000000b36434 in node::LoadSnapshotDataAndRun(node::SnapshotData const**, node::InitializationResult const*) ()
#20 0x0000000000b3a02f in node::Start(int, char**) ()
#21 0x00007f4ee8745d0a in __libc_start_main (main=0xab1f70 <main>, argc=3, argv=0x7ffdb3a07ad8, init=<optimized out>, fini=<optimized out>, rtld_fini=<optimized out>, 

and

(gdb) info frame
Stack level 1, frame at 0x7ffdb39ffe50:
 rip = 0xd243e0 in node::crypto::(anonymous namespace)::SelectALPNCallback(ssl_st*, unsigned char const**, unsigned char*, unsigned char const*, unsigned int, void*); saved rip = 0x1834f83
 called by frame at 0x7ffdb39ffe90, caller of frame at 0x7ffdb39ffe40
 Arglist at 0x7ffdb39ffe38, args: 
 Locals at 0x7ffdb39ffe38, Previous frame's sp is 0x7ffdb39ffe50
 Saved registers:
  rbp at 0x7ffdb39ffe40, rip at 0x7ffdb39ffe48
bnoordhuis commented 1 year ago

A disassemble and info registers of the top frame might be enough for me to tell what's going on, and maybe a hex dump of the top 32 or 64 bytes of the stack.

gugu commented 1 year ago
Dump of assembler code for function SSL_select_next_proto:
   0x00000000017f0bb0 <+0>:     push   %rbp
   0x00000000017f0bb1 <+1>:     mov    %rsp,%rbp
   0x00000000017f0bb4 <+4>:     push   %r15
   0x00000000017f0bb6 <+6>:     mov    %r8,%r15
   0x00000000017f0bb9 <+9>:     push   %r14
   0x00000000017f0bbb <+11>:    mov    %r9d,%r14d
   0x00000000017f0bbe <+14>:    push   %r13
   0x00000000017f0bc0 <+16>:    push   %r12
   0x00000000017f0bc2 <+18>:    push   %rbx
   0x00000000017f0bc3 <+19>:    sub    $0x48,%rsp
   0x00000000017f0bc7 <+23>:    mov    %rdi,-0x60(%rbp)
   0x00000000017f0bcb <+27>:    mov    %rsi,-0x68(%rbp)
   0x00000000017f0bcf <+31>:    mov    %rdx,-0x58(%rbp)
   0x00000000017f0bd3 <+35>:    mov    %ecx,-0x48(%rbp)
   0x00000000017f0bd6 <+38>:    movl   $0x0,-0x44(%rbp)
   0x00000000017f0bdd <+45>:    test   %ecx,%ecx
   0x00000000017f0bdf <+47>:    je     0x17f0c90 <SSL_select_next_proto+224>
   0x00000000017f0be5 <+53>:    mov    -0x44(%rbp),%eax
   0x00000000017f0be8 <+56>:    mov    -0x58(%rbp),%rdi
   0x00000000017f0bec <+60>:    xor    %r12d,%r12d
   0x00000000017f0bef <+63>:    mov    %rax,%rcx
   0x00000000017f0bf2 <+66>:    add    %rdi,%rax
   0x00000000017f0bf5 <+69>:    mov    %rax,-0x50(%rbp)
   0x00000000017f0bf9 <+73>:    lea    0x1(%rcx),%ebx
=> 0x00000000017f0bfc <+76>:    movzbl (%rax),%eax
   0x00000000017f0bff <+79>:    add    %rdi,%rbx
   0x00000000017f0c02 <+82>:    mov    %rax,%r13
   0x00000000017f0c05 <+85>:    mov    %rax,-0x38(%rbp)
   0x00000000017f0c09 <+89>:    mov    %rbx,-0x40(%rbp)
   0x00000000017f0c0d <+93>:    test   %r14d,%r14d
   0x00000000017f0c10 <+96>:    jne    0x17f0c22 <SSL_select_next_proto+114>
   0x00000000017f0c12 <+98>:    jmp    0x17f0c78 <SSL_select_next_proto+200>
   0x00000000017f0c14 <+100>:   nopl   0x0(%rax)
   0x00000000017f0c18 <+104>:   lea    0x1(%r12,%rbx,1),%r12d
   0x00000000017f0c1d <+109>:   cmp    %r12d,%r14d
   0x00000000017f0c20 <+112>:   jbe    0x17f0c78 <SSL_select_next_proto+200>
   0x00000000017f0c22 <+114>:   mov    %r12d,%eax
   0x00000000017f0c25 <+117>:   movzbl (%r15,%rax,1),%ebx
   0x00000000017f0c2a <+122>:   cmp    %r13b,%bl
   0x00000000017f0c2d <+125>:   jne    0x17f0c18 <SSL_select_next_proto+104>
   0x00000000017f0c2f <+127>:   lea    0x1(%r12),%esi
   0x00000000017f0c34 <+132>:   mov    -0x38(%rbp),%rdx
   0x00000000017f0c38 <+136>:   mov    -0x40(%rbp),%rdi
   0x00000000017f0c3c <+140>:   add    %r15,%rsi
   0x00000000017f0c3f <+143>:   call   0xa84da0 <memcmp@plt>
   0x00000000017f0c44 <+148>:   test   %eax,%eax
(gdb) info registers
rax            0x75684f4f7156766e  8460099102637979246
rbx            0x1                 1
rcx            0x0                 0
rdx            0x75684f4f7156766e  8460099102637979246
rsi            0x7ffdb39ffe57      140727617060439
rdi            0x75684f4f7156766e  8460099102637979246
rbp            0x7ffdb39ffe30      0x7ffdb39ffe30
rsp            0x7ffdb39ffdc0      0x7ffdb39ffdc0
r8             0x7f499574c150      139954016797008
r9             0xc                 12
r10            0x75684f4f7156766e  8460099102637979246
r11            0x27ab84dcf749      43617621964617
r12            0x0                 0
r13            0x80                128
r14            0xc                 12
r15            0x7f499574c150      139954016797008
rip            0x17f0bfc           0x17f0bfc <SSL_select_next_proto+76>
eflags         0x10202             [ IF RF ]
cs             0x33                51
ss             0x2b                43
ds             0x0                 0
es             0x0                 0
fs             0x0                 0
gs             0x0                 0
k0             0xfefef000          4278120448
k1             0xfc000000          4227858432
k2             0xffffffff          4294967295
k3             0xffffffff          4294967295
k4             0xffffffff          4294967295
k5             0xffffffff          4294967295
k6             0xffffffff          4294967295
k7             0x7                 7
gugu commented 1 year ago

ALPN callback

(gdb) info frame
Stack level 1, frame at 0x7ffdb39ffe50:
 rip = 0xd243e0 in node::crypto::(anonymous namespace)::SelectALPNCallback(ssl_st*, unsigned char const**, unsigned char*, unsigned char const*, unsigned int, void*); saved rip = 0x1834f83
 called by frame at 0x7ffdb39ffe90, caller of frame at 0x7ffdb39ffe40
 Arglist at 0x7ffdb39ffe38, args: 
 Locals at 0x7ffdb39ffe38, Previous frame's sp is 0x7ffdb39ffe50
 Saved registers:
  rbp at 0x7ffdb39ffe40, rip at 0x7ffdb39ffe48
(gdb) x/8xw 0x7ffdb39ffe38
0x7ffdb39ffe38: 0x00d243e0      0x00000000      0xb39ffe80      0x00007ffd
0x7ffdb39ffe48: 0x01834f83      0x00000000      0x0000001c      0x00000000
(gdb) 
gugu commented 1 year ago

OpenSSL:

(gdb) info frame
Stack level 0, frame at 0x7ffdb39ffe40:
 rip = 0x17f0bfc in SSL_select_next_proto; saved rip = 0xd243e0
 called by frame at 0x7ffdb39ffe50
 Arglist at 0x7ffdb39ffe30, args: 
 Locals at 0x7ffdb39ffe30, Previous frame's sp is 0x7ffdb39ffe40
 Saved registers:
  rbx at 0x7ffdb39ffe08, rbp at 0x7ffdb39ffe30, r12 at 0x7ffdb39ffe10, r13 at 0x7ffdb39ffe18, r14 at 0x7ffdb39ffe20, r15 at 0x7ffdb39ffe28, rip at 0x7ffdb39ffe38
(gdb) x/16wx 0x7ffdb39ffe30
0x7ffdb39ffe30: 0xb39ffe40      0x00007ffd      0x00d243e0      0x00000000
0x7ffdb39ffe40: 0xb39ffe80      0x00007ffd      0x01834f83      0x00000000
0x7ffdb39ffe50: 0x0000001c      0x00000000      0x00000000      0x00000000
0x7ffdb39ffe60: 0x95ca0898      0x00007f49      0x95ca09d8      0x00007f49
bnoordhuis commented 1 year ago

I can't be 100% sure because the data is buried deeper in the stack than I anticipated (isn't it always like that?) but it looks like the ALPN string from the ClientHello packet is zero-sized.

SSL_select_next_proto() selects the first item from the client's string if there's no matching entry in the server's string but yeah, that won't work if the client's string is empty.

Do you have the opportunity to try out a patch locally? Does this one-liner fix it?

diff --git a/src/crypto/crypto_tls.cc b/src/crypto/crypto_tls.cc
index f14adec767a..fea3186c17e 100644
--- a/src/crypto/crypto_tls.cc
+++ b/src/crypto/crypto_tls.cc
@@ -225,6 +225,8 @@ int SelectALPNCallback(
     const unsigned char* in,
     unsigned int inlen,
     void* arg) {
+  if (inlen == 0) return SSL_TLSEXT_ERR_ALERT_FATAL;
+
   TLSWrap* w = static_cast<TLSWrap*>(arg);
   const std::vector<unsigned char>& alpn_protos = w->alpn_protos_;
gugu commented 1 year ago

No, but I'll try to patch openssl locally to try to reproduce this error on my dev end. If it is true it looks like a possibility DoS attack on node.js

gugu commented 1 year ago

Found another one, tls client:

> s = tls.connect('short.io', 443, {ALPNProtocols: [""]})
node[138836]: ../src/crypto/crypto_tls.cc:1533:static void node::crypto::TLSWrap::SetALPNProtocols(const v8::FunctionCallbackInfo<v8::Value>&): Assertion `SetALPN(w->ssl_, args[0])' failed.
 1: 0xb06730 node::Abort() [node]
 2: 0xb067ae  [node]
 3: 0xca27ea node::crypto::TLSWrap::SetALPNProtocols(v8::FunctionCallbackInfo<v8::Value> const&) [node]
 4: 0xd3e33e  [node]
 5: 0xd3f75f v8::internal::Builtin_HandleApiCall(int, unsigned long*, v8::internal::Isolate*) [node]
 6: 0x15da1b9  [node]
gugu commented 1 year ago

I can't be 100% sure because the data is buried deeper in the stack than I anticipated (isn't it always like that?) but it looks like the ALPN string from the ClientHello packet is zero-sized.

SSL_select_next_proto() selects the first item from the client's string if there's no matching entry in the server's string but yeah, that won't work if the client's string is empty.

Do you have the opportunity to try out a patch locally? Does this one-liner fix it?

diff --git a/src/crypto/crypto_tls.cc b/src/crypto/crypto_tls.cc
index f14adec767a..fea3186c17e 100644
--- a/src/crypto/crypto_tls.cc
+++ b/src/crypto/crypto_tls.cc
@@ -225,6 +225,8 @@ int SelectALPNCallback(
     const unsigned char* in,
     unsigned int inlen,
     void* arg) {
+  if (inlen == 0) return SSL_TLSEXT_ERR_ALERT_FATAL;
+
   TLSWrap* w = static_cast<TLSWrap*>(arg);
   const std::vector<unsigned char>& alpn_protos = w->alpn_protos_;

inlen equal to zero does not cause any issues, the problem is somewhere in out variable I think. Instead of prod I'm trying to reproduce it in this simple program:

#include <iostream>
#include <openssl/ssl.h>
using namespace std;
int main() {
        unsigned char *out;
        unsigned char outlen;
        const char *server = "\x08http/1.1";
        const char *client = "\x08http/1.1";
        cout << strlen(server) << endl;
        int status = SSL_select_next_proto(&out, &outlen, (const unsigned char*)server, strlen(server), (const unsigned char*)client, strlen(client));
        cout << status << endl;
}

UPD: Setting ALPNProtocols: [] in createServer solves the issue for me

bnoordhuis commented 1 year ago

UPD: Setting ALPNProtocols: [] in createServer solves the issue for me

What was it set to before? Which createServer method are you using (net/http/https/http2)?

gugu commented 1 year ago

it is https.createServer(), default value is ['http/1.1']

bnoordhuis commented 1 year ago

FWIW, I've not been able to reproduce the crash (so far at least.)

I tried crafting a ClientHello with an empty ALPN extension record but I get back a handshake_failure alert and the connection is subsequently closed. No crash.

tniessen commented 1 year ago

I tried various combinations of invalid ALPN extensions last week and also wasn't able to reproduce a crash on the server side. Is there any chance you might be able to share a traffic dump of just the TLS handshake that causes the server to crash @gugu?

gugu commented 1 year ago

It required me to deploy crashing version of the app to production. If I can get debug symbols of the version released on nodejs.org - I can get much more data. But if it is not possible - then I'll deploy. One more thing from the available debug symbols of libuv:

#13 0x0000000001676f67 in uv__read (stream=stream@entry=0x7f4994e92e90) at ../deps/uv/src/unix/stream.c:1201

(gdb) info locals
buflen = 65536
buf = {base = 0x7f499466abb0 "\026\003\001\002", len = 65536}
nread = <optimized out>
msg = {msg_name = 0x5c11278, msg_namelen = 96539248, msg_iov = 0x0, msg_iovlen = 0, msg_control = 0x0, msg_controllen = 0, msg_flags = 0}
cmsg_space = '\000' <repeats 40 times>, "\210\022\301\005\000\000\000\000@2\301\005\000\000\000\000\200\071Z\344N\177\000\000!\000\000\000\000\000\000\000\200\022\301\005\000\000\000\000\002\000\000\000\000\000\000\000`E\240\263\375\177\000\000@E\240\263\375\177\000\000\215\350\254", '\000' <repeats 13 times>, "\002", '\000' <repeats 19 times>, "\230\206\vA@E\240\263\375\177\000\000\200\071Z\344N\177\000\000\200\022\301\005\000\000\000\000\200\071Z\344N\177\000\000\200\022\301\005\000\000\000\000\260\361\353\345N\177\000\000\300E\240\263\375\177\000\000"...
count = <optimized out>
err = <optimized out>
is_ipc = 0
__PRETTY_FUNCTION__ = "uv__read"
bnoordhuis commented 1 year ago

Just a random observation but this:

buf = {base = 0x7f499466abb0 "\026\003\001\002", len = 65536}

Means the buffer contains a handshake frame that's between 512 and 767 bytes big. That's about average for a ClientHello.

(That \002 is the upper octet of a 16 bits big-endian value so at best it's 2*256 + 255 bytes big.)

If you can get me those bytes, I can probably figure out what's wrong.

gugu commented 1 year ago

I forgot about network byte order, now things looking much better. ALPN is completely ok. Here is what I've described so far:

16 - message type
0301 - message version
0200 - content length

01 - client hello message
0001fc - client hello length
0303 - client hello protocol version
125c42b2d5c2396d61c9be6c6e189eefb211d64bdafe85ba6131cabc8aedb2f8 - random value
20 - session id length
8a08d47ae46281fefdfb732c131ecd7c58dba0fb95f842a292ef6ac584f723c4 - session id
0020 - cipher suites length
dada - GREASE compatibility check
1301 - TLS_AES_128_GCM_SHA256 (All of them are decoded by copilot, did not check if it is correct)
1302 - TLS_AES_256_GCM_SHA384
1303 - TLS_CHACHA20_POLY1305_SHA256
c02b - TLS_ECDHE_ECDSA_WITH_AES_128_GCM_SHA256
c02f - TLS_ECDHE_ECDSA_WITH_AES_256_GCM_SHA384
c02c - TLS_ECDHE_ECDSA_WITH_CHACHA20_POLY1305_SHA256
c030 - TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256
cca9 - TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384
cca8 - TLS_ECDHE_RSA_WITH_CHACHA20_POLY1305_SHA256
c013 - TLS_ECDHE_ECDSA_WITH_AES_128_CBC_SHA
c014 - TLS_ECDHE_ECDSA_WITH_AES_256_CBC_SHA
009c - TLS_ECDHE_ECDSA_WITH_AES_128_CBC_SHA256
009d - TLS_ECDHE_ECDSA_WITH_AES_256_CBC_SHA384
002f - TLS_ECDHE_ECDSA_WITH_NULL_SHA
0035 - TLS_ECDHE_ECDSA_WITH_RC4_128_SHA
01 - number of compression methods to follow
00 - compression method "null"
0193 - length of extensions
8a8a0000 - GREASE compatibility check
0000 - server name extension
000e 000c 00 0009 xxxxxxxxxxxxxxxxxx - SNI (hostname can share if needed)
0017 0000 xxxx 0001 - extended master secret, session ticket
0000 0a00 0a - signature algorithms
0008 xxxx - server_authz, have no idea what is it and don't want to know
0010 000e 000c 02 683208 687474702f312e31 - ALPN (h2, http/1.1)

my decoding journey continues

gugu commented 1 year ago
16 - message type
0301 - message version
0200 - content length

01 - client hello message
0001fc - client hello length
0303 - client hello protocol version
125c42b2d5c2396d61c9be6c6e189eefb211d64bdafe85ba6131cabc8aedb2f8 - random value
20 - session id length
8a08d47ae46281fefdfb732c131ecd7c58dba0fb95f842a292ef6ac584f723c4 - session id
0020 - cipher suites length
dada - GREASE compatibility check
1301 - TLS_AES_128_GCM_SHA256 (All of them are decoded by copilot, did not check if it is correct)
1302 - TLS_AES_256_GCM_SHA384
1303 - TLS_CHACHA20_POLY1305_SHA256
c02b - TLS_ECDHE_ECDSA_WITH_AES_128_GCM_SHA256
c02f - TLS_ECDHE_ECDSA_WITH_AES_256_GCM_SHA384
c02c - TLS_ECDHE_ECDSA_WITH_CHACHA20_POLY1305_SHA256
c030 - TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256
cca9 - TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384
cca8 - TLS_ECDHE_RSA_WITH_CHACHA20_POLY1305_SHA256
c013 - TLS_ECDHE_ECDSA_WITH_AES_128_CBC_SHA
c014 - TLS_ECDHE_ECDSA_WITH_AES_256_CBC_SHA
009c - TLS_ECDHE_ECDSA_WITH_AES_128_CBC_SHA256
009d - TLS_ECDHE_ECDSA_WITH_AES_256_CBC_SHA384
002f - TLS_ECDHE_ECDSA_WITH_NULL_SHA
0035 - TLS_ECDHE_ECDSA_WITH_RC4_128_SHA
01 - number of compression methods to follow
00 - compression method "null"
0193 - length of extensions
8a8a0000 - GREASE compatibility check
0000 - server name extension
000e 000c 00 0009 xxxxxxxxxxxxxxxxxx - SNI xxxx.xxxx
0017 0000 ff01 0001 - extended master secret, session ticket
     0000 0a00 0a - signature algorithms
0008 eaea 001d 0017 0018 000b 0002 0100 0023 0000 - server_authz, have no idea what is it
0010 000e 000c 02 683208 687474702f312e31 - ALPN (h2, http/1.1)
0005 0005 0100000000 # OCSP
000d 0012 0010 040308040401050308050501080606010012 00000033002b0029eaea000100
001d 00203f782c482d92e2d7f8e7dcfee21effeaa128b30614bc01f200e051cb57ceaa6b002d00020101002b000b0ababa0304030303020301001b0003020002446900050003026832aaaa000100001500ca00
000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000f3e774

Here is a full message. I tried to repeat it locally in a single thread in a loop 1M iterations, nothing fails. I'll try running in parallel

bnoordhuis commented 1 year ago

Just a quick note but this:

0017 0000 ff01 0001 - extended master secret, session ticket 0000 0a00 0a - signature algorithms

Isn't right. 0017 0000 is the extended master secret extension, ff01 0001 00 is Renegotiation Indication with a (expected for a new connection) empty payload, and then 000a is EC Supported Groups.

gugu commented 1 year ago

I've checked 3 dumps, all of them have perfectly valid ALPN record. My guess now it is some use after free in ALPN callback

gugu commented 1 year ago

One more note: %rax points to incorrect memory address. Can not reproduce it locally (still happens in prod). My C++ knowledge is poor and my asm is even worse, can:

  1. TLSWrap* w = static_cast<TLSWrap*>(arg); - arg be destroyed when callback happens?
  2. w->alpn_protos_ be destroyed when callback happens?
  3. out point to incorrect place? (I think no because it prepared by OpenSSL before each callback call)

Fails always this line:

=> 0x00000000017f0bfc <+76>:    movzbl (%rax),%eax

PS: I see src: optimize ALPN callback commit in node 18.13 (bf17f8dcb8509f3231348a1c20dcf832f9a3c5ab), I'll try to deploy 18.12 and tell the status

gugu commented 1 year ago

@bnoordhuis yep, reverting to version before bf17f8dcb8509f3231348a1c20dcf832f9a3c5ab fixes the problem (18.12.0)

gugu commented 1 year ago
image

I think reverting this change will help

bnoordhuis commented 1 year ago

It seems really unlikely to me that bf17f8d is the culprit but maybe it exposes a preexisting condition.

Apropros that diff, can you explain why you think it makes a difference?

gugu commented 1 year ago

It makes a difference according to my production as well, 18.12.0 has 0 core dumps since yesterday on 10 servers. Default ALPN produced core dump around once per hour, empty ALPN - once per few hours.

  1. According to core dump it tries to write in piece of memory not available for the application. When I try to check %rax registry, it says that memory at that address in not avaialble
  2. The issue does not happen in stress test, but on production there are a lot of memory allocations and gc is frequently called
  3. This comment describes workaround of exactly the problem happens on production

I think it can be either GC running in a wrong time or client closing connection immediately after sending client hello. My service is a link shortener and we have tons of short living requests

bnoordhuis commented 1 year ago

But why specifically do you think retaining a reference to options.ALPNProtocols as ssl._secureContext.alpnBuffer makes a difference? Nothing references it.

gugu commented 1 year ago

Okay, I can not prove 100% it is this line. Do you have config params you use to build nodejs.org/download/... builds? I'll try to create builder docker container and try to remove this line

richardlau commented 1 year ago

See https://github.com/nodejs/node/blob/0e79635c4edfde5528b10962ca8d0b51a92f2474/Makefile#L1244-L1270 for the parameters passed to configure for releases. FYI CONFIG_FLAGS is set to --verbose.

skgith1 commented 1 year ago

@bnoordhuis yep, reverting to version before bf17f8d fixes the problem (18.12.0)

This happens to us on production as well (but not on our dev/test servers). Right now running 18.12.1 on production with no crashing has been stable for weeks. Soon as we load any version >18.13.0 it crashes usually within 24 hours.

ljp0101 commented 1 year ago

Thanks for the tips. Having similar problems and will revert my node version. Speculation is way above my knowledge but wondering if this might be related to 18.16.0 being compiled against openssl 3.0.8 but ubuntu 22.04 running openssl 3.0.2. I'm only using tls for kafkajs messages and the segfaults are purely a function of the number of messages, not message size or time (my setup sends all messages to one container for task a and splits the same messages across four containers for task b, and the number on segfaults from the one task a container equals the sum of the number of segfaults on four task b containers).

Couple different reverse stack traces if it helps anyone. Problem always presents itself through tls functions for me but different ones and segfault address isn't always 0x0.

node(_start+0x2e)[0xabbdee]
/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0x80)[0x7f748dacbe40]
/lib/x86_64-linux-gnu/libc.so.6(+0x29d90)[0x7f748dacbd90]
node(_ZN4node5StartEiPPc+0x2ef)[0xb3976f]
node(_ZN4node22LoadSnapshotDataAndRunEPPKNS_12SnapshotDataEPKNS_20InitializationResultE+0xb8)[0xb35bc8]
node(_ZN4node16NodeMainInstance3RunEv+0xf4)[0xbc1164]
node(_ZN4node13SpinEventLoopEPNS_11EnvironmentE+0x14d)[0xabda6d]
node(uv_run+0x1d0)[0x165f500]
node[0x1666dc9]
node(_ZN4node11Environment14CheckImmediateEP10uv_check_s+0x76)[0xb01fa6]
node(_ZN4node11Environment27RunAndClearNativeImmediatesEb+0x1b8)[0xb019a8]
node[0xd20668]
node(_ZN4node6crypto7TLSWrap6EncOutEv+0x115)[0xd1f835]
node(_ZN4node6crypto7TLSWrap12InvokeQueuedEiPKc+0xb7)[0xd1e847]
/lib/x86_64-linux-gnu/libc.so.6(+0x42520)[0x7f748dae4520]
/app/node_modules/segfault-handler/build/Release/segfault-handler.node(+0x372d)[0x7f748c58e72d]
PID 1 received SIGSEGV for address: 0x0
node(_start+0x2e)[0xabbdee]
/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0x80)[0x7fc21ca30e40]
/lib/x86_64-linux-gnu/libc.so.6(+0x29d90)[0x7fc21ca30d90]
node(_ZN4node5StartEiPPc+0x2ef)[0xb3976f]
node(_ZN4node22LoadSnapshotDataAndRunEPPKNS_12SnapshotDataEPKNS_20InitializationResultE+0xb8)[0xb35bc8]
node(_ZN4node16NodeMainInstance3RunEv+0xf4)[0xbc1164]
node(_ZN4node13SpinEventLoopEPNS_11EnvironmentE+0x14d)[0xabda6d]
node(uv_run+0x1d0)[0x165f500]
node[0x1666dc9]
node(_ZN4node11Environment14CheckImmediateEP10uv_check_s+0x76)[0xb01fa6]
node(_ZN4node11Environment27RunAndClearNativeImmediatesEb+0x1b8)[0xb019a8]
node[0xd20668]
node(_ZN4node6crypto7TLSWrap6EncOutEv+0x115)[0xd1f835]
node(_ZN4node6crypto7TLSWrap12InvokeQueuedEiPKc+0x173)[0xd1e903]
node(_ZThn24_N4node15SimpleWriteWrapINS_9AsyncWrapEED0Ev+0x22)[0xc5c3f2]
node(_ZN4node9AsyncWrapD1Ev+0x52)[0xad3b32]
node(_ZN4node10BaseObjectD1Ev+0x13d)[0xad4e4d]
node(_ZN2v88internal13GlobalHandles9NodeSpaceINS1_4NodeEE7ReleaseEPS3_+0x2a)[0xedf6aa]
/lib/x86_64-linux-gnu/libc.so.6(+0x42520)[0x7fc21ca49520]
/app/node_modules/segfault-handler/build/Release/segfault-handler.node(+0x372d)[0x7fc2156ef72d]
PID 1 received SIGSEGV for address: 0x19
node(_start+0x2e)[0xabbdee]
/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0x80)[0x7f7b3be88e40]
/lib/x86_64-linux-gnu/libc.so.6(+0x29d90)[0x7f7b3be88d90]
node(_ZN4node5StartEiPPc+0x2ef)[0xb3976f]
node(_ZN4node22LoadSnapshotDataAndRunEPPKNS_12SnapshotDataEPKNS_20InitializationResultE+0xb8)[0xb35bc8]
node(_ZN4node16NodeMainInstance3RunEv+0xf4)[0xbc1164]
node(_ZN4node13SpinEventLoopEPNS_11EnvironmentE+0x14d)[0xabda6d]
node(uv_run+0x1d0)[0x165f500]
node[0x1666dc9]
node(_ZN4node11Environment14CheckImmediateEP10uv_check_s+0x76)[0xb01fa6]
node(_ZN4node11Environment27RunAndClearNativeImmediatesEb+0x1b8)[0xb019a8]
node[0xd20668]
node(_ZN4node6crypto7TLSWrap6EncOutEv+0x115)[0xd1f835]
node(_ZN4node6crypto7TLSWrap12InvokeQueuedEiPKc+0x173)[0xd1e903]
node(_ZThn24_N4node15SimpleWriteWrapINS_9AsyncWrapEED0Ev+0x22)[0xc5c3f2]
node(_ZN4node9AsyncWrapD1Ev+0x52)[0xad3b32]
node(_ZN4node10BaseObjectD1Ev+0x13d)[0xad4e4d]
node(_ZN2v88internal13GlobalHandles9NodeSpaceINS1_4NodeEE7ReleaseEPS3_+0x2a)[0xedf6aa]
/lib/x86_64-linux-gnu/libc.so.6(+0x42520)[0x7f7b3bea1520]
/app/node_modules/segfault-handler/build/Release/segfault-handler.node(+0x372d)[0x7f7b38b4872d]
PID 1 received SIGSEGV for address: 0xffffffffffffff18
gugu commented 1 year ago

if this might be related to 18.16.0 being compiled against openssl 3.0.8 but ubuntu 22.04 running openssl 3.0.2

no, node vendors custom version of openssl

bnoordhuis commented 1 year ago

Although it is possible to link node against a different copy of openssl (and e.g. ubuntu builds node that way.)

ldd /path/to/node and node -p process.versions will tell you.

jamesfdickinson commented 1 year ago

I have been experiencing crashes with the same error. I have not been able to reproduce the crash in our dev environment. It happens once or twice a day. I will include my setup in hope it may helps find the root cause. When I revert back to node 18.12.1 I see no crashes.

Ubuntu 22.04.1 LTS Node 20

"dependencies": { "cors": "^2.8.5", "express": "^4.18.2", "got": "^12.6.0", "seedrandom": "^3.0.5", "segfault-handler": "^1.3.0", "ws": "^8.13.0" },

Error PID 1 received SIGSEGV for address: 0x0 /usr/src/app/node_modules/segfault-handler/build/Release/segfault-handler.node(+0x3236)[0x7f7002483236] /lib/x86_64-linux-gnu/libpthread.so.0(+0x13140)[0x7f700213d140] node(SSL_select_next_proto+0x4c)[0x19341bc] node[0xd95bc0] node(tls_handle_alpn+0x53)[0x1978543] node(tls_parse_all_extensions+0x143)[0x195acb3] node(tls_post_process_client_hello+0x70)[0x19787c0] node[0x1966134] node(ssl3_read_bytes+0x320)[0x1954610] node(ssl3_read+0x60)[0x1924200] node(SSL_read+0x87)[0x19319b7] node(_ZN4node6crypto7TLSWrap8ClearOutEv+0x7a)[0xd9e2fa] node(_ZN4node6crypto7TLSWrap12OnStreamReadElRK8uv_buf_t+0xf8)[0xd9f078] node(_ZN4node15LibuvStreamWrap8OnUvReadElPK8uv_buf_t+0x92)[0xce05e2] node[0xce0a0b] node[0x179a4a7] node[0x179acd0] node[0x17a0a74] node(uv_run+0x14e)[0x178ee9e] node(_ZN4node21SpinEventLoopInternalEPNS_11EnvironmentE+0x14d)[0xb23a3d] node(_ZN4node16NodeMainInstance3RunEv+0xf4)[0xc2ac24] node(_ZN4node22LoadSnapshotDataAndRunEPPKNS_12SnapshotDataEPKNS_24InitializationResultImplE+0x6e)[0xb9d52e] node(_ZN4node5StartEiPPc+0x261)[0xba1071] /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xea)[0x7f7001f78d0a] node(_start+0x2e)[0xb219ee] Signal: 11 Address: 0

LewisWolfgang commented 1 year ago

@bnoordhuis : My knowledge of openSSL internals is a bit rusty, but I believe the problem here was introduced with https://github.com/nodejs/node/commit/bf17f8dcb8509f3231348a1c20dcf832f9a3c5ab#. Previously the TLSWrap pointer was stored with the SSL object and used via (SSL_get_app_data(s)).

In the new code, the TLSWrap pointer is stored as a fixed arg on the SSL_CTX object, which IIUC, is shared by multiple TLSWrap objects. (So, if the last one to update it goes away, bad things happen.)

Like the other commenters, we have been unable to reproduce this under controlled circumstances, but it happened routinely when we tries to deploy node 18.17.1 in a production environment. A deep dive into the resulting core dump showed that at the time of the crash, the TLSWrap and SSL objects involved looked fine (they were not deleted or anything), but ssl->ctx->ext.alpn_select_cb_arg was pointing to some memory that no longer appeared to be a TLSWrap object.

bnoordhuis commented 1 year ago

Right, that sounds plausible. Can you see if #49635 fixes it for you? I tried writing a regression test but I'm not actually able to trigger the bug.

LewisWolfgang commented 1 year ago

Thank you so much for responding quickly.

As I said, I've been unable to repro this under test conditions. But maybe now that we know what's happening, it may be possible to contrive a situation to reproduce it... It's a pretty narrow window of time for a second connection to come in, change the callback arg, finish up, get GC'ed, and the memory get reused, all before the client hello is parsed. I'm going to try slamming a server with a combination of connections with very small and very large client hellos to see if the small ones can corrupt things for the large ones....

In terms of a more typical regression test... I think it's possible to drive this machinery from the JS side of things without using any actual sockets/networking... it might be possible to do it that way, but honestly, I'm not familiar enough with the code to know how to do that.

bnoordhuis commented 1 year ago

My test is basically: start two servers with different ALPN callback but sharing the same SecureContext through server.addContext('*', context), then connect to the first one.

You'd expect to end up in the second one's ALPN callback but that doesn't happen (or vice versa.)

LewisWolfgang commented 1 year ago

I'm still a bit fuzzy on how all this works, but doesn't the callback arg in the SecureContext get updated whenever a connection comes in? So in order to get the wrong one, a connection on the other one would need to come in between the socket open and the parsing of client hello?

bnoordhuis commented 1 year ago

Could be. I tried stress testing but that didn't change anything.

LewisWolfgang commented 1 year ago

I still haven't been able to reproduce the issue. After reading the code more carefully, I don't believe the SecureContext objects should be reused in our application. (It's a pretty basic websocket server, as far as this stuff goes.) I am a bit confused about the fact that the context can be changed in the SNI callback. I'm having trouble following the logic there and knowing when it might be used. (We aren't doing anything with virtual servers, but I'm not sure if the callback may be called based on some errant/malicious client input...). But this is probably a red herring...

LewisWolfgang commented 1 year ago

@bnoordhuis : I can confirm that if I do just a basic server like:

      const httpB = https.createServer(
        {
          key: key,
          cert: cert,
          ciphers: CIPHER_SUITE,
        }
      );
      httpB.listen(48000, () => {
        console.log(`HACK server listening on 48000 with tls`);
      });

And have 2 client connections do the following: open connection A wait 1 second open connection B send client hello on B close B wait 50 seconds send client hello on A

Run this with gdb attached, and break on node::crypto::(anonymous namespace)::SelectALPNCallback. They both get the same value in r9 (the callback arg stored in SSL_CTX).

LewisWolfgang commented 1 year ago

And IIUC, that agrees with the theory... the TLSWrap objects will share an SSL_CTX... it's the one associated with server._sharedCreds in the JS side of things.

bnoordhuis commented 1 year ago

@LewisWolfgang that looks plausible, but it's probably too non-determinate to turn into a reliable regression test (not that I mind being proven wrong)

LewisWolfgang commented 1 year ago

Yeah. In an ideal world, objects (TLSWrap) associated with a "child" SSL object shouldn't be making changes to a "parent", shared SSL_CTX object. But as long as all the child objects are making the same change (just a pointer to the class callback), things are fine, and it was working this way for a long time. (Writing a pointer to a specific child instance into the parent has the bad consequences that one might expect...) I can imagine that in the longer term, some sort of refactoring to avoid the "child writing into parent" pattern could be a good idea. In the short term, I think your PR https://github.com/nodejs/node/pull/49635 is the way to go.

sempi commented 1 year ago

We can confirm that this issue still exists on node v18.18.0 and node v20.8.0.

We can also confirm that setting ALPNProtocols: [] does NOT resolve the issue.

About 1 of 2M HTTP requests in a representative production environment crashes the node.js process. Our production workload includes HTTP connection from almost all active browser versions. We do not believe this are malicious attempts as the failure rate follows seasonal patterns so its just a function of how many connections are made overall.

sempi commented 1 year ago

We can confirm that the fix resolves this crash in node 20.8.0.

https://github.com/nodejs/node/pull/49635

bnoordhuis commented 10 months ago

We can confirm that the fix resolves this crash in node 20.8.0.

For posterity: the fix hasn't been released in v20.x yet. v20.10.0 is about to be released and contains the fix.

nromito commented 10 months ago

@bnoordhuis are there plans to backport this to 18.x?

magnetnation commented 5 months ago

We are using v20.11.0. The error is still present.

ivosh commented 4 months ago

I confirm this bug is present with Node 18.18.2 and can be reproduced with 60% probability on the production system where there is quite a heavy load. I cannot reproduce it at all on the pre-production system which is almost identical.