nodejs / node

Node.js JavaScript runtime ✨🐢🚀✨
https://nodejs.org
Other
106.65k stars 29.08k forks source link

Core dump with "double free detected in tcache 2" or assertion failure when using http2 client via @grpc/grpc-js #38964

Open mdouglass opened 3 years ago

mdouglass commented 3 years ago

What steps will reproduce the bug?

repro-node-crash.zip Run the following steps on the same machine:

  1. npm ci
  2. npm run server
  3. npm run client

How often does it reproduce? Is there a required condition?

On my machine this is 100% reproducible in node v16.0 and up. I have not tried previous pre-v16 versions.

What is the expected behavior?

It is expected that the client process runs to completion without error.

What do you see instead?

The client process dies with a core dump or an assertion failure:

$ npm run client

> server@1.0.0 client
> node client.js

duplexEcho response: { message: 'Hello duplex world!' }
unaryEcho response: { message: 'Hello unary world!' }
duplexEcho error Error: 1 CANCELLED: Cancelled on client
    at Object.callErrorFromStatus (/home/matthew/spikes/repro-node-crash/node_modules/@grpc/grpc-js/build/src/call.js:31:26)
    at Object.onReceiveStatus (/home/matthew/spikes/repro-node-crash/node_modules/@grpc/grpc-js/build/src/client.js:390:49)
    at Object.onReceiveStatus (/home/matthew/spikes/repro-node-crash/node_modules/@grpc/grpc-js/build/src/client-interceptors.js:299:181)
    at /home/matthew/spikes/repro-node-crash/node_modules/@grpc/grpc-js/build/src/call-stream.js:145:78
    at processTicksAndRejections (node:internal/process/task_queues:78:11)
duplexEcho status {
  code: 1,
  details: 'Cancelled on client',
  metadata: Metadata { internalRepr: Map(0) {}, options: {} }
}
free(): double free detected in tcache 2
[1]    695652 IOT instruction (core dumped)  npm run client
 npm start
> client@1.0.0 start
> node index.js
duplexEcho response: { message: 'Hello duplex world!' }
unaryEcho response: { message: 'Hello unary world!' }
duplexEcho error Error: 1 CANCELLED: Cancelled on client
    at Object.callErrorFromStatus (/Users/mark/oss/grpc-crash/repro-node-crash/client/node_modules/@grpc/grpc-js/build/src/call.js:31:26)
    at Object.onReceiveStatus (/Users/mark/oss/grpc-crash/repro-node-crash/client/node_modules/@grpc/grpc-js/build/src/client.js:390:49)
    at Object.onReceiveStatus (/Users/mark/oss/grpc-crash/repro-node-crash/client/node_modules/@grpc/grpc-js/build/src/client-interceptors.js:299:181)
    at /Users/mark/oss/grpc-crash/repro-node-crash/client/node_modules/@grpc/grpc-js/build/src/call-stream.js:145:78
    at processTicksAndRejections (node:internal/process/task_queues:78:11)
duplexEcho status {
  code: 1,
  details: 'Cancelled on client',
  metadata: Metadata { internalRepr: Map(0) {}, options: {} }
}
node[57071]: ../src/node_http2.cc:450:void node::http2::Http2Session::CheckAllocatedSize(size_t) const: Assertion `(current_nghttp2_memory_) >= (previous_size)' failed.
 1: 0x1026e4ee5 node::Abort() (.cold.1) [/Users/mark/.nvm/versions/node/v16.0.0/bin/node]
 2: 0x10139adc9 node::Abort() [/Users/mark/.nvm/versions/node/v16.0.0/bin/node]
 3: 0x10139ac31 node::Assert(node::AssertionInfo const&) [/Users/mark/.nvm/versions/node/v16.0.0/bin/node]
 4: 0x1013d5885 node::mem::NgLibMemoryManager<node::http2::Http2Session, nghttp2_mem>::FreeImpl(void*, void*) [/Users/mark/.nvm/versions/node/v16.0.0/bin/node]
 5: 0x102278942 nghttp2_session_close_stream [/Users/mark/.nvm/versions/node/v16.0.0/bin/node]
 6: 0x102280266 nghttp2_session_mem_recv [/Users/mark/.nvm/versions/node/v16.0.0/bin/node]
 7: 0x1013bf719 node::http2::Http2Session::ConsumeHTTP2Data() [/Users/mark/.nvm/versions/node/v16.0.0/bin/node]
 8: 0x1013c2b8c node::http2::Http2Session::OnStreamRead(long, uv_buf_t const&) [/Users/mark/.nvm/versions/node/v16.0.0/bin/node]
 9: 0x10145da2b node::LibuvStreamWrap::OnUvRead(long, uv_buf_t const*) [/Users/mark/.nvm/versions/node/v16.0.0/bin/node]
10: 0x101da205e uv__stream_io [/Users/mark/.nvm/versions/node/v16.0.0/bin/node]
11: 0x101daae4c uv__io_poll [/Users/mark/.nvm/versions/node/v16.0.0/bin/node]
12: 0x101d97521 uv_run [/Users/mark/.nvm/versions/node/v16.0.0/bin/node]
13: 0x1012ccedf node::SpinEventLoop(node::Environment*) [/Users/mark/.nvm/versions/node/v16.0.0/bin/node]
14: 0x1013dc254 node::NodeMainInstance::Run(node::EnvSerializeInfo const*) [/Users/mark/.nvm/versions/node/v16.0.0/bin/node]
15: 0x10136973d node::Start(int, char**) [/Users/mark/.nvm/versions/node/v16.0.0/bin/node]
16: 0x7fff20564621 start [/usr/lib/system/libdyld.dylib]
17: 0x2
zsh: abort      npm start

Additional information

This is related to https://github.com/grpc/grpc-node/issues/1464.

There is a workaround available in my example client.js if you comment out line 45 and comment in line 46 (delaying stream.cancel via the use of setImmediate).

It seems to require this particular pattern of calls in the @grpc/grpc-js library as well:

  1. Starting a duplex grpc call stream
  2. Sending at least one request/response over the duplex stream
  3. Starting and completing a unary grpc call
  4. Cancelling the duplex stream
aduh95 commented 3 years ago

Any chance you could provide a repro that doesn't include npm dependencies (or downloading a potentially sketchy zip file 🙃)?

bl-ue commented 3 years ago

Indeed, I can reproduce on macOS 11.4 with node v15.5.1 (that's what I last had 🙂).

❯ npm run client

> server@1.0.0 client
> node client.js

duplexEcho response: { message: 'Hello duplex world!' }
unaryEcho response: { message: 'Hello unary world!' }
duplexEcho error Error: 1 CANCELLED: Cancelled on client
    at Object.callErrorFromStatus (/Users/me/Downloads/repro-node-crash/node_modules/@grpc/grpc-js/build/src/call.js:31:26)
    at Object.onReceiveStatus (/Users/me/Downloads/repro-node-crash/node_modules/@grpc/grpc-js/build/src/client.js:390:49)
    at Object.onReceiveStatus (/Users/me/Downloads/repro-node-crash/node_modules/@grpc/grpc-js/build/src/client-interceptors.js:299:181)
    at /Users/me/Downloads/repro-node-crash/node_modules/@grpc/grpc-js/build/src/call-stream.js:145:78
    at processTicksAndRejections (node:internal/process/task_queues:75:11)
duplexEcho status {
  code: 1,
  details: 'Cancelled on client',
  metadata: Metadata { internalRepr: Map(0) {}, options: {} }
}
node(48717,0x104f70e00) malloc: *** error for object 0x104b31db0: pointer being freed was not allocated
node(48717,0x104f70e00) malloc: *** set a breakpoint in malloc_error_break to debug
zsh: abort      npm run client

I'll investigate some more and when I come to a conclusion, I'll pull in some guys.

bl-ue commented 3 years ago

~@mdouglass is the problem (what's not expected) the error itself duplexEcho error Error: 1 CANCELLED: Cancelled on client (and its stacktrace) or the core dump? Or both?~ Never mind, I see the core dump is the proble.

bl-ue commented 3 years ago

It seems like a bug in HTTP2. cc @jasnell @addaleax because you're familiar with HTTP2 in node. I think the line that causes the abort is https://github.com/grpc/grpc-node/blob/663fe77e7286910da0ae067cde5622dd2036bb30/packages/grpc-js/src/call-stream.ts#L679

mdouglass commented 3 years ago

Any chance you could provide a repro that doesn't include npm dependencies (or downloading a potentially sketchy zip file upside_down_face)?

Apologies, I thought the zip of the repro would be most appropriate. I've posted the repo at https://github.com/mdouglass/repro-node-crash as well if you'd rather get it there.

I don't know enough about the @grpc/grpc-js internals to be able to create a repro without that lib quickly. If it turns out to be necessary to investigate it further, I can try to figure it out.

bl-ue commented 3 years ago

@mdouglass save yourself the time — I'm investigating right now

mdouglass commented 3 years ago

It seems like a bug in HTTP2. cc @jasnell @addaleax because you're familiar with HTTP2 in node. I think the line that causes the abort is https://github.com/grpc/grpc-node/blob/663fe77e7286910da0ae067cde5622dd2036bb30/packages/grpc-js/src/call-stream.ts#L679

That reminded me that I should have tried to run this with the grpc-node tracing turned on via:

export GRPC_TRACE=all
export GRPC_VERBOSITY=DEBUG

If I do that, the last log line out is indeed the one immediately before https://github.com/grpc/grpc-node/blob/663fe77e7286910da0ae067cde5622dd2036bb30/packages/grpc-js/src/call-stream.ts#L679 that @bl-ue identified.

See https://gist.github.com/mdouglass/a4c521167daf8e5e0246bb862b419f36

bl-ue commented 3 years ago

Yep, I tried those vars too which is what got me that line. I'm building node now with a bunch of debug params to see if I can spot anything obvious before digging in to the node source more:

./configure --debug --debug-node --gdb --v8-non-optimized-debug --v8-enable-object-print --debug-nghttp2 --node-builtin-modules-path=$PWD/lib --enable-asan --debug-lib
bl-ue commented 3 years ago

So, after the long (37min) build, I tried running the server which worked like normal, and the client eventually (this 393mb node takes an extraordinarily long time to start up 😄) started executing, and printed this error instead of the core dump or the assertion failure (both of which I've gotten earlier and got a bunch more on the next runs):

Assertion failed: (stream->dep_prev), function nghttp2_stream_dep_remove, file ../deps/nghttp2/lib/nghttp2_stream.c, line 777.
zsh: abort      npm run client

@jasnell does that look like a real problem or some unrelated issue caused by debugging? https://github.com/nodejs/node/blob/e46c680bf2b211bbd52cf959ca17ee98c7f657f5/deps/nghttp2/lib/nghttp2_stream.c#L777

bl-ue commented 3 years ago

ping @jasnell

mdouglass commented 3 years ago

I just built with ./configure --debug --enable-asan. After that I do see an ASAN heap-use-after-free. See https://gist.github.com/mdouglass/18514a2667572d742ef2c2d38f45249c

bl-ue commented 3 years ago

Wow, that's a lot of cool output — I didn't get any of that 🤔

==2805953==ERROR: AddressSanitizer: heap-use-after-free on address 0x6110000b7e74 at pc 0x000006292f8b bp 0x7ffc581e67b0 sp 0x7ffc581e67a8
READ of size 1 at 0x6110000b7e74 thread T0
    #0 0x6292f8a in nghttp2_session_close_stream ../deps/nghttp2/lib/nghttp2_session.c:1218
    #1 0x6293def in nghttp2_session_close_stream_if_shut_rdwr ../deps/nghttp2/lib/nghttp2_session.c:1442
    #2 0x629b653 in session_end_stream_headers_received ../deps/nghttp2/lib/nghttp2_session.c:3738
    #3 0x629bc7f in session_after_header_block_received ../deps/nghttp2/lib/nghttp2_session.c:3826
    #4 0x62a689e in nghttp2_session_mem_recv ../deps/nghttp2/lib/nghttp2_session.c:6277
    #5 0x140f19d in node::http2::Http2Session::ConsumeHTTP2Data() ../src/node_http2.cc:798
    #6 0x1418f2c in node::http2::Http2Session::OnStreamRead(long, uv_buf_t const&) ../src/node_http2.cc:1883
    #7 0x12375c4 in node::StreamResource::EmitRead(long, uv_buf_t const&) ../src/stream_base-inl.h:104
    #8 0x1720366 in node::LibuvStreamWrap::OnUvRead(long, uv_buf_t const*) ../src/stream_wrap.cc:276
    #9 0x171fb08 in operator() ../src/stream_wrap.cc:198
    #10 0x171fb38 in _FUN ../src/stream_wrap.cc:199
    #11 0x523b8c5 in uv__read ../deps/uv/src/unix/stream.c:1239
    #12 0x523bfa9 in uv__stream_io ../deps/uv/src/unix/stream.c:1306
    #13 0x524e7d7 in uv__io_poll ../deps/uv/src/unix/linux-core.c:462
    #14 0x520c612 in uv_run ../deps/uv/src/unix/core.c:385
    #15 0x107ae80 in node::SpinEventLoop(node::Environment*) ../src/api/embed_helpers.cc:36
    #16 0x14ca670 in node::NodeMainInstance::Run(node::EnvSerializeInfo const*) ../src/node_main_instance.cc:143
    #17 0x12aa567 in node::Start(int, char**) ../src/node.cc:1125
    #18 0x5e68beb in main ../src/node_main.cc:127
    #19 0x7fa6c1feab74 in __libc_start_main (/lib64/libc.so.6+0x27b74)
    #20 0x106d18d in _start (/home/matthew/projects/node/out/Debug/node+0x106d18d)

Hmm, that almost looks like a bug in nghttp2 — but I'm not experienced enough to say that.

ping @jasnell I just know you're the best to take a look at this :)

mdouglass commented 3 years ago

Adding a bit more context, I turned on the nghttp2 debug as well and got this: https://gist.github.com/mdouglass/fe9a159b6d9b167ac1da14386bf773a5.

I'm suspicious of this:

stream: stream(0x6110000b7ec8)=3 close
unaryEcho response: { message: 'Hello unary world!' }
calling stream.cancel
...
stream: stream(0x6110000b7ec8)=3 close

I'm seeing that the same stream was closed twice and that the address of that stream is within a short distance of where the heap-use-after-free. The first close is right before I receive the unary response and the second close is right after I call cancel on the duplex stream. So to my eyes that looks like the two streams getting confused somehow. Will keep poking at it.

mdouglass commented 3 years ago

Continuing to poke at this as I can. I added logging at the top and bottom of nghttp2_session_close_stream so I could see when we enter/exit calls to this method. In the working version of the repro, we enter nghttp2_session_close_stream, exit it and then enter and exit it again.

In the non-working version of the repro, we enter nghttp2_session_close_stream and then re-enter it on the same thread:

bad version:

stream: thread=140488664597056 stream(unknown)=3 request close <-- initial entry
stream: thread=140488664597056 session(0x61e000009088) stream(0x6110000a9c88)=3 close
CLIENT: Received unaryEcho result
CLIENT: unaryEcho response: { message: 'Hello unary world!' }
CLIENT: calling stream.cancel
stream: stream=1 resume item=0x60e00002af28 flags=08
stream: stream=1 obq push cycle=0
stream: push stream 1 to stream 0
stream: adjusting kept idle streams num_idle_streams=0, max=100
send: next frame: payloadlen=4, type=3, flags=0x00, stream_id=3
send: start transmitting frame type=3, length=13
stream: thread=140488664597056 stream(unknown)=3 request close <-- re-entry!
stream: thread=140488664597056 session(0x61e000009088) stream(0x6110000a9c88)=3 close
stream: destroy closed stream(0x6110000a9c88)=3
stream: dep_remove stream(0x6110000a9c88)=3
stream: thread=140488664597056 session(0x61e000009088) stream(0x6110000a9c88)=3 normal return <-- exit

The second call never exits as it ultimately crashes with the use after free call seen above.

I tried reading through the nghttp2 docs to see if this is allowed and can't see if it is allowed or not. There is a comment that a single nghttp2_session object must be used by a single thread (which is why I logged threadid) but nothing about reentrancy.

edit: Ignore the unknown in the request close line, I am logging that before retrieving the stream pointer since I wanted to make sure the log line was as early as possible.

mdouglass commented 3 years ago

So I was trying to figure out how we ended up re-entrant. It looks like: nghttp2_session_close_stream calls session->callbacks.on_stream_close_callback that maps to Http2Session::OnStreamClose which calls env->http2session_on_stream_close_function

I can't tell (yet) what that maps to, but inside that code we seem to transfer control back to the javascript side and we see the unaryEcho callback processed, which resolves the promise and allows main to call stream.cancel and that causes the reentrancy:

stream: thread=140651786183232 stream(unknown)=3 request close
stream: thread=140651786183232 session(0x61e000009088) stream(0x6110000a9c88)=3 close
Http2Session client (26) stream 3 closed with code: 0
Http2Session client (26) stream 3 Calling stream->Close(0)
HttpStream 3 (33) [Http2Session client (26)] closed with code 0
Http2Session client (26) stream 3 Returned stream->Close(0)
Http2Session client (26) stream 3 Calling env->http2session_on_stream_close_function
HttpStream 3 (33) [Http2Session client (26)] destroying stream
HttpStream 3 (33) [Http2Session client (26)] destroying stream
CLIENT: Received unaryEcho result
CLIENT: unaryEcho response: { message: 'Hello unary world!' }
CLIENT: calling stream.cancel
stream: stream=1 resume item=0x60e00002af28 flags=08
stream: stream=1 obq push cycle=0
stream: push stream 1 to stream 0
HttpStream 1 (29) [Http2Session client (26)] writable side shutdown
HttpStream 1 (29) [Http2Session client (26)] sending rst_stream with code 8
Http2Session client (26) sending pending data
stream: adjusting kept idle streams num_idle_streams=0, max=100
send: next frame: payloadlen=4, type=3, flags=0x00, stream_id=3
send: start transmitting frame type=3, length=13
stream: thread=140651786183232 stream(unknown)=3 request close
mdouglass commented 3 years ago

I think this is probably as far as I can take it without finding out on the nghttp2 side whether or not the API is supposed to be re-entrant.

If it is, then this seems to become a bug in nghttp2.

If it is not, then this seems like a flaw in node_http2.cc which would need to specifically delay any callbacks to JavaScript code until after we finish processing the callback from nghttp2 to node_http2 in order to prevent any chance of JavaScript code triggering reentrancy.

@jasnell would really appreciate you weighing in or or pointing me to someone appropriate to hound. Thanks :pray:

mdouglass commented 3 years ago

Last note on the reentrancy aspect -- both the good/bad repro cases are making two calls to nghttp2_session_close_stream(session, 3, 0).

In the good case, the second call is not reentrant and the call to nghttp2_session_get_stream at the start of that method returns NULL, so nghttp2_session_close_stream returns immediately doing nothing.

In the bad case, the second call is reentrant (it happens during the on_stream_close_callback of the first call) and the second call runs to completion calling nghttp2_session_destroy_stream before it exits which frees the memory for the stream. Control exits and returns into the first call which now triggers the ASAN violation when it next attempts to access the stream at if ((stream->flags & NGHTTP2_STREAM_FLAG_PUSH)).

If you want to support reentrancy it seems like we need to mark the stream as in-process closing (maybe via stream->flags?) so that the reentrant call can detect it is reentrant on the close and noop.

mdouglass commented 3 years ago

So, according to https://github.com/nghttp2/nghttp2/issues/1590 and https://nghttp2.org/documentation/programmers-guide.html#remarks, it is not legal in nghttp2 to call these functions from an nghttp2 callback: nghttp2_session_send, nghttp2_session_mem_send, nghttp2_session_recv or nghttp2_session_mem_recv.

In node, when in the on_stream_close_callback Http2Session::OnStreamClose, we transition into JavaScript code when calling the http2session_on_stream_close_function callback function. That allows the JavaScript code to do anything it wants and potentially trigger a reentrant call into nghttp2.

In this particular bug, I close an nghttp2 stream, which ultimately causes Http2Stream::SubmitRstStream to be invoked, which in turn calls Http2Session::SendPendingData which makes the reentrant call to nghttp2_session_mem_send violating nghttp2's rules. I have verified this by commenting out the code in SubmitRstStream which does in fact prevent the crash.

A fix for the problem is a more complex issue. I believe the safest approach would be to set a flag when entering/exiting any of the nghttp2 callbacks and have SendPendingData return early if that flag is set. This seems similar in spirit to the existing check in SendPendingData around is_sending to prevent itself from being called recursively.

An alternative would be to just address the identified problem with close and have SubmitRstStream use SetImmediate to escape the nghttp2 callback's stack. My gut says that although this would be more targeted, it is likely a less appropriate fix.

I'm happy to try to make this change, but I'd love someone to verify that there's not a better approach or something I'm missing before I get started.

kumarak commented 3 years ago

@mdouglass, I raised PR #39423 to fix the double-free error but need advice from someone having experience with nghttp2. The issue seems to be because of the handling of the RST_STREAM frame received with no error code and cancel error code. The node tries to force process it and purge any existing data for the stream. This causes nghttp2 to close the already destroyed stream causing the double-free error. I have a limited understanding of nghttp2 and can't explain such behavior.

The workaround solution for this could be not to force purge the data on receiving RST_STREAM with cancel code and add it to the pending list for safe processing. It avoids the double-free error. Although, I still see some of the flaky test cases failing. Could you look into the changes and let me know your feedback?

kapouer commented 3 years ago

Hi @mdouglass i still reproduce free(): double free detected in tcache 2 with your repository and the official nodejs binary 12.22.4.

kumarak commented 3 years ago

Hi @kapouer, Do you see it only with v12.22.4? The patch was backported for v12.x. I will test it with the repo.

kapouer commented 3 years ago

@kumarak i see it with 12.21.0 and 12.22.4 when built for debian (shared libs...) and with official 12.22.4 binary. The repo need a simple require("assert/strict") -> require("assert") fix to work.

kumarak commented 3 years ago

Thanks, @kapouer for reporting! I had a different version of scripts causing the same double-free error and reported. That got fixed but it seems the issue still exists.

The reason for the issue is same. The handling of RST_STREAM does force-purging of data and causing the double-free error. The previous patch avoids it in limited cases (if the stream is reading), but there could be other stream states as well causing the issue(reading & paused, paused, maybe some cases of writing as well). An ideal solution could be to delay the stream in all cases if the session is in scope. But I am not sure of the side-effect of the change.

tcullum-rh commented 3 years ago

Hi @mdouglass i still reproduce free(): double free detected in tcache 2 with your repository and the official nodejs binary 12.22.4.

So just to confirm, this is essentially stating that the fix release does not actually fix this?

kumarak commented 3 years ago

Hi @tcullum-rh, The issue is not completely fixed with the release. There could still be a version of the scripts that can cause double-free issues. The previous fix had checks based on stream state which is not perfect and can be bypassed. I also find the stream states are not consistent with the different versions of node LTS.

I think the best solution is to delay the handling of streams on receiving all rst_stream frames with the error code NGHTTP2_CANCEL. That will not let it depend on the stream state. I raised a PR with the changes #39622.

The reason I choose to handle rst_stream with cancel error code differently is that the issue only occurs on receiving frames with cancel code. Also, it is probably not good to delay the handling of streams on receiving all frames since they may be raised due to protocol/connection error and should be handled immediately.

SimonWoolf commented 2 years ago

I still see this crash in node 16.14.2, which includes the second fix that @kumarak did https://github.com/nodejs/node/pull/39622 , when using grpc-js with tls. (NB: only reproducible with tls; the same loadtest with tls turned off has no crashes).

Stacktrace is either

node[1]: ../src/node_http2.cc:460:void node::http2::Http2Session::CheckAllocatedSize(size_t) const: Assertion `(current_nghttp2_memory_) >= (previous_size)' failed.
 1: 0xafa3a0 node::Abort() [node]
 2: 0xafa414  [node]
 3: 0xb2e23c node::mem::NgLibMemoryManager<node::http2::Http2Session, nghttp2_mem>::ReallocImpl(void*, unsigned long, void*) [node]
 4: 0xb2e3a8 node::mem::NgLibMemoryManager<node::http2::Http2Session, nghttp2_mem>::FreeImpl(void*, void*) [node]
 5: 0x1974308 nghttp2_session_close_stream [node]
 6: 0x197b5c4 nghttp2_session_mem_recv [node]
 7: 0xb23e50 node::http2::Http2Session::ConsumeHTTP2Data() [node]
 8: 0xb241c4 node::http2::Http2Session::OnStreamRead(long, uv_buf_t const&) [node]
 9: 0xbceb20 node::LibuvStreamWrap::OnUvRead(long, uv_buf_t const*) [node]
10: 0x14cf408  [node]
11: 0x14cfc18  [node]
12: 0x14d59b8  [node]
13: 0x14c4f80 uv_run [node]
14: 0xa48114 node::SpinEventLoop(node::Environment*) [node]
15: 0xb358f0 node::NodeMainInstance::Run(node::EnvSerializeInfo const*) [node]
16: 0xac3380 node::Start(int, char**) [node]
17: 0xffffabf63090 __libc_start_main [/lib/aarch64-linux-gnu/libc.so.6]
18: 0xa4647c  [node]

or in other cases, just

free(): double free detected in tcache 2

I have no specific event orderings that reproduce, but I can make it happen reasonably reliably it by running a loadtest which involves a lot of grpc call/client work and churn.

mcollina commented 2 years ago

@kumarak could you take a look?

kumarak commented 2 years ago

@SimonWoolf, Do you have a script that can help me reproduce the issue with v16.14.2? The fix delays the processing of RST_STREAM in a specific case to avoid the double free due to the unwanted behavior of nghttp2. The check might be getting bypassed. I will relook into the issue and discuss with the nghttp2 community about the correct fix.

SimonWoolf commented 2 years ago

I'm afraid I don't have a script that I'm able to share, sorry (it's a loadtest of a non-open-source distributed system, I haven't been able to generate a minimal reproduction).

But if you have some debug flags I could turn on, or a custom build of node with extra logging or something that you'd like me to run the tests against to get more information when the crash happens, I'd be happy to do so.

facutuesca commented 2 years ago

@SimonWoolf I'm looking into this issue right now. Is there any chance you could run the tests against a build with ASAN? The stack trace provided by it would give us a better clue of where the problematic free is happening.

SimonWoolf commented 2 years ago

Sure - unfortunately I'm about to go on holiday; I'll be back and hope to be able to get you that info on the 7th/8th of June.

SimonWoolf commented 2 years ago

Sorry for the delay; some other things came up (plus I had some trouble compiling the asan build).

Running the loadtest with asan build of node, most of the instances quickly died with the following stacktrace. This is a different error to the above; no idea of whether this is the same bug being exposed earlier by the asan build or whether this is some other issue, hopefully you do 🙂

FATAL ERROR: v8::HandleScope::CreateHandle() Cannot create a handle without a HandleScope
 1: 0x55f7c9983faf node::PosixSymbolDebuggingContext::GetStackTrace(void**, int) [node]
 2: 0x55f7c99810ad node::DumpBacktrace(_IO_FILE*) [node]
 3: 0x55f7c9cd88a7 node::Abort() [node]
 4: 0x55f7c9cd8db9  [node]
 5: 0x55f7ca519edd v8::Utils::ReportApiFailure(char const*, char const*) [node]
 6: 0x55f7cac75a91 v8::internal::HandleScope::Extend(v8::internal::Isolate*) [node]
 7: 0x55f7ca521956 v8::HandleScope::CreateHandle(v8::internal::Isolate*, unsigned long) [node]
 8: 0x55f7ca521b83 v8::EscapableHandleScope::EscapableHandleScope(v8::Isolate*) [node]
 9: 0x55f7ca5b60a2 v8::ObjectTemplate::NewInstance(v8::Local<v8::Context>) [node]
10: 0x55f7c9d74499 node::http2::Http2Stream::New(node::http2::Http2Session*, int, nghttp2_headers_category, int) [node]
11: 0x55f7c9d64b13 node::http2::Http2Session::OnBeginHeadersCallback(nghttp2_session*, nghttp2_frame const*, void*) [node]
12: 0x55f7cd648908  [node]
13: 0x55f7cd6484d6 nghttp2_session_on_request_headers_received [node]
14: 0x55f7cd658947  [node]
15: 0x55f7cd650790 nghttp2_session_mem_recv [node]
16: 0x55f7c9d735bc node::http2::Http2Session::ConsumeHTTP2Data() [node]
17: 0x55f7c9d7ca38 node::http2::Http2Session::OnStreamAfterWrite(node::WriteWrap*, int) [node]
18: 0x55f7ca152868 node::StreamResource::EmitAfterWrite(node::WriteWrap*, int) [node]
19: 0x55f7ca14d4db node::WriteWrap::OnDone(int) [node]
20: 0x55f7c9b30a28 node::StreamReq::Done(int, char const*) [node]
21: 0x55f7ca3ec092 node::crypto::TLSWrap::InvokeQueued(int, char const*) [node]
22: 0x55f7ca3f311e node::crypto::TLSWrap::EncOut() [node]
23: 0x55f7ca3f4779 node::crypto::TLSWrap::OnStreamAfterWrite(node::WriteWrap*, int) [node]
24: 0x55f7ca40df36  [node]
25: 0x55f7ca40de48  [node]
26: 0x55f7c9a1d2dd  [node]
27: 0x55f7c9a1b477 node::Environment::RunAndClearNativeImmediates(bool) [node]
28: 0x55f7c9a1998e node::Environment::CheckImmediate(uv_check_s*) [node]
29: 0x55f7cc8d4a0e  [node]
30: 0x55f7cc8afd7f uv_run [node]
31: 0x55f7c982cefe node::SpinEventLoop(node::Environment*) [node]
32: 0x55f7c9e582af node::NodeMainInstance::Run(int*, node::Environment*) [node]
33: 0x55f7c9e56bcc node::NodeMainInstance::Run(node::EnvSerializeInfo const*) [node]
34: 0x55f7c9b6bbc7 node::Start(int, char**) [node]
35: 0x55f7cd114200 main [node]
36: 0x7ff600a60d90  [/lib/x86_64-linux-gnu/libc.so.6]
37: 0x7ff600a60e40 __libc_start_main [/lib/x86_64-linux-gnu/libc.so.6]
38: 0x55f7c9759715 _start [node]
PID 1 received SIGSEGV for address: 0x0
node(__interceptor_backtrace+0x5b)[0x55f7c979524b]
/opt/ably/node_modules/segfault-handler/build/Release/segfault-handler.node(+0x3745)[0x7ff600003745]
/lib/x86_64-linux-gnu/libc.so.6(+0x42520)[0x7ff600a79520]
/lib/x86_64-linux-gnu/libc.so.6(abort+0x178)[0x7ff600a5f898]
node(+0x1ed88e4)[0x55f7c9cd88e4]
node(+0x1ed8db9)[0x55f7c9cd8db9]
node(_ZN2v85Utils16ReportApiFailureEPKcS2_+0xbd)[0x55f7ca519edd]
node(_ZN2v88internal11HandleScope6ExtendEPNS0_7IsolateE+0xb1)[0x55f7cac75a91]
node(_ZN2v811HandleScope12CreateHandleEPNS_8internal7IsolateEm+0x76)[0x55f7ca521956]
node(_ZN2v820EscapableHandleScopeC1EPNS_7IsolateE+0x103)[0x55f7ca521b83]
node(_ZN2v814ObjectTemplate11NewInstanceENS_5LocalINS_7ContextEEE+0x252)[0x55f7ca5b60a2]
node(_ZN4node5http211Http2Stream3NewEPNS0_12Http2SessionEi24nghttp2_headers_categoryi+0x229)[0x55f7c9d74499]
node(_ZN4node5http212Http2Session22OnBeginHeadersCallbackEP15nghttp2_sessionPK13nghttp2_framePv+0x3d3)[0x55f7c9d64b13]
node(+0x5848908)[0x55f7cd648908]
node(nghttp2_session_on_request_headers_received+0x856)[0x55f7cd6484d6]
node(+0x5858947)[0x55f7cd658947]
node(nghttp2_session_mem_recv+0x12a0)[0x55f7cd650790]
node(_ZN4node5http212Http2Session16ConsumeHTTP2DataEv+0x6fc)[0x55f7c9d735bc]
node(_ZN4node5http212Http2Session18OnStreamAfterWriteEPNS_9WriteWrapEi+0x518)[0x55f7c9d7ca38]
node(_ZN4node14StreamResource14EmitAfterWriteEPNS_9WriteWrapEi+0x198)[0x55f7ca152868]
node(_ZN4node9WriteWrap6OnDoneEi+0x2b)[0x55f7ca14d4db]
node(_ZN4node9StreamReq4DoneEiPKc+0x628)[0x55f7c9b30a28]
node(_ZN4node6crypto7TLSWrap12InvokeQueuedEiPKc+0x3a2)[0x55f7ca3ec092]
node(_ZN4node6crypto7TLSWrap6EncOutEv+0xa7e)[0x55f7ca3f311e]
node(_ZN4node6crypto7TLSWrap18OnStreamAfterWriteEPNS_9WriteWrapEi+0x749)[0x55f7ca3f4779]
node(+0x260df36)[0x55f7ca40df36]
node(+0x260de48)[0x55f7ca40de48]
node(+0x1c1d2dd)[0x55f7c9a1d2dd]
node(_ZN4node11Environment27RunAndClearNativeImmediatesEb+0x4e7)[0x55f7c9a1b477]
node(_ZN4node11Environment14CheckImmediateEP10uv_check_s+0x2ce)[0x55f7c9a1998e]
node(+0x4ad4a0e)[0x55f7cc8d4a0e]
node(uv_run+0x11f)[0x55f7cc8afd7f]
Matt-Esch commented 1 year ago

@facutuesca I have managed to reproduce with an ASAN build, see:

=================================================================
==1==ERROR: AddressSanitizer: heap-use-after-free on address 0xffff85539b9c at pc 0x0000030ac3b0 bp 0xffffdfed3f00 sp 0xffffdfed3ef8
READ of size 1 at 0xffff85539b9c thread T0
    #0 0x30ac3ac in nghttp2_session_close_stream (/usr/local/bin/node+0x30ac3ac)
    #1 0x30bd030 in nghttp2_session_mem_recv (/usr/local/bin/node+0x30bd030)
    #2 0xd45fa4 in node::http2::Http2Session::ConsumeHTTP2Data() (/usr/local/bin/node+0xd45fa4)
    #3 0xd4f4d8 in node::http2::Http2Session::OnStreamRead(long, uv_buf_t const&) (/usr/local/bin/node+0xd4f4d8)
    #4 0xf74bdc in node::LibuvStreamWrap::OnUvRead(long, uv_buf_t const*) (/usr/local/bin/node+0xf74bdc)
    #5 0xf76140 in node::LibuvStreamWrap::ReadStart()::$_1::__invoke(uv_stream_s*, long, uv_buf_t const*) (/usr/local/bin/node+0xf76140)
    #6 0x279a5c4 in uv__read /node-v16.18.0/out/../deps/uv/src/unix/stream.c:1247:7
    #7 0x279a5c4 in uv__stream_io /node-v16.18.0/out/../deps/uv/src/unix/stream.c:1315:5
    #8 0x27aaac4 in uv__io_poll /node-v16.18.0/out/../deps/uv/src/unix/epoll.c:374:11
    #9 0x277c788 in uv_run /node-v16.18.0/out/../deps/uv/src/unix/core.c:389:5
    #10 0xaa8b48 in node::SpinEventLoop(node::Environment*) (/usr/local/bin/node+0xaa8b48)
    #11 0xda2ac0 in node::NodeMainInstance::Run(int*, node::Environment*) (/usr/local/bin/node+0xda2ac0)
    #12 0xda2014 in node::NodeMainInstance::Run() (/usr/local/bin/node+0xda2014)
    #13 0xc4555c in node::Start(int, char**) (/usr/local/bin/node+0xc4555c)
    #14 0xffffb5217e0c in __libc_start_main (/lib/aarch64-linux-gnu/libc.so.6+0x20e0c)
    #15 0xa0a6e8 in _start (/usr/local/bin/node+0xa0a6e8)
0xffff85539b9c is located 220 bytes inside of 224-byte region [0xffff85539ac0,0xffff85539ba0)
freed by thread T0 here:
    #0 0xa77320 in free (/usr/local/bin/node+0xa77320)
    #1 0xd76a70 in node::mem::NgLibMemoryManager<node::http2::Http2Session, nghttp2_mem>::FreeImpl(void*, void*) (/usr/local/bin/node+0xd76a70)
    #2 0x30ac31c in nghttp2_session_close_stream (/usr/local/bin/node+0x30ac31c)
    #3 0x30b17a8 in session_after_frame_sent1 (/usr/local/bin/node+0x30b17a8)
    #4 0x30ad090 in nghttp2_session_mem_send (/usr/local/bin/node+0x30ad090)
    #5 0xd43c04 in node::http2::Http2Session::SendPendingData() (/usr/local/bin/node+0xd43c04)
    #6 0xd4785c in node::http2::Http2Stream::SubmitRstStream(unsigned int) (/usr/local/bin/node+0xd4785c)
    #7 0xd5a454 in node::http2::Http2Stream::RstStream(v8::FunctionCallbackInfo<v8::Value> const&) (/usr/local/bin/node+0xd5a454)
    #8 0x12fae78 in v8::internal::FunctionCallbackArguments::Call(v8::internal::CallHandlerInfo) (/usr/local/bin/node+0x12fae78)
    #9 0x12f9e24 in v8::internal::MaybeHandle<v8::internal::Object> v8::internal::(anonymous namespace)::HandleApiCallHelper<false>(v8::internal::Isolate*, v8::internal::Handle<v8::internal::HeapObject>, v8::internal::Handle<v8::internal::HeapObject>, v8::internal::Handle<v8::internal::FunctionTemplateInfo>, v8::internal::Handle<v8::internal::Object>, v8::internal::BuiltinArguments) (/usr/local/bin/node+0x12f9e24)
    #10 0x12f87b8 in v8::internal::Builtin_HandleApiCall(int, unsigned long*, v8::internal::Isolate*) (/usr/local/bin/node+0x12f87b8)
    #11 0x282ada8 in Builtins_CEntry_Return1_DontSaveFPRegs_ArgvOnStack_BuiltinExit (/usr/local/bin/node+0x282ada8)
    #12 0xffffa3333114  (<unknown module>)
    #13 0xffffa33394a8  (<unknown module>)
    #14 0xffffa3203a98  (<unknown module>)
    #15 0xffffa3336ba4  (<unknown module>)
    #16 0xffffa33326f8  (<unknown module>)
    #17 0xffffa33320d4  (<unknown module>)
    #18 0xffffa3333864  (<unknown module>)
    #19 0xffffa3334914  (<unknown module>)
    #20 0xffffa3337fb4  (<unknown module>)
    #21 0xffffa3334cb4  (<unknown module>)
    #22 0xffffa3334f2c  (<unknown module>)
    #23 0xffffa333512c  (<unknown module>)
    #24 0xffffa35ca540  (<unknown module>)
    #25 0xffffa33354ec  (<unknown module>)
    #26 0xffffa3399d60  (<unknown module>)
    #27 0xffffa338d224  (<unknown module>)
    #28 0xffffa3522978  (<unknown module>)
    #29 0xffffa35277a0  (<unknown module>)
previously allocated by thread T0 here:
    #0 0xa7787c in realloc (/usr/local/bin/node+0xa7787c)
    #1 0xd76c08 in node::mem::NgLibMemoryManager<node::http2::Http2Session, nghttp2_mem>::ReallocImpl(void*, unsigned long, void*) (/usr/local/bin/node+0xd76c08)
    #2 0x30aae38 in nghttp2_session_open_stream (/usr/local/bin/node+0x30aae38)
    #3 0x30ae228 in nghttp2_session_mem_send_internal (/usr/local/bin/node+0x30ae228)
    #4 0x30ad060 in nghttp2_session_mem_send (/usr/local/bin/node+0x30ad060)
    #5 0xd43ca4 in node::http2::Http2Session::SendPendingData() (/usr/local/bin/node+0xd43ca4)
    #6 0xd46140 in node::http2::Http2Session::ConsumeHTTP2Data() (/usr/local/bin/node+0xd46140)
    #7 0xd4f4d8 in node::http2::Http2Session::OnStreamRead(long, uv_buf_t const&) (/usr/local/bin/node+0xd4f4d8)
    #8 0xf74bdc in node::LibuvStreamWrap::OnUvRead(long, uv_buf_t const*) (/usr/local/bin/node+0xf74bdc)
    #9 0xf76140 in node::LibuvStreamWrap::ReadStart()::$_1::__invoke(uv_stream_s*, long, uv_buf_t const*) (/usr/local/bin/node+0xf76140)
    #10 0x279a5c4 in uv__read /node-v16.18.0/out/../deps/uv/src/unix/stream.c:1247:7
    #11 0x279a5c4 in uv__stream_io /node-v16.18.0/out/../deps/uv/src/unix/stream.c:1315:5
    #12 0x27aaac4 in uv__io_poll /node-v16.18.0/out/../deps/uv/src/unix/epoll.c:374:11
    #13 0x277c788 in uv_run /node-v16.18.0/out/../deps/uv/src/unix/core.c:389:5
    #14 0xaa8b48 in node::SpinEventLoop(node::Environment*) (/usr/local/bin/node+0xaa8b48)
    #15 0xda2ac0 in node::NodeMainInstance::Run(int*, node::Environment*) (/usr/local/bin/node+0xda2ac0)
    #16 0xda2014 in node::NodeMainInstance::Run() (/usr/local/bin/node+0xda2014)
    #17 0xc4555c in node::Start(int, char**) (/usr/local/bin/node+0xc4555c)
    #18 0xffffb5217e0c in __libc_start_main (/lib/aarch64-linux-gnu/libc.so.6+0x20e0c)
    #19 0xa0a6e8 in _start (/usr/local/bin/node+0xa0a6e8)
SUMMARY: AddressSanitizer: heap-use-after-free (/usr/local/bin/node+0x30ac3ac) in nghttp2_session_close_stream
Shadow bytes around the buggy address:
  0x200ff0aa7320: fd fd fd fd fd fd fd fd fa fa fa fa fa fa fa fa
  0x200ff0aa7330: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
  0x200ff0aa7340: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
  0x200ff0aa7350: fa fa fa fa fa fa fa fa fd fd fd fd fd fd fd fd
  0x200ff0aa7360: fd fd fd fd fd fd fd fd fd fd fd fd fd fd fd fd
=>0x200ff0aa7370: fd fd fd[fd]fa fa fa fa fa fa fa fa fa fa fa fa
  0x200ff0aa7380: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
  0x200ff0aa7390: 00 00 00 00 00 00 00 00 00 00 00 00 fa fa fa fa
  0x200ff0aa73a0: fa fa fa fa fa fa fa fa fd fd fd fd fd fd fd fd
  0x200ff0aa73b0: fd fd fd fd fd fd fd fd fd fd fd fd fd fd fd fd
  0x200ff0aa73c0: fd fd fd fd fd fd fd fd fa fa fa fa fa fa fa fa
Shadow byte legend (one shadow byte represents 8 application bytes):
  Addressable:           00
  Partially addressable: 01 02 03 04 05 06 07 
  Heap left redzone:       fa
  Freed heap region:       fd
  Stack left redzone:      f1
  Stack mid redzone:       f2
  Stack right redzone:     f3
  Stack after return:      f5
  Stack use after scope:   f8
  Global redzone:          f9
  Global init order:       f6
  Poisoned by user:        f7
  Container overflow:      fc
  Array cookie:            ac
  Intra object redzone:    bb
  ASan internal:           fe
  Left alloca redzone:     ca
  Right alloca redzone:    cb
  Shadow gap:              cc
==1==ABORTING
Matt-Esch commented 1 year ago

@mcollina @kumarak bumping this - would be good to get some eyes on this, let me know if you need anything else.

santigimeno commented 1 year ago

@Matt-Esch any possibility of having code that reproduces the issue?

Matt-Esch commented 1 year ago

@santigimeno I can only reproduce this on a large proprietary codebase, I've not been able to isolate this to a simple repro unfortunately. I do think the stack traces give us a lot of context. My feeling is that the stream should have been removed from the session in the first close call, the second call to close (on incoming RST) should not have been possible. So either the close callback is re-entrant (nghttp2 says explicitly not to do that) or the stream is not removed from that session, which would raise the question why is the session not removed on RST even if the underlying data is freed.

Matt-Esch commented 1 year ago

Looking over the nghttp2_session_close_stream code, there are a few places where stream-related memory can be freed but the stream not removed from the list, for example https://github.com/nodejs/node/blob/v16.18.0/deps/nghttp2/lib/nghttp2_session.c#L1210

I am going to instrument these points to see if they are the root cause.

Matt-Esch commented 1 year ago

Logs from nghttp2 when segfault occurs on stream 65:

send: available window=16384
send: padlen = 0, nothing to do
stream: stream=65 obq resched cycle=19232
send: next frame: DATA
send: start transmitting frame type=0, length=9
stream: stream=65 detach item=0xffff7b76e288
stream: remove stream 65 from stream 0
stream: adjusting kept idle streams num_idle_streams=0, max=100
send: end transmission of a frame
send: reset nghttp2_active_outbound_item
send: aob->item = 0xffff7b76e288

Looks like the transmission of a frame completes after the stream has already been closed.