nodejs / node

Node.js JavaScript runtime ✨🐢🚀✨
https://nodejs.org
Other
107.39k stars 29.51k forks source link

Crash in wrap_js_stream.js during doClose #27258

Open kotsss opened 5 years ago

kotsss commented 5 years ago

I've debated whether this belongs in node/help, but due to potential of being a node bug I've decided to post here (apologize in advance if it turns out otherwise).

We have a crash happening on our production servers a few times a day (especially during high load times):

TypeError: Cannot read property 'finishWrite' of null
    at JSStreamWrap.finishWrite (internal/wrap_js_stream.js:195:12)
    at Immediate.setImmediate (internal/wrap_js_stream.js:218:12)
    at runCallback (timers.js:705:18)
    at tryOnImmediate (timers.js:676:5)
    at processImmediate (timers.js:658:5)

I was unable to see any possible code path that could cause this._handle to be null before the call to doClose as it's only triggered by calling this._handle.close(). Async paths printing would have come very useful here, however, this is node 10.x.

I've done grep on all our code as well as all node-modules for anything that could relate to 'games' with "_handle = " or doClose, nothing pop'ed up.

I've put DEBUG_NODE=net,stream_wrap on one of the servers to avoid overall network degradation (waiting for crash to re-occur).

Note: server is a large traffic networking proxy application involving http/s/2.

I would like to give more info (specifically setImmediate async callstack), however, I'm unsure how to continue, would appreciate any advice.

addaleax commented 5 years ago

How easy is this to reproduce for you? Can you modify the Node.js source code? It might be helpful to add an assertion in doClose() that checks that this._handle is still alive at that point, so that the crash is observable more easily?

kotsss commented 5 years ago

Putting a modified Node.js compilation on the server and waiting to reproduce was something we raised as an option but were hopping to for it to be plan B. Plan A was to find way (using this post) to alter the existing version:

But if you're saying that compiling is the way then I'll get working on that.

kotsss commented 5 years ago

Ok, I've got a modified node running and got new stacktrace:

AssertionError [ERR_ASSERTION]: Handle null - very bad
Error
    at JSStreamWrap.doClose (internal/wrap_js_stream.js:214:9) // assert(_this.handle)
    at JSStreamWrap.handle.close (internal/wrap_js_stream.js:39:12)
    at TLSWrap.close (_tls_wrap.js:395:23)
    at TLSSocket.Socket._destroy (net.js:595:18)
    at TLSSocket._destroy (/var/agent/util/net.js:379:37) // net.Socket.prototype hook for stats
    at TLSSocket.destroy (internal/streams/destroy.js:32:8)
    at JSStreamWrap.TLSSocket.wrap.once (_tls_wrap.js:317:35)
    at Object.onceWrapper (events.js:277:13)
    at JSStreamWrap.emit (events.js:189:13)
    at _handle.close (net.js:597:12)
    at Immediate.setImmediate (internal/wrap_js_stream.js:231:7) // cb();
    at runCallback (timers.js:705:18)
    at tryOnImmediate (timers.js:676:5)
    at processImmediate (timers.js:658:5)
Async
    at JSStreamWrap.doClose (internal/wrap_js_stream.js:212:30) // save first doClose stack on 'this'
    at JSStreamWrap.handle.close (internal/wrap_js_stream.js:39:12)
    at JSStreamWrap.Socket._destroy (net.js:595:18)
    at JSStreamWrap._destroy (/var/agent/util/net.js:379:37) // net.Socket.prototype hook for stats
    at JSStreamWrap.destroy (internal/streams/destroy.js:32:8)
    at Duplex.JSStreamWrap.stream.once (internal/wrap_js_stream.js:76:12)
    at Object.onceWrapper (events.js:277:13)
    at Duplex.emit (events.js:189:13)
    at emitCloseNT (internal/streams/destroy.js:59:8)
    at process._tickCallback (internal/process/next_tick.js:63:19)

So TLSWrap._parent points to Socket._handle that shows how the invalid call is possible, but this doesn't reproduce to a million other tls sockets used by the app.

Sounds like a race condition of some sorts, but it's beyond my code understanding to figure it out.

Edit: increased Error.stackTraceLimit and got full stack Edit2: add async jump to get more stack Edit3: managed to catch a non-crashing doClose stack for reference

STREAM_WRAP 13998: doClose stack
Error
    at JSStreamWrap.doClose (internal/wrap_js_stream.js:212:30)
    at JSStreamWrap.handle.close (internal/wrap_js_stream.js:39:12)
    at JSStreamWrap.Socket._destroy (net.js:595:18)
    at JSStreamWrap._destroy (/var/agent/util/net.js:379:37)
    at JSStreamWrap.destroy (internal/streams/destroy.js:32:8)
    at TLSWrap.close (_tls_wrap.js:393:29)
    at TLSSocket.Socket._destroy (net.js:595:18)
    at TLSSocket._destroy (/var/agent/util/net.js:379:37)
    at TLSSocket.destroy (internal/streams/destroy.js:32:8)
    at TLSSocket.onConnectEnd (_tls_wrap.js:1095:10)
    at Object.onceWrapper (events.js:277:13)
    at TLSSocket.emit (events.js:194:15)
    at endReadableNT (_stream_readable.js:1125:12)
    at process._tickCallback (internal/process/next_tick.js:63:19)
kotsss commented 5 years ago

We've been able to narrow the crash down to HTTP2 client sockets by correlating http2 usage and crash frequency. We're trying to reproduce issue in our tests, no luck yet. I'll be happy to run patches on the code to provide more information.

Will appreciate if someone can take a look at the stacktraces I provided and/or give more debugging instructions.

gireeshpunathil commented 4 years ago

looks like difference between good and bad case is the former is handling end while the later, close event, but definitely not enough to make any conclusions.

cc @nodejs/streams

omgaz commented 4 years ago

Since upgrading to Node 10 from Node 8 we've been seeing the following stack appear. I'm having trouble tracking down what's causing it - but could it be related to this issue?

{
  "date": "Thu Feb 06 2020 12:05:47 GMT+1100 (Australian Eastern Daylight Time)",
  "process": {
    "pid": 339,
    "uid": 496,
    "gid": 494,
    "cwd": "/var/app/current",
    "execPath": "/node-install/node-v10.18.1-linux-x64/bin/node",
    "version": "v10.18.1",
    "argv": [
      "/node-install/node-v10.18.1-linux-x64/bin/node",
      "/var/app/current/packages/app/lib/server.js"
    ],
    "memoryUsage": {
      "rss": 217214976,
      "heapTotal": 111587328,
      "heapUsed": 85340080,
      "external": 62857751
    }
  },
  "os": {
    "loadavg": [
      0,
      0,
      0
    ],
    "uptime": 89121
  },
  "trace": [
    {
      "column": 23,
      "file": "_tls_wrap.js",
      "function": "TLSWrap.close",
      "line": 395,
      "method": "close",
      "native": false
    },
    {
      "column": 18,
      "file": "net.js",
      "function": "TLSSocket.Socket._destroy",
      "line": 605,
      "method": "_destroy",
      "native": false
    },
    {
      "column": 8,
      "file": "internal/streams/destroy.js",
      "function": "TLSSocket.destroy",
      "line": 37,
      "method": "destroy",
      "native": false
    },
    {
      "column": 20,
      "file": "events.js",
      "function": "Object.onceWrapper",
      "line": 286,
      "method": "onceWrapper",
      "native": false
    },
    {
      "column": 13,
      "file": "events.js",
      "function": "TLSSocket.emit",
      "line": 198,
      "method": "emit",
      "native": false
    },
    {
      "column": 23,
      "file": "domain.js",
      "function": "TLSSocket.EventEmitter.emit",
      "line": 466,
      "method": "emit",
      "native": false
    },
    {
      "column": 14,
      "file": "_stream_writable.js",
      "function": "finishMaybe",
      "line": 646,
      "method": null,
      "native": false
    },
    {
      "column": 5,
      "file": "_stream_writable.js",
      "function": "ShutdownWrap.stream._final [as callback]",
      "line": 624,
      "method": "_final [as callback]",
      "native": false
    },
    {
      "column": 8,
      "file": "net.js",
      "function": "ShutdownWrap.afterShutdown [as oncomplete]",
      "line": 374,
      "method": "afterShutdown [as oncomplete]",
      "native": false
    }
  ],
  "stack": [
    "TypeError: this._parent.close is not a function",
    "    at TLSWrap.close (_tls_wrap.js:395:23)",
    "    at TLSSocket.Socket._destroy (net.js:605:18)",
    "    at TLSSocket.destroy (internal/streams/destroy.js:37:8)",
    "    at Object.onceWrapper (events.js:286:20)",
    "    at TLSSocket.emit (events.js:198:13)",
    "    at TLSSocket.EventEmitter.emit (domain.js:466:23)",
    "    at finishMaybe (_stream_writable.js:646:14)",
    "    at ShutdownWrap.stream._final [as callback] (_stream_writable.js:624:5)",
    "    at ShutdownWrap.afterShutdown [as oncomplete] (net.js:374:8)"
  ],
  "level": "error",
  "message": "uncaughtException: this._parent.close is not a function"
}
mcollina commented 4 years ago

@omgaz have you been able to reproduce this reliably? could you post a script to reproduce?

omgaz commented 4 years ago

@omgaz have you been able to reproduce this reliably? could you post a script to reproduce?

Not yet, it's proving taxing tracking down what could potentially be trying to end this connection. I'll update when/if I can repro the script.

pimterry commented 4 years ago

I've also been seeing a lot of this, with HTTP/2 traffic on both node v10.15.1 and v12.8.3, but I've just been testing v14.6.0 and it doesn't seem to reproduce at all there.

omgaz commented 4 years ago

I haven't been able to create an isolated example for this, but believe I'm getting this when using custom certificate authorities.