nodejs / node

Node.js JavaScript runtime ✨🐢🚀✨
https://nodejs.org

Reusing HTTP connections leads to destroy never being triggered #19859

Closed winguse closed 6 years ago

winguse commented 6 years ago

poc.js:

const http = require('http');

const agent = new http.Agent({
    // keep-alive agent: sockets are pooled and reused across requests
    keepAlive: true
});

function get(id) {
    http.get({
        host: 'www.google.com',
        path: '/generate_204',
        agent
    }, () => {
        console.log('get response #' + id);
    });
}

for (let i = 0; i < 5; i++) {
    get(i);
}

run it with why-is-node-running:

$ why-is-node-running poc.js
probing program /root/poc.js
kill -SIGUSR1 32671 for logging
get response #0
get response #4
get response #3
get response #1
get response #2
There are 11 handle(s) keeping the process running

# SIGNALWRAP
/usr/lib/node_modules/why-is-node-running/include.js:3 - process.on('SIGUSR1', function() { why() })

# TCPWRAP
/root/poc.js:8 - http.get({

# TCPWRAP
/root/poc.js:8 - http.get({

# TCPWRAP
/root/poc.js:8 - http.get({

# TCPWRAP
/root/poc.js:8 - http.get({

# TCPWRAP
/root/poc.js:8 - http.get({

# HTTPPARSER
(unknown stack trace)

# HTTPPARSER
(unknown stack trace)

# HTTPPARSER
(unknown stack trace)

# HTTPPARSER
(unknown stack trace)

# HTTPPARSER
(unknown stack trace)

If there are requests to the same endpoint, the socket is never closed and those destroy hooks never get called.
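Side note (not part of the original poc): the keep-alive sockets can be closed explicitly by destroying the agent once all responses are in, which releases the handles and lets the process exit. A minimal sketch:

const http = require('http');

const agent = new http.Agent({ keepAlive: true });

let pending = 5;
for (let i = 0; i < 5; i++) {
  http.get({ host: 'www.google.com', path: '/generate_204', agent }, (res) => {
    res.resume(); // consume the body so the socket is returned to the agent's pool
    res.on('end', () => {
      if (--pending === 0) {
        agent.destroy(); // close all pooled keep-alive sockets; the process can now exit
      }
    });
  });
}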


old description

Hi team,

Thanks for the awesome job!

I am using async_hooks in production, but I've found a memory issue that is possibly related to async_hooks.

For debugging, I created a counter variable: when the init hook is called it is increased by 1, and when the destroy hook is called it is decreased by 1.
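A minimal sketch of those hooks (the names are hypothetical, just to illustrate the counter):

const asyncHooks = require('async_hooks');

let counter = 0;

asyncHooks.createHook({
  init() { counter += 1; },    // a new async resource was created
  destroy() { counter -= 1; }  // its destroy hook fired
}).enable();

// report the counter periodically (e.g. push it to a metrics backend)
setInterval(() => console.log('outstanding async resources:', counter), 60000).unref();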

And I got the following chart for the past 1.5 days (2018-04-05T13:06:12.724Z to 2018-04-06T22:46:53.753Z)

[screenshot: 2018-04-07 at 7:55:01 AM]

Checking with pm2 info, the points where the chart above drops to 0 are the times when pm2 restarted the application.

And here is the memory usage:

[screenshot: 2018-04-07 at 8:15:35 AM]

For comparison, this is what the memory looked like one week before:

[screenshot: 2018-04-07 at 8:16:32 AM]

Our application is a web service running on top of restify.

The memory consumption may be higher than you would expect; that's because I am holding some resources after init is called, and releasing those resources requires destroy to be called.

Any ideas?

Trott commented 6 years ago

@nodejs/async_hooks

ofrobots commented 6 years ago

For debugging, I created a counter variable: when the init hook is called it is increased by 1, and when the destroy hook is called it is decreased by 1.

If you have a leaking resource in the application, one would observe the above counter incrementing. This doesn't necessarily indicate a leak in async_hooks; it is more likely in the application itself.

For comparison, this is what the memory looked like one week before:

You didn't specify how the code was different a week before. I am guessing that you didn't have async_hooks enabled then. If that's the case, my main suspicion would be the hook functions you're installing. Are you storing the resource objects in an array/map/set, etc.? That's an easy way to write a memory leak. Keeping a reference to the resources will prevent them from being garbage collected, which in turn prevents the destroy hook from being called.
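The anti-pattern described here looks roughly like this (a sketch; the map and hook code are hypothetical):

const asyncHooks = require('async_hooks');

const retained = new Map(); // asyncId -> resource object (strong reference!)

asyncHooks.createHook({
  init(asyncId, type, triggerAsyncId, resource) {
    // holding on to `resource` itself keeps it from being garbage collected
    retained.set(asyncId, resource);
  },
  destroy(asyncId) {
    // for resources whose destroy is tied to GC this may never run,
    // so the map only ever grows
    retained.delete(asyncId);
  }
}).enable();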

winguse commented 6 years ago

Thanks for the reply! I do use a map to keep an object inside the hooks:

But destroy does not seem to be called as I expected. What do you suggest I do?

winguse commented 6 years ago

BTW, I did not touch the resource object passed to the init function. You may check here: https://github.com/Tubitv/envoy-node/blob/dbc512fa1df73203a75cd0cc03b2b251fa597da8/src/envoy-context-store.ts#L55

winguse commented 6 years ago

Closed by mistake, sorry

winguse commented 6 years ago

@ofrobots is there any possibility that code in our application (code running under restify, not under createHook) has memory leak issues that stop async_hooks' destroy from being called?

AndreasMadsen commented 6 years ago

@winguse I think that is the most likely cause. You could try running https://github.com/mafintosh/why-is-node-running to see what isn't being destroyed.

winguse commented 6 years ago

Thanks @AndreasMadsen, I can see it's a problem in our code. It looks like our conversion from generator code to promises has some problems. The reason we didn't see them before may be that they weren't wasting much memory.

winguse commented 6 years ago

Digging into the details, I find that it's not the generator problem -- those are just destroyed a little bit late.

And as I collect data from our production, over 99.9% of the stacks that are never destroyed are this one:

TCPWRAP

    at Agent.addRequest (_http_agent.js:178:22)
    at new ClientRequest (_http_client.js:258:16)
    at Object.request (http.js:38:10)
    at Request.start (.../node_modules/request/request.js:748:32)
    at Request.end (.../node_modules/request/request.js:1512:10)
    at end (.../node_modules/request/request.js:561:14)
    at Immediate. (.../node_modules/request/request.js:575:7)
    at runCallback (timers.js:789:20)
    at tryOnImmediate (timers.js:751:5)
    at processImmediate [as _immediateCallback] (timers.js:722:5)

At the top of the stack, I see this code: socket._handle.asyncReset();

    if (socket._handle && typeof socket._handle.asyncReset === 'function') {
      // Assign the handle a new asyncId and run any init() hooks.
      socket._handle.asyncReset();
      socket[async_id_symbol] = socket._handle.getAsyncId();
    }

I have no idea about it, can anyone help?

winguse commented 6 years ago

I've found the root cause of this now: it's because we are using request({ forever: true }) to reuse the HTTP connection.

And request will use:

new http.Agent({
  keepAlive: true
});

I've updated the title and description as above.

AndreasMadsen commented 6 years ago

@winguse Do you still think this is an issue with node.js? If so, could you try writing up a test case?

winguse commented 6 years ago

@AndreasMadsen It looks like a Node.js issue to me; check the description above (I changed the title and the issue description).

I am not sure if this can be solved inside Node.js, as this can lead to multiple execution blocks sharing the same TCP connection resource.

rochdev commented 6 years ago

We have the same issue. I let code similar to the one in the OP run for about an hour and ended up with 2.5 million more inits than destroys, even after several garbage collections. This only happens when using new http.Agent({ keepAlive: true }), regardless of the maxSockets and maxFreeSockets settings.

Here is my own test code:

'use strict'

const ah = require('async_hooks')
const http = require('http')
const fs = require('fs')

const DEBUG = false
const ITERATIONS = 5000
const CONCURRENCY = 1
const SOCKETS = 1
const KEEP_ALIVE = true

let count = 0

// write synchronously to stdout: console.log is asynchronous and would itself
// trigger async_hooks callbacks if used inside the hooks below
const log = message => fs.writeSync(1, `${message}\n`)
const debug = message => DEBUG && log(message)

const hook = ah.createHook({
  init (asyncId, type) {
    count++
    debug(`init:    ${asyncId} (${type})`)
  },
  before (asyncId) {
    debug(`before:  ${asyncId}`)
  },
  after (asyncId) {
    debug(`after:   ${asyncId}`)
  },
  destroy (asyncId) {
    count--
    debug(`destroy: ${asyncId}`)
  }
})

hook.enable()

const agent = new http.Agent({
  keepAlive: KEEP_ALIVE,
  maxSockets: SOCKETS,
  maxFreeSockets: SOCKETS
})

let finished = 0

for (let i = 0; i < CONCURRENCY; i++) {
  operation(0)
}

function operation (index) {
  const req = http.request({ port: 1234, agent }, res => {
    res.on('data', () => {})
    res.on('end', () => {
      if (index < ITERATIONS) {
        setImmediate(() => operation(index + 1))
      } else {
        finished++

        if (finished === CONCURRENCY) {
          log(`total: ${count}`)
        }
      }
    })
  })

  req.on('error', () => {})

  req.end()
}

Something interesting that I've noticed: if you set CONCURRENCY higher than SOCKETS in the above example, it seems to work properly -- for example, the count goes down from 5011 to 11. My theory is that a socket may need to be destroyed to trigger all of the past requests' destroys. If the volume is high enough that this never happens, then destroy would never be called for any request ever made on that socket.

@AndreasMadsen Is this the expected behavior?

Other things I've tried to no avail:

rochdev commented 6 years ago

Just tried to destroy the socket and it doesn't seem to have any effect, so I don't think my above theory is valid after all.

rochdev commented 6 years ago

As pointed out by @winguse, this seems to affect HTTPPARSER and TCPWRAP specifically. Would it be safe to assume before and after are only called once for those? Then, as a workaround, it would be possible to clean up any resource created in init in after instead of destroy for these types specifically.
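A sketch of that idea (hypothetical hook code; as the following comments explain, this turns out not to be safe):

const asyncHooks = require('async_hooks');

const store = new Map();           // asyncId -> per-request data
const types = new Map();           // asyncId -> resource type, so after() can check it
const REUSED_TYPES = new Set(['HTTPPARSER', 'TCPWRAP']);

asyncHooks.createHook({
  init(asyncId, type) {
    types.set(asyncId, type);
    store.set(asyncId, {/* per-request data */});
  },
  after(asyncId) {
    // proposed workaround: treat `after` as end of life for these two types
    if (REUSED_TYPES.has(types.get(asyncId))) {
      store.delete(asyncId);
      types.delete(asyncId);
    }
  },
  destroy(asyncId) {
    store.delete(asyncId);
    types.delete(asyncId);
  }
}).enable();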

rochdev commented 6 years ago

My last point is not an option, unfortunately. From the docs:

before is called just before the resource's callback is called. It can be called 0-N times for handles (e.g. TCPWrap), and will be called exactly 1 time for requests (e.g. FSReqWrap).

That means that there is no way to know when to deallocate objects that were allocated in init without destroy being called. This is thus definitely a bug.

basti1302 commented 6 years ago

Hi folks,

I did a bit of digging and I think there are at least two different leaks at play for HTTP client requests. Here are two more simple cases that reproduce this problem:

In both cases you can prove that, per request, there is exactly one call to the init hook without a matching destroy. This in itself is not a leak, but as soon as you store anything in a map in init using the async_id as the key and remove it in destroy, this bug turns into a memory leak. The reproduction cases issue a number of HTTP requests (say 5000), and afterwards the number of items in leakyMap always equals the number of issued requests (plus 2 or 3 that are probably unrelated to the issue) -- that is, the number of items stays at 5002 forever. These entries never get cleaned up because destroy is never called for their async_ids. The first reproducer only uses the Node.js core http module; the second uses request.
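The hook side of the reproducers looks roughly like this (a sketch; leakyMap is the name used above, the rest is hypothetical):

const asyncHooks = require('async_hooks');

const leakyMap = new Map();

asyncHooks.createHook({
  init(asyncId, type) {
    leakyMap.set(asyncId, { type, createdAt: Date.now() });
  },
  destroy(asyncId) {
    // never called for the leaked HTTPPARSER / TCPWRAP ids,
    // so those entries stay in the map forever
    leakyMap.delete(asyncId);
  }
}).enable();

// ... issue, say, 5000 keep-alive HTTP requests here ...

// afterwards leakyMap.size stays at roughly the number of issued requests
setInterval(() => console.log('leakyMap size:', leakyMap.size), 10000).unref();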

I added a bit of debugging code to the reproducers to be able to see where the async_ids that aren't destroyed are created and which type of resource they belong to. Here's what I have found:

First Case: Plain http Module

resource: HTTPParser {0: ƒ, 1: ƒ, 2: ƒ, 3: ƒ, 4: null, _headers: Array(0), _url: "", _consumed: false, socket: null, incoming: null, …}
type: "HTTPPARSER"
Stack Trace:
  Error
    at captureStackTrace (/Users/bastian/instana/code/async_hooks-leak-reproducer/stacktrace_util.js:8:9)
    at AsyncHook.init (/Users/bastian/instana/code/async_hooks-leak-reproducer/hooks.js:9:21)
    at HTTPParser.emitInitNative (internal/async_hooks.js:137:43)
    at FreeList.parsersCb [as ctor] (_http_common.js:152:18)
    at FreeList.alloc (internal/freelist.js:14:17)
    at tickOnSocket (_http_client.js:625:24)
    at onSocketNT (_http_client.js:693:5)
    at process._tickCallback (internal/process/next_tick.js:63:19)

Second Case: Using the request/request npm Module

resource: TCP {reading: true, onread: ƒ, onconnection: null, Symbol(owner): Socket, Symbol(handle_onclose): ƒ}
type: "TCPWRAP"
Stack Trace:
  Error
    at captureStackTrace (/Users/bastian/instana/code/async_hooks-leak-reproducer/stacktrace_util.js:8:9)
    at AsyncHook.init (/Users/bastian/instana/code/async_hooks-leak-reproducer/hooks.js:9:21)
    at TCP.emitInitNative (internal/async_hooks.js:137:43)
    at Agent.addRequest (_http_agent.js:171:22)
    at new ClientRequest (_http_client.js:248:16)
    at Object.request (http.js:41:10)
    at Request.start (/Users/bastian/instana/code/async_hooks-leak-reproducer/node_modules/request/request.js:751:32)
    at Request.end (/Users/bastian/instana/code/async_hooks-leak-reproducer/node_modules/request/request.js:1511:10)
    at end (/Users/bastian/instana/code/async_hooks-leak-reproducer/node_modules/request/request.js:564:14)
    at Immediate._onImmediate (/Users/bastian/instana/code/async_hooks-leak-reproducer/node_modules/request/request.js:578:7)
    at runCallback (timers.js:694:18)
    at tryOnImmediate (timers.js:665:5)
    at processImmediate (timers.js:647:5)

This second case matches exactly what @winguse already described in https://github.com/nodejs/node/issues/19859#issuecomment-380484753. I think the asyncReset call generates a new asyncId for the socket, attaches that new asyncId to the socket, and emits an init event for it. However, it never emits destroy for the asyncId that was previously attached to the socket.
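This effect can also be observed from user land by counting TCPWRAP init and destroy events around a single reused keep-alive socket. The following is only a sketch (sequential requests against the same endpoint as in the poc above, hypothetical counter names); on Node.js versions without a fix, and with a server that honors keep-alive, one would expect far more inits than destroys:

const asyncHooks = require('async_hooks');
const http = require('http');
const fs = require('fs');

let inits = 0;
let destroys = 0;
const tcpIds = new Set();

asyncHooks.createHook({
  init(asyncId, type) {
    if (type === 'TCPWRAP') {
      inits++;              // fires when the socket is created and again on every asyncReset()
      tcpIds.add(asyncId);
    }
  },
  destroy(asyncId) {
    if (tcpIds.has(asyncId)) destroys++;
  }
}).enable();

const agent = new http.Agent({ keepAlive: true, maxSockets: 1 });

let remaining = 10;
(function next() {
  // sequential requests so the single keep-alive socket is taken from the
  // free pool (triggering asyncReset) for every request after the first
  http.get({ host: 'www.google.com', path: '/generate_204', agent }, (res) => {
    res.resume();
    res.on('end', () => {
      if (--remaining > 0) return setImmediate(next);
      agent.destroy(); // close the pooled socket so the process can exit
      setTimeout(() => fs.writeSync(1, `TCPWRAP init: ${inits}, destroy: ${destroys}\n`), 500);
    });
  });
})();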

Both cases were run with Node.js v10.11.0 on macOS.

Conclusion

From the debugging output above, my guess is that there are

  1. an issue with reusing HTTP parser instances, and
  2. an issue with reusing TCP sockets when a keepAlive agent is used.

I haven't yet been able to reproduce the second problem with just the Node.js core http module, but since the request lib uses http internally and does not concern itself with async_hooks, it should be possible to do so.

I'm pretty close to having a fix for the second/TCPWRAP issue and hope to submit a patch for that soon. I'll also have a look into the HTTP parser problem.

Cheers

Bastian Krol (Instana)

basti1302 commented 6 years ago

~I created a fix for the missing destroy for TCPWRAP in the case where an http agent is reused: https://github.com/nodejs/node/pull/23201.~

basti1302 commented 6 years ago

~A fix for the second issue (around HTTP_PARSER) is here: https://github.com/nodejs/node/pull/23263~

basti1302 commented 6 years ago

The two previous PRs have been merged into a single PR which fixes both issues: #23272

winguse commented 6 years ago

Thanks everyone!