naugtur / blocked-at

Detects node eventloop block and reports where it started
MIT License
319 stars 25 forks source link

blocked-at is unusable on express web servers - spikes CPU to 100% #36

Open powersjcb opened 7 months ago

powersjcb commented 7 months ago

I did some independent testing and confirmed we are having the same performance issues as described in this post. Our CPU utilization spiked to 100% and the node fell over when I tried to use this to find blocking stack traces from a dev envionrment.

https://www.ashbyhq.com/blog/engineering/detecting-event-loop-blockers

I was running node:18.18.1-alpine in k8s with express 4.17.3

naugtur commented 7 months ago

Yes, async hooks are expensive. The more promises.oer request the worse. This tool is not fit for running in prod.

I think I could improve it's performance a lot tho. Care to share a minimal reproduction case?

powersjcb commented 7 months ago

Sorry, I don't have any public repos I'm testing with.

The configuration I tested looks similar to this:

const blocked = require('blocked-at')
const express = require('express')

const datadogLib = require('./datadog');

const app = express()
const port = 3000

blocked((delay, stack, resource) => {
  datadogLib.histogram('node.eventloop.blocked', delay);
  logger.warn('blocked event loop', { delay, stack, resource });
}, { threshhold: 100, });

app.get('/', (req, res) => {
  // todo: make API requests
  res.send('Hello World!')
})

app.listen(port, () => {
  console.log(`Example app listening on port ${port}`)
})

I was seeing baseline CPU utilization increase from 3 mcores to 400 mcores. (memory remained about the same)

naugtur commented 7 months ago

Datadog might be key here. They do a great deal of monkey patching and use hooks too. Might be clashing. If you could test without datadog for me that would be helpful.

powersjcb commented 7 months ago

Yeah, that seems sensible. I will give that a shot.

I'm seeing that many DD lines are present in our stack traces so that wouldn't surprise me.

  "parsed": {
    "_timestamp": "2024-03-25T19:34:29.029Z",
    "blockedMs": 102.66803601074218,
    "level": "error",
    "message": "node event loop blocked",
    "resource": {
      "type": "HTTPINCOMINGMESSAGE"
    },
    "stack": [
      "    at AsyncHook.init (/app/node_modules/blocked-at/index.js:31:11)",
      "    at HTTPServerAsyncResource.emitInitNative (node:internal/async_hooks:202:43)",
      "    at connectionListenerInternal (node:_http_server:645:10)",
      "    at defaultTriggerAsyncIdScope (node:internal/async_hooks:463:18)",
      "    at Server.connectionListener (node:_http_server:618:3)",
      "    at Server.emit (node:events:529:35)",
      "    at Server.emit (/app/node_modules/dd-trace/packages/datadog-instrumentations/src/http/server.js:74:17)",
      "    at TCP.onconnection (node:net:2104:8)",
      "    at TCP.callbackTrampoline (node:internal/async_hooks:130:17)"
    ]
    "level": "error",
    "message": "node event loop blocked",
    "resource": {
      "type": "PROMISE"
    },
    "stack": [
      "    at AsyncHook.init (/app/node_modules/blocked-at/index.js:31:11)",
      "    at emitInitNative (node:internal/async_hooks:202:43)",
      "    at emitInitScript (node:internal/async_hooks:504:3)",
      "    at promiseInitHook (node:internal/async_hooks:324:3)",
      "    at promiseInitHookWithDestroyTracking (node:internal/async_hooks:328:3)",
... business logic redacted
      "    at /app/node_modules/dd-trace/packages/datadog-instrumentations/src/router.js:50:25",
      "    at /app/node_modules/dd-trace/packages/datadog-shimmer/src/shimmer.js:26:21",
      "    at Layer.handle [as handle_request] (/app/node_modules/express/lib/router/layer.js:95:5)",
      "    at next (/app/node_modules/express/lib/router/route.js:137:13)",
      "    at /app/node_modules/dd-trace/packages/datadog-instrumentations/src/router.js:100:12",
      "    at /app/node_modules/express-validation/lib/index.js:31:18",
      "    at processTicksAndRejections (node:internal/process/task_queues:95:5)"
    ]
powersjcb commented 7 months ago

My intuition is that we are probably seeing the performance issues because of this line.

https://github.com/naugtur/blocked-at/blob/master/index.js#L36C1-L37C1

Where we invoke a fairly expensive function to capture and store a stack trace on every event loop cycle. (this agrees with what this article mentions)

image

Do we know if there is a way to capture stack trace information only if the processing is too slow?

(testing a fork now to see if the stack collection is the problem)

naugtur commented 7 months ago

most likely problem is async hooks. they slow things down. delete hook is the most expensive and I know I can probably get rid of it by making things clean up after themselves on report or using a weakmap. But looking forward to your findings.