newrelic / node-newrelic

New Relic Node.js agent code base. Developers are welcome to create pull requests here, please see our contributing guidelines. For New Relic technical support, please go to http://support.newrelic.com.
Apache License 2.0
969 stars 399 forks source link

Reduce manual context tracking / use built-in mechanisms #580

Closed michaelgoin closed 3 years ago

michaelgoin commented 3 years ago

Hypothesis: using async-hooks across the board (especially newer Node 12/14 mechanisms) will reduce our context tracking overhead and make the code easier to understand in the future.

Unfortunately, the perf side is not something you can simply just validate as we have a lot of instrumentation tentacles. That being said, we've had folks such as fastify dev's mention we seem to add more overhead than other instrumentation they've seen in certain cases.

Is your feature request related to a problem? Please describe.

Currently, because of the age of our agent, we have a ton of built in context tracking mechanisms. In fact, we have two lightly different ways this is handled due to whether using older instrumentation or the "shim" instrumentation API.

Promises, on the other hand, leverage the async-hooks API. This API gives a flat level of overhead just subscribing to certain events. That means for some async situations (timers), we are eating that cost plus our manual tracking even if we bail out early from that in our async-hooks handlers.

Further, some tracking limitations could go away by leveraging async-hooks based mechanisms for example if a timer ref was grabbed prior to loading the agent.

This is not low-risk work but the hope is it would both be more maintainable in the long-run to new developers and also reduce overhead. If we are able to leverage the newer async resource functionality safely for newer Node versions (whether through the new async storage mechanism or manually), we may further reduce overhead but more importantly reduce risk of making customer promise leaks worse (like we can do currently). Also, in certain cases we attach state onto the objects we instrument which further introduces chance of customer code causing us to "leak" segments.

This is likely a multi-step effort.

The approach should probably look something like (but maybe not detailed enough):

  1. Introduce a new context tracking API with a single source of truth to get/set state.
  2. Update promise tracking to newer Node 12/14 mechanisms (if possible) to reduce overhead for promises.
  3. Convert rest of async tracking to new async tracking mechanisms/rip out old manual where possible.

An alternative would be to convert manual tracking to async-hooks as step 2, via the old-style, and then convert everything to newer patterns. Again, there's likely more breakdown needed here upon further exploration.

Some further exporation into the newer Async Storage mechanism is probably needed to understand if it is possible/the right thing to use or manually using executionAsyncResrouce. If it works, and is functioning well (haven't been tracking lately) it may be ideal as that helps it gain traction with another in-depth usage.

The newer patterns are only available in later versions of 12 and 14, so we'll likely need some old/new forking until Node 10 is dropped and even to handle early Node 12 (maybe? we could decide not to).

We'll want to measure impacts as we go along. There may be a hybrid state required for a bit... understanding impacts there will be important too.

Context-Tracking High-Level Notes

Reiterating caveat this might not be the ideal order but I do think we first want to migrate to a single context tracking API first.

Background Notes

There are multiple places to set and restore instrumentation from. Sometimes this is done manually outside of the core agent constructs and sometimes this is done via shim API and sometimes this is done via the tracer directly. Standardly, getting/setting is done via shim or tracer. The shim will first check on the object and then ask the tracer. The tracer, directly tracks the segment as a local member.

Due to the eventual storage being a local member of tracer, we end up forcing ourselves to manually set there via some mechanism. For async-hooks, that means we have to subscribe to every event and get/set via API. This prevents from being able to use newer mechanisms (Node 12/14) such as executionAsyncResource which has the potential to allow us to reduce overhead (going from 4 event subscriptions potentially down to 1, at minimum dropping destroy) and also avoid a class of memory issue by attaching our async context data to the underlying resource. The resource in the hooks isn't necessarily the "right" resource for the context of the user code we are trying to set state for, so the mechanism quickly breaks down. We need to be able to set and read context from the appropriate resource at the appropriate time (within instrumentation), not just from within async-hooks.

Just solving this for promises will be a win, as reducing async-hook handlers give improvements regardless of what logic may exist. And that is where we have our memory-leak increase issues.

Further, relying more heavily on async-hooks (or eventually the AsyncLocalStorage API), will solve a major class of state tracking pain. Greatly reducing the chance of additional introduce frameworks from breaking existing, functioning, async tracking. There will still state loss and state conflation issues but they may be greatly reduced. Also, moving our tracking to similar mechanisms will bring us much further along to following more current standards and being able to maintain/improve as Node does the same.

It is worth reiterating, there's a flat-overhead incurred by using any async-hook callback. As such, even though we only really use this functionality for promises, we eat the overhead for all async calls. So async state we are manually tracking incurs both the overhead of our manual tracking and the the flat overhead of the async-hook callback dispatch.

michaelgoin commented 3 years ago

Looks like this got lost in our conversion to the generic engineering board.

Our thinking on this has evolved some, as we want to try to use the new context API in Node, ensure we avoid the destroy hook, etc. I believe we already have a story in (should double-check) for prototyping potential solutions as a first-pass MMF. If not, this can be turned into that. Leaving up to go over as a team in grooming as there's a fair amount of background dumped in here.