proposal: JS tracing APIs

jasongin commented 7 years ago

@nodejs/diagnostics and anyone else interested, please review this proposal draft.

AndreasMadsen commented 7 years ago

Is this a follow up on "trace events" as introduced in https://github.com/nodejs/node/pull/9304?

I'm generally against using the term tracing formally, it can mean so many things. For example we already have:

LTTng (Linux), SystemTap (Linux), DTrace (OSX), ETW (Windows)
async_hooks
trace_events

all of which does some kind of tracing. If this API integrates with trace_events I think we should just call the module trace_events.

jasongin commented 7 years ago

Yes, the Background section of the proposal doc links to that "trace events" PR.

As also noted in the doc, the module name "tracing" was already reserved for this purpose long ago. I don't feel strongly about it personally, certainly we can go with whatever name is agreed by consensus.

othiym23 commented 7 years ago

While I agree that treating those three APIs as separate concerns that can be implemented separately, I think it's helpful to treat emitting and listening of trace data as duals for a single abstraction.

As someone who, like @Qard, has worked on APM products in the past, the challenge for me was that there have been a variety of strategies for emitting trace events, but no reliable way to observe that data across platforms and deployment environments (in particular, PaaS and containerization). In practical terms, it feels like it's a lot easier to hit performance and overhead targets for the emitting rather than the listening side, and coming up with an efficient in-process interface for capturing that data would be the biggest win for a whole class of use cases.

trevnorris commented 7 years ago

I'd like to see documentation about functionality this introduces that would be impossible to implement on top of async hooks (the async hooks EP is up to date).

jasongin commented 7 years ago

I will update the doc to explain how this proposal relates to async hooks. I expect tracing instrumentation of node core can leverage async hooks for tracing of async operations, so in that regard the features are complementary.

Async hooks provide a way to hook into the beginning and end of async operations built on libuv primitives. Of course those async operations are a key subset of performance-related events.

This proposal is about a way to capture and/or listen to all kinds of performance-related events, including non-async begin/end, point-in-time, and counters. Additionally and most importantly it provides a very high-performance way to write the events out to a trace log file.

targos commented 7 years ago

Is this somehow related to the browser's User Timing API? See https://developer.mozilla.org/en-US/docs/Web/API/User_Timing_API

jasongin commented 7 years ago

@targos It is not directly related, but I'm aware of it. I think it would not be a good fit to implement the User Timing API directly in node, because it only has a subset of the functionality (does not support counters, categories, or any mechanism for turning off the tracing) and it does not have the right event-listener model.

If desired, it should be straightforward to create a polyfill for the User Timing API that would use this proposed Node.js tracing API underneath.

trevnorris commented 7 years ago

@jasongin I should have clarified. I specifically meant in regards to the JS API. I'm fine placing the trace event hooks in native source. I already have for some testing and they work great. My concern is the JS API. Even if the call does nothing but return immediately there's still enough overhead to be noticeable. Especially when the function call is cross module (V8 has a harder time optimizing those for some reason).

For async hooks to incur the least amount of overhead possible in similar situations I had to do some things I'm rather not proud of (see https://git.io/vDJq9 as an example). So I'd like to avoid placing anything directly in JS. I'd like to see if the JS trace events could be optional async hooks that are enabled when those specific trace events are requested.

jasongin commented 7 years ago

The main purpose of a JS API for tracing is to enable people to instrument their JavaScript code. While instrumenting internal async handles and other internal constructs may also provide useful data, that is only a piece of the performance picture and is not the focus of this proposal.

Even if the call does nothing but return immediately there's still enough overhead to be noticeable.

Only if your definition of "noticeable" is extremely generous. When tracing is completely off, any of the tracing event-emitter functions will literally be a no-op function, the equivalent of:

tracing.emit = function () {}

Even if the JIT cannot completely optimize-out the function call, it is still going to be extremely low-cost, I think small enough to not be noticeable in any reasonable sense.

Then of course when tracing gets enabled, either by turning on recording or adding a listener for any category, the emit functions will get replaced with functions that need to do a little bit of work, but are still fast enough not to have a significant impact on most applications.

trevnorris commented 7 years ago

I spent around 1.5 week figuring out the best way to optimize for calls to timers and nextTick in my async hooks PR. The reason it was so important to improve is because node doesn't code for the average performance case. We code it for the fast case. Which means measuring against a high volume scenario.

Take a simple http server and add hooks when resources are created and also add timing around each resource's callback. In a high volume scenario, let's say a proxy (and these numbers are small compared to what F5 pushes through node), and do 45k req/sec. Just tracking the mentioned locations you're going to add ~1.5 million calls/sec. That's ~1.5 million noops in JS that have now been added. Which aren't actually noops because V8 can't completely optimize them out. At that volume I'm not being "extremely generous". I'm being practical.

Looking at the code from the link I posted you'd see that all the emitBefore()/emitAfter() calls are conditionally called. Because simply calling the noop function cost too much when I was benchmarking. Short of this is, I didn't spend a month tweaking every corner of the async hooks API to make sure it had no noticeable affect on core's performance just so we could throw in another similar API that would.

jasongin commented 7 years ago

I appreciate the passion for performance. I am working on some benchmarks involving tracing in realistic scenarios. I will follow up when I have data ready to share.

rvagg commented 7 years ago

https://github.com/brycebaril/transaction-tracer is a proposal that @brycebaril put forward a few years ago coming out the Tracing Summit held a few years ago. Might be good to connect to some of that work. @othiym23 and @trevnorris you were involved in that too weren't you? Was there any other documentation or notes that came out of that which might be useful?

othiym23 commented 7 years ago

I'm pretty sure @brycebaril's notes were the most complete; @groundwater acted as our facilitator, and may have recorded some of the outcomes, but I don't know how extensive his notes were, given that he was facilitating. @tjfontaine and @misterdjules were our hosts at the Joyent Vancouver offices and had extensive contributions to make around our efforts to come up with something that merged both APM and dynamic tracing along the lines of DTrace and ETW. I believe @Qard was also at that meeting as well, representing his APM-vendor employers of the time (sorry, Stephen, I'm blanking on what company that was temporarily).

I don't think @trevnorris made it to that particular tracing meeting, although he was certainly integral to the discussions that @groundwater, @tjfontaine, and I were having earlier that year. Trevor was at the error-handling "summit" convened by @dshaw earlier that year, at which @davepacheco was also present. As Trevor's involvement in those discussions eventually produced async_hooks and AsyncWrap, those APIs are far more concrete than any meeting notes are likely to be. ;)

I'll double-check my notebooks from the time, but my recollection is that I didn't keep detailed notes, in part because the discussion was largely focused on establishing the problem space more than solutions, in part because I'd moved on from APM at the time and was more just trying to act as a kind of expert witness.

misterdjules commented 7 years ago

FWIW, my notes from the meeting were:

General observations

People don't want to have to monkey patch node to implement tracing.
People seem to be more keen to keep the tracing facilities in node itself rather than use platform tracing facilities like dtrace/ETW/{System,K}tap.

What some APM providers do/need

New Relic:

Problems they try to solve:
- help developers make their applications faster.
- Identify individual transactions that are slow.
- Don't generate very broad info (no big call graph), but more targetted info.
- As lightweight as possible. Needs to not have performance impact on the running app.
Solution they use:
- Uses AsyncListener polyfills or core implementation + continuous local storage to tie things together.

Strongloop:

Performance tracing: find performance problems and/or bugs (exceptions, leaks of descriptors, etc.).
- Needs to know start and stop of async events.
Solution they use: monkey patch everything.
Other problems they try to solve: heap/cpu profiling.
Use or would like to use zones.
Mentioned being interested in other discrete events like GC or Node about to exit/not being able to exit, etc.

Appneta:

Goal: time profiling of the whole async flow.
Mentioned needs to trace async calls in native modules.

Concurix:

Also instrument synchronous code.
Light monkey patching.
Needs to not have performance impact on the running app.

General comments:

Tools to allow native modules to emit tracing/performance events when they perform async operations using plain libuv.

Conclusion of the "problem discussion"

Priorities:
1. Unavoidable requirements:
  - Production ready.
  - Do not monkey patch node.
  - Do not alter behavior.
2. Async tracing.
3. Track async lifecycle.
4. native module API.

Implementation

Concerns/Questions:

Keep a global state of async trees/transactions but how do we know when we can cleanup some of these long lasting state structures?

Observations:

Have a simpler solution that just bubbles up simple information and not complete async trees. AL is too high level.

Proposed API by TJ/Jacob (for a simple solution):

Terminology:
- Namespace, provider, proble. Example: node:http:{client-request,client-response,etc.}
Advantages: similar to already defined probes.
API: global someProbe = createProbe('probeName', description of info that is going to be passed). The description of data is needed to support platforms providers like dtrace. someProbe.enable() someProbe.fire(function () { return [ data that is passed to the callback of the consumer]}) .list(): returns all the probes/namespaces/etc defined in the system. .on(namespace, provider, probe, callback) with support for wildcards, etc.
Use ring buffers and other similar techniques on the consumer side to avoid performance hit.
Implementation of user side in https://github.com/joyent/node-tracing .
Namespace is useful to make sure that there is no conflict between probes, and can also allow some optimizations like filtering out what the consumer doesn't care about.

Amendments by Jacob:

API: do not need to create the probe explicitely. This is done by the API the first time a probe is emitted. The concern with that (from TJ) is that the consumer of these events often needs to know in advance about the probes that can be fired. Not knowing about existing probes in advance can make using dtrace and other platform providers more difficult (but also other consuming platforms), and as Forest pointed out it's a valuable asset for node.

Concerns

This solution is not generic enough to provide a solution for the number one priority: Async tracing of any function/callback. Answer to this concern: this API can be used as a tool to provide information about more generic stuff like handles creations, etc.
How to trace what happens in third party modules? Adding static probes to third party modules doesn't seem to be realistic for some people (e.g concurix).
How to attach the metadata that consumers are interested in? How to make sure that consumers have enough information? Monkey patching is currently the easiest way to do that.
Brute force solution for the problem of getting the metadata: emit all the data of the context at the firing site with the event. Slightly better proposition from Jacob: emit the associated data only if requested by the consumer.
Idealistic solution: being able to inspect any context within v8 when the consumer handles a probe.

Qard commented 7 years ago

The challenge that APM vendors faced with coming up with a generic API was that we all had slightly different needs for data collection. That's a big part of why I've been pursuing https://github.com/nodejs/node-eps/pull/56 as a way to apply instrumentation more abstractly.

othiym23 commented 7 years ago

Thanks for the detailed notes, @misterdjules, and (belated) apologies for tagging you into the discussion!

Trott commented 6 years ago

Closing, but by all means re-open or comment if that isn't right.

nodejs / node-eps