Feature Proposal: Tracing Channel / Diagnostics Channel support

tlhunter commented 1 year ago

Motivation

Hello!

I'm writing this as a member of the Datadog Node.js Tracer team where we work on an Application Performance Monitoring (APM) product (available at DataDog/dd-trace-js if you're curious). Prior to working at DataDog I spoke at RedisConf17 and RedisConf19, with both talks making use of node-redis.

I'm sending similar requests to maintainers of libraries that are instrumented by DataDog (not to mention other APM products). The tl;dr is that I would like to make a Pull Request to node-redis and would like your go-ahead before I start development work. This Pull Request will add support for a new feature that has landed in Node.js, TracingChannel. Does this sound like a change you'd be interested in accepting?

The rest of this is just a whole bunch of background in case you have questions:

Regarding definitions, an APM is a library that users install in their applications. Some popular APM products include DataDog, Elastic, New Relic, AppSignal, and Open Telemetry. These libraries do many things, but some common examples include sending metrics and logs for incoming HTTP requests, tracking how long a database query takes, and even building a graph of relationships between distributed systems. An incoming HTTP request might encompass an outbound HTTP request, a gRPC request, write to a file on disk, and cache something in Redis. As you can imagine, keeping track of how all of these I/O operations are connected is no easy task, and that's the sort of magic that happens with an APM library.

Depending on the type of library an APM tool might need to extract metadata about a call that an application is making to a library. For example, the APM tool might want to know the HTTP method and URL for an inbound or outbound request. Sometimes additional metadata needs to be inserted as well. For example, distributed tracing headers may be added to an outbound HTTP request or to a database query. The most exciting part about all this, in my opinion, is that it can be done automatically for a user's application.

Despite the name, "automatic instrumentation" is a very manual process, done by the authors of the APM libraries by way of monkey patching. In fact, here's dd-trace's redis instrumentation code. Essentially, monkey patching works by either replacing methods on a library, or by creating a new proxy interface in front of the library. When we get enough user requests to instrument a particular library we'll go through the code, figure out what needs to be patched to support expected use-cases, make the changes, and release them in dd-trace.

Of course, the other APM products do the same thing. node-redis has probably been instrumented a dozen times by now!

Monkey patching requires things like calling require() on a deeply nested file and replacing methods that might not even be documented. Of course, this isn't part of any sort of public API for a library, so even SemVer "patch" releases of a library can lead to APM engineers scrambling to fix compatibility issues.

At this point you might be thinking that APM maintainers have a fool's errand and that compatibility must be breaking all the time. And you'd be mostly right. But that sort of leads into the proposal:

I would like to implement a change to node-redis that adds support for TracingChannel (TC), a feature that has landed in Node.js v19.9, which it itself a superset of functionally provided by Diagnostics Channel (DC), a feature that is available in Node.js v14.17. These Node.js features were created specifically with APM support in mind. It is essentially a named, global pub/sub mechanism, as if someone declared a global EventEmitter instance (similar, of course, to Redis's PUBLISH/SUBSCRIBE, though limited to a single Node.js process). Code in one end of an application can publish to a channel and code in another end can subscribe to the channel. A TC channel is essentially an opinionated collection of five different DC channels (start, end, asyncStart, asyncEnd, error) for keeping track of what a library is doing. Channels are then named after the library, such as tracing:redis.query:start.

There are several benefits to incorporating TracingChannel directly into node-redis:

Performance: A user's application will be much quicker if instrumentation happens using these channels as compared to monkey patching.
Stability: A user's application is much less likely to have compatibility issues when upgrading libraries.
Standardization: Each APM vendor can now rely on a common interface for consuming data from libraries.

The DataDog Node.js tracer team is full of engineers who care a lot about Node.js and the ecosystem. Stephen Belanger, aka @qard, has made many contributions to Node.js core, including Diagnostics Channel and Tracing Channel. Bryan English, aka @bengl, has also made several contributions to Node.js core and, along with myself, co-authored the book Multithreaded JavaScript. I have also written Distributed Node.js. All three of us have given many different conference talks. I guess what I'm trying to say is that we really do intend for these changes to be beneficial to the Node.js ecosystem at large, and not just for paid APM products.

Basic Code Example

// Note that this wouldn't change any user-facing APIs,
// though user code may subscribe to the channels if they desire.

leibale commented 1 year ago

Hi,

TBH, I'm never played with the TracingChannel API, IDK how much it affects performance, but I do see the benefits for the users...

Can you please share some info about the performance of it? Did someone run benchmarks?
It looks like this API is still "experimental", do you know what is the plan for it? when this is going to be "stable"?

Thanks :)

Qard commented 1 year ago

Hi there, I'm the creator of diagnostics_channel and TracingChannel, so feel free to ask any questions you like and I'll be happy to answer. 👋🏻

TracingChannel released in Node.js very recently, so it will have a brief window of experimental status. However we've been working with a designing the pattern in the Datadog tracer for well over a year now and it's worked quite well for us. The intent is to get some libraries trying it out and prove the viability of pushing instrumentations upstream from tracing products to a shared interface in the libraries. I would not want it to land without a bit of adoption to prove out the concept, and in particular would like at least one instance of each of the different categories of things we trace: a routing framework, a key-value/cache client, a SQL client, etc.

As for benchmarks, there wasn't much benchmarking done for TracingChannel, but the underlying diagnostics_channel is a no-op if there are no subscribers (It did billions of publishes per second in the benchmarks on the original PR) and it's about 5-10x faster than event emitters when there are subscribers.

leibale commented 1 year ago

Sounds good, I guess we can give it a shot.. :) At the moment we are working on v5 (see here and here), I think it'll be a good idea to add it into one of the v5 "next" versions. WDUT? Does one of you want to contribute a PR (most of the socket/client changes for v5 are ready)?

tlhunter commented 1 year ago

@leibale thanks for the go-ahead!

I'll work on the PR. I can checkout the v5 branch; maybe I'll try to make a PR against v4 and v5. Do you know when the v5 release will happen?

leibale commented 1 year ago

"Soon"... We already have a "next" version for @redis/client (which is the sub package that contains what you're interested in)

redis / node-redis

Feature Proposal: Tracing Channel / Diagnostics Channel support #2590

Motivation

Basic Code Example