nodejs / node

Node.js JavaScript runtime ✨🐢🚀✨
https://nodejs.org
Other
106.59k stars 29.06k forks source link

Proposal: mark AsyncResource and AsyncLocalStorage as stable #35286

Closed Qard closed 3 years ago

Qard commented 3 years ago

The AsyncResource API has been largely unchanged since its creation many majors ago, and AsyncLocalStorage has been out in the wild for awhile now and has proven its utility. I think both should be considered for stable status at this point. The challenge I see is that they live in the async_hooks module which has been marked as "experimental" at the top-level because the async_hooks feature itself exposes internals thus many (myself included) were unwilling to accept it as a "stable" API.

What I suggest is that AsyncResource and AsyncLocalStorage be considered as stable in the next major. Due to the module being named after the one not stable feature, it may also be worth moving the two APIs out to new top-level modules, but I'll leave that last point for discussion.

How should we proceed in getting these two APIs to stable status?

cc @nodejs/diagnostics

mmarchini commented 3 years ago

cc @nodejs/tsc for awareness

bmeck commented 3 years ago

I'm extremely not comfortable with enterWith or exit being considered non-experimental since they are an action at a distance, but the rest seems possible to me at least. If concrete and necessary use cases in the wild exist for those methods beyond being easier to consume than run I'd be open to conversation about why they are needed but I'm still not seeing wide adoption that forces them to exist. Those APIs are within the realm of what caused problems with domains in the post mortem and I'd be very hard pressed to understand why they are not the same danger.

Qard commented 3 years ago

Domains were global, so enter and exit applied globally. That meant calling either method could totally mess up some other components sense of what the context was. However, each instance of AsyncLocalStorage is separate, so you'd only ever get enterWith or exit called by the owner of that storage, thus they'd have more of a sense of what the "correct" thing to do is in their specific case. Messing with the context of an AsyncLocalStorage instance only impacts that instance, so I don't see how any of the issues with domains would emerge.

If people are really so concerned about those specific parts of the API though, I'm happy to consider marking the rest of the API stable and just leaving those specific parts as experimental. Also, I'm less sure about the stability of AsyncLocalStorage than AsyncResource, given its relative newness in comparison, though I'm fairly confident at this point there are no significant issues remaining.

bmeck commented 3 years ago

@qard The problem is just like how domain allows non-local and non-scoped manipulation of context in non-async boundaries. Your argument that only a "owner" will manipulate assumes that no-shared async storage would require that consumers have a specific coding style (in particular, not exposing a direct reference to an AsyncLocalStorage instance) which I don't find compelling. The docs show a local storage that is shared across multiple locations for example.

Qard commented 3 years ago

The docs show an AsyncLocalStorage instance shared within the same user code, which is the intent. Sharing a context object with other modules or other peoples code in general is a bad idea and not encouraged.

bmeck commented 3 years ago

@Qard

Sharing a context object with other modules or other peoples code in general is a bad idea and not encouraged.

Agreed, but the concerns of sharing it with only .run are much different from those 2 methods, hence me not wanting to move those methods out of experimental. The .run method cannot alter non-locally. We can wait on more usage before coming back to those.

Qard commented 3 years ago

The enterWith() method is needed for APM vendors to be able to achieve patch-free instrumentation with the combination of AsyncLocalStorage and the work I'm currently doing on diagnostics_channel.

With diagnostics_channel, http servers can publish a message indicating the start of a connection, which a subscriber can then do als.enterWith(context) to bind the entire request to the context. If there's only als.run(...) then they would need to patch http servers to intercept the request handler, which is fragile and can negatively impact performance. There's also many ways a request handler can be added, and they could be removed and added again any time in the future, due to servers just being an event emitter. This makes the code to correctly patch an http server actually fairly complicated, whereas with the diagnostics_channel message the handler itself becomes irrelevant.

Personally, I disagree that there is any issue with enterWith() or exit(). If you're still concerned about it, perhaps we should add more explicit warnings to the docs about sharing AsyncLocalStorage contexts? Regardless of those specific functions, sharing a context is just a bad idea because it gives you a single object space that can be overridden at any time. The danger of some downstream code calling enterWith() is no more dangerous than something calling run() and messing up the context you expect to have in a callback. Sharing a context object between consumers is just universally a bad idea and this specific problem with domains is exactly why AsyncLocalStorage was created as a non-global system, allowing each user that wants to propagate context to have their own isolated object to do it with.

I entirely disagree that this is a repeat of the domains issues. Domains were problematic because they all shared one context globally--you can't be in more than one domain at a time. That's not the case here, each context is isolated.

bmeck commented 3 years ago

@Qard I do not agree with that last comment, hence me not wanting to move those methods out of experimental. We can have this discussion here if necessary but it likely would be more useful to have it elsewhere. Perhaps an explanation of why that specific API design which does have non-local affects is necessary would be more helpful. I'm still unclear on why APM is unable to use .run. I will repeat that I am extremely uncomfortable with those APIs and having convenience in usage is not my priority in allowing those to move forward to stable. The rest of the APIs seem fine to me.

AndreasMadsen commented 3 years ago

+1 on marking AsyncResource as stable. It being experimental is the primary reason module authors don't want to use it. Other parts of async_hooks still needs work, but I don't see it interact with the AsyncResource API.

I would suggest adding a documentation section, regarding what node.js would in the future, when the other parts are stable, consider a major/minor/patch change to the async context graph. This would serve as a default guideline for module authors when they change the async context graph.

PS: Be sure to also mark the C++ AsyncResource API stable as well.

mcollina commented 3 years ago

I would prefer to split those two things:

  1. mark AsyncResource stable
  2. keep AsyncLocalStorage experimental for some more time
mhdawson commented 3 years ago

My initial thinking is along the same lines as @mcollina with leaving AsyncLocalStorage in experimental for a bit longer.

Qard commented 3 years ago

Sounds good re: letting AsyncLocalStorage sit a bit longer. I'm mainly pushing for AsyncResource here because, as already explained by Andreas, module authors are refusing adoption because of it being marked as experimental and that is resulting in context loss scenarios.

I do think we should make a more specific decision at least around when we should re-evaluate if AsyncLocalStorage is ready though. We've sometimes let things sit in experimental forever and go stale, which I'd like to avoid here as this is an important feature for APMs.

mhdawson commented 3 years ago

@Qard agreed on thinking about the timeline for AsyncLocalStorage. For some of the other Experimental features we agreed on a set of exit criteria in terms of usage, etc. and that was useful in making the case when it was the right time to exit Experimental.

Qard commented 3 years ago

@bmeck Here's a simple example of using a combination of AsyncLocalStorage and diagnostics_channel to enable patch-free instrumentation.

const dc = require('diagnostics_channel')
const onHttpRequestStart = dc.channel('http.server.request.start')
const onMySQLQuery = dc.channel('mysql.query')

const als = new AsyncLocalStorage()

// If an incoming request is received, create a Trace object
// and set it as the context for the request.
// This channel message does not intercept the handler
// as it should not have access to mutate arguments. But it
// triggers this message handler before running `request`
// events, so it can be used to enter the context first.
onHttpRequestStart.subscribe(() => {
  als.enterWith(new Trace())
})

onMySQLQuery.subscribe(() => {
  const trace = als.getStore()
  if (!trace) return
  // Report the MySQL event to the trace
})

Note that to be able to start the als context for the http request, we would need to either use als.enterWith(...) in the diagnostics_channel message handler or patch http servers to wrap the request handler in als.run(...). The patching method creates additional closures and, as explained before, there are many ways to attach a handler, and they can be removed too, so simply patching createServer(...) is not enough.

As for the comment about enterWith(...) not really being any more dangerous than run(...). Consider this app:

const app = express()
const als = new AsyncLocalStorage()

app.get('/user/:id', (req, res) => {
  const requestData = {
    userId: req.params.id
  }

  als.run(requestData, () => {
    // A function from some other module
    doAThing((err, user) => {
      console.log(als.getStore().userId)
    })
  })
})

If doAThing looks like this:

function doAThing(cb) {
  // BOOM. `requestData` is gone.
  als.run({ some: 'data' }, cb)
}

it will blow away the store entirely, because the two points are unaware of each other calling run. The console log will log undefined.

If doAThing looks like this:

function doAThing(cb) {
  als.getStore().userId = 'some id'
  cb()
}

it will replace the userId that was already in the store with a new one. The console log will log 'some id'. You could also run into situations where it checks for presence of a named field, expecting it to be set from one place but it's been set by something completely unrelated and causes all sorts of havoc.

tl;dr: Do not ever share a store with code you don't control. It's essentially the same concept as setting all your variables as globals. You're bound to run into a collision eventually and it's going to be hell to debug. Just don't do it. 😬

bmeck commented 3 years ago

@Qard it is more complicated than that, consider:

const als = new AsyncLocalStorage();
emitter.on('my-event', () => {
  als.enterWith(store);
});
emitter.on('my-event', () => {
  als.getStore();
});
ee.emit('my-event');

The problem with these bleed through APIs unlike .run is you don't have fine delineation.

ee.prependListener('my-event', () => {
  als.getStore(); // gets the store BEFORE enterWith
});

Will escape this kind of implicit timing based side effect.

ee.emit('my-event');
// AFTER
als.getStore(); // still has the store meant to span 'my-event'

Will keep the store after the span it sought to encompass.

If you combine the 2, you can get stale store values. I can imagine plenty more scenarios where this is true, but its pretty simple to see. I'm not against solving the problem of needing to span across calls, I am only uncomfortable with the current APIs.


Per your example with .run that same sharing has a problem with .enterWith but is not the major concern of mine. I'd agree that sharing is dangerous regardless but other issues around accidental sharing as my example in this comment are the real worry.

Qard commented 3 years ago

That's intentional. It's setting the context for the rest of the sync tick, which all that is part of. That's actually specifically the functionality that is needed for using AsyncLocalStorage and diagnostics_channel together for patch-free instrumentation, and also why AsyncLocalStorage contexts are intentionally designed to be isolated from each other rather than having some shared global context.

My APM agent can say I want the http request diagnostics_channel event to trigger enterWith() on my AsyncLocalStorage context and that will not influence someone else's separate AsyncLocalStorage context in any way. The owner has the control, and that's the intent. If I hand off my context store to someone else and they mess it up, that's my own fault. That's like saying a stream based API shouldn't be allowed to return a stream and should instead just take a callback to receive the chunks, because if they return the stream someone could pause it--that's kind of the point.

bmeck commented 3 years ago

@Qard I understand that is the design of the API, but disagree on the intent of the API being to be the API. The API is to solve the problem of wrapping around these contextual event spans. My example above is showing that the API is leaking outside of those spans. I understand how you can achieve some things with the APIs but I am not sold on the leakage being non-trivial. The problem that is attempting to be solved by those APIs seems to differ from the affects of the APIs.

Qard commented 3 years ago

It's not leaking though. It's capturing exactly the span it's supposed to, as indicated in the docs which call out this exact behaviour. It is intentional that the context continues for the rest of the sync tick and not just a lexical scope. It is not trying to capture only the local scope of the function where it is placed, that's what run() is for. see: https://nodejs.org/api/async_hooks.html#async_hooks_asynclocalstorage_enterwith_store

bmeck commented 3 years ago

@qard it goes beyond the events and outside of the emit. I do not dispute the documentation, just that the API is a leaking abstraction that goes beyond the goal of wrapping a specific time frame and instead wraps past that time frame. It would be good to understand why having the store set to something AFTER the emit() in my example completes is intended and needed. I can understand why it needs to go after the first listener, that is understood. Per my comment above, the current API can generate stale references.

Qard commented 3 years ago

It's not going beyond the goal. It's achieving exactly the intended goal. That API was created by me, and I wrote it specifically for the purpose of pairing it with my now in-progress work on diagnostics_channel to be able to tell an AsyncLocalStorage to enter a context at the point that diagnostics_channel emits the request start event just before the server emits the request event. At that point, everything in the rest of the sync tick should be linked to that context, not just the current lexical scope. Rather than run() which says "make a new async scope and set its context to this" the enterWith() method is saying "for the current scope, set its context to this".

bmeck commented 3 years ago

@Qard It is unclear to me what you are talking about at this point, you are describing the API and not the problem it is trying to solve. I'm asking to understand why the API isn't considered to be leaking given that it doesn't appear to be wrapping the event but rather the event plus a long tail until the sync stack of anything unwinds. Repeating that you have achieve an intended goal is well and good, but it doesn't explain what the goal was, which I understand to be setting the store value to be something during the event so that it can be persisted. A more concrete example rephrasing my above comment:

var { AsyncLocalStorage } = require('async_hooks');
var { EventEmitter } = require('events');
var als = new AsyncLocalStorage();
var ee = new EventEmitter();
var incr = ((i = 1) => () => i++)();
ee.on('my-event', () => {
  var value = incr();
  console.log('pre-entering: %s previous: %s', value, als.getStore());
  als.enterWith(value);
});
ee.on('my-event', () => {
  console.log('post-entering: %s', als.getStore());
});
function run() {
  console.log('run');
  ee.emit('my-event');
  setTimeout(() => {
    console.log('unrelated timeout got: %s', als.getStore());
  }, 0);
}
run();
run();

You can see old values crossing into "previous:" and the timeout getting values even if it is outside the event. It would be good to understand why this is not problematic. I remain uncomfortable with those 2 APIs given examples like this one.

Qard commented 3 years ago

An async server is a tree of spans of sync execution. This is modelled with async_hooks in the form of execution and trigger ids along with lifecycle events. In an app with nothing but a single set immediate, you will have the top-level sync tick execution span which, between the start and end of that sync tick, will emit an init event for the setImmediate. Later, when the setImmediate callback is triggered, a before event will trigger before running the callback and an after will trigger after.

These before and after events will be linked to the init event through a shared async id, allowing one to traverse the async execution graph in id form and enabling correlation of anything between the before and after to be correlated to a point in that graph by using the current async id. All this complicated graph management is what AsyncLocalStorage does internally by attaching the given context object to an async resource and propagating it onto newly created async resources whenever an init happens. When that previously mentioned setImmediate triggers its init, the current value stored in the context (if there is one) will be attached to the newly created resource so that value will be used as the context between when the before and after of that resource are run.

The run() method of AsyncLocalStorage encapsulates a callback in a new async resource to ensure that everything within that callback gets attributed to that resource. What it actually does is just call enterWith() between the before and after of a newly constructed AsyncResource. This means that, because AsyncResource automatically becomes the current resource when the runInAsyncScope() method starts and is popped from the execution async resource stack when it ends, any async activity within the callback will be linked to that resource for the specific span of time that it is considered the current resource, which is the before/after boundary triggered around runInAsyncScope(). Lifting enterWith() out from running inside of run() just means it's binding to the existing current resource instead of creating a new one. This means that until the current execution async resource is popped from the stack when the after happens it will propagate the current value into all async branches initialized in that time. If enterWith() is called midway through a tick, it means for the remainder of that tick, or until a new context is set, propagate this object into any async resources initialized in this sync tick.

Being able to change the context object partway through the current tick is an intended feature and enables patch-free instrumentation. In the immediate term, it enables something like this:

// Say I have a simple http server that makes a SQL query
// I want to create a trace for the request and attribute the
// mysql query to that trace...
const server = http.createServer((req, res) => {
  mysql.query('...', (err, data) => {
    res.json(data)
  })
})

// ... a naive APM agent might do something like this:
const als = new AsyncLocalStorage()

// In a real-world scenario, we wouldn't have access to `server`
// so we'd have to patch the module loader to intercept the http
// when it is loaded and patch a large chunk of its API to basically
// just achieve what these three lines of code below are doing.
server.prependListener('request', () => {
  als.enterWith(new Trace())
})

// A more real-world looking patch to collect spans from MySQL
// would look something like this...
const original = mysql.query
mysql.query = function wrappedMySQLQuery(query, callback) {
  const trace = als.getStore()
  const span = trace.startSpan('mysql', 'query')

  return original.call(this, query, function wrappedMySQLQueryCallback(err, res) {
    span.end()
    return callback.call(this, err, res)
  })
}

With diagnostics_channel, we will effectively be getting an event immediately before that request event and through an API which requires no specific awareness on the users part of the existence of any specific server object. In the previous example, the server variable needed to be shared, which is rarely the case so more complex monkey-patching is made to intercept the entire module loading process just so we can do that single server.prependListener(...) on every server we see. The amount of hackery APMs need to do just to effectively run a handful of lines of code on some object we don't have direct access to is, frankly, absurd but also quite fragile and damaging to performance.

Anyway, my point is, the ability to change the context partway through a tick is an important feature for APMs in many places. The pre-http request example is the main one, but sometimes we need it for other types of servers. Another case is that APMs are encouraged to preload before app code via node -r $module, and if the user wants a trace of the entire process from the top-level, we can't wrap the top-level of the entry point module automatically without something like enterWith() to set our context object on the current async tick scope.

A ton of work has gone into getting these APIs to where they are now. Solving the problem of async execution graph tracking has literally been the core focus of my career for nearly a decade now, so I've put a ton of thought into it and built several APM companies agents around this model. While the AsyncLocalStorage API is still relatively new to core, it has been something I, and many others working in APM, have been pushing for in core for a long, long time.

Qard commented 3 years ago

Also, everything about your example is working exactly as intended. The context value is supposed to persist until the end of the tick so any async branches after that point will propagate that object into those async branches. If the object is changed midway through a tick, any future async inits in the remainder of that sync tick should propagate the new object. The "value" of the context between before and after hooks should be whatever the value was at the time the resource init event happened.

Flarna commented 3 years ago

I'm +1 on promoting AsyncResource stable.

I would wait a little bit longer with AsyncLocalStorage for following reasons:

Regarding your discussion @bmeck @Qard I'm with @Qard regarding the need of these APIs. I'm aware that they may cause problems if used wrong. But this is the nature of all APIs. Someone could argue that all open calls may result in a leak if close is not called... The major difference between AsyncLocalStorage and domains is that domains consumed/redirected/discarded errors which was a global state. If one AsyncLocalStorage user garbles it's own als instance by using the wrong APIs it's a local problem. There is no global state change nor is there any impact to other als instances. @bmeck Do you have some proposal how an API should look like to reach the requirement described by @Qard to attach to an existing resource for it's lifetime?

FWIW .NET offers an AsyncLocal class with is similar. They don't offer something like .run, just set a value.

bmeck commented 3 years ago

Anyway, my point is, the ability to change the context partway through a tick is an important feature for APMs in many places.

I agree to this and do think we should have some way to solve it, but not to the points in my comments being non-trivial.

@bmeck Do you have some proposal how an API should look like to reach the requirement described by @Qard to attach to an existing resource for it's lifetime?

I have no real demands on the actual signature but it shouldn't be the end of the current tick that determines when it is changed. An API that matches the following would work fine to me:

Something that persists until a synchronous termination before the end of the tick. APIs like eventEmitter.emit('request') are a good example of something that the assignment should stop at the end of. I am not opposed to it escaping the local call frame of the event.

Right now, the API to me bleeds past the intended synchronous call frame that it was hoping to wrap and I still don't see an explanation of WHY going past that call frame is intended and necessary, just explanations of that it does and it is intended. I am familiar with these APIs and async queueing, I don't think everyone needs to establish their level of familiarity here. TC39 also has had presentations on these issues which I have been a part of and I have had plenty of discussions on why AsyncLocalStorage style APIs are rejected there, most recently with Async Contexts. I am not trying to be as absolute as I would be in TC39 and do want to find a middle ground, but find that subset of the current APIs to be non-starters without clarification and exhaustive production usage from many sources showing no issues (either with development or runtime).

Qard commented 3 years ago

It needs to be until the end of the tick though because the point is that we want to capture everything in that tick. An APM product wants to trace an entire request, not just part of it. At the start of the tick which runs the request handler until the end of it, we want to know everything that happens within that tick and within every async task descending from that tick so we can attribute that activity to the request that spawned it. If, for example, running the request handler was followed by running a setTimeout to timeout the request if it's not finished soon enough, we want to see that. We need our context object, which is probably our tracer, to be propagated into that so we have some way of finding the current trace within the descending async execution. I created enterWith() specifically because there are a few scenarios where we need to capture until the end of the tick. I was quite explicit in the documentation about warning exactly what this means. The suggested API for 90% of users is definitely the run() method, enterWith() is a special-case API.

bmeck commented 3 years ago

@Qard that makes the assumption that there is a contract that all calls grouped within the same tick are co-related, this is not true in the language and not clearly delineated in the code. I am asking that we have clear delineation, not coincidental relation. If we can somehow delineate clearly that we are exiting the relation I think having implicit / dynamic changing of context is acceptable as a compromise, TC39 disagreed due to dynamic scoping concerns. My compromise is that we have a clear delineation not that we prevent dynamic scoping and is based upon examples of having same tick but unrelated events. It would be a nigh impossible burden to ensure that every tick does not have multiple async resources that collide given the current APIs to my knowledge. I am only not requesting removal and didn't prevent landing because it is experimental and I hope over time we can tweak the API for clear delineations. I understand the APIs have the side effects you want for some scenarios, but the API contract is not clear nor reliable given my example above. Your assumption that all calls in the same tick needs more proof and to address counter concerns of having multiple unrelated things in the same tick. Clear delineation would satisfy my concerns even if it doesn't prevent code from violating the implicit and unobservable contract from existing. I am trying to be reasonable but I still am unclear on the WHY of the API design given counter examples. Stating that the API is working as intended isn't really a way for me to better understand why my counter examples are out of scope. Stating that my counter example doesn't fit the APIs in question's expected implicit contract raises more concern, not lessens it.

Qard commented 3 years ago

The correlation an APM wants is that something happened in the same tick, thus delineating at the barriers of the tick is exactly what we need. The interest of an APM is to draw a picture of what technically is going on in the system, and that is that a linear sequence of ticks is happening on the main thread with causation linked back to some prior tick, but not necessarily the one immediately before the current one, which necessitates us switching a global pointer at the start of tick to know what the triggering tick was up until the end of the tick. We want an object that follows the exact same lifecycle as the init/before/after of async_hooks, which is exactly what we're getting for all descending sync ticks, but not the current tick, unless we use enterWith().

Qard commented 3 years ago

Also, yes, we actually do have a guarantee that each tick has a single async resource, which is even exposed through the async_hooks.executionAsyncResource() API, which is exactly what AsyncLocalStorage is built on top of. Without that guarantee, async_hooks wouldn't even function properly, because it attributes the before/after around a tick to a given async resource by its id.

bmeck commented 3 years ago

It needs to be until the end of the tick though because the point is that we want to capture everything in that tick. An APM product wants to trace an entire request, not just part of it. At the start of the tick which runs the request handler until the end of it, we want to know everything that happens within that tick and within every async task descending from that tick so we can attribute that activity to the request that spawned it. If, for example, running the request handler was followed by running a setTimeout to timeout the request if it's not finished soon enough, we want to see that. We need our context object, which is probably our tracer, to be propagated into that so we have some way of finding the current trace within the descending async execution. I created enterWith() specifically because there are a few scenarios where we need to capture until the end of the tick. I was quite explicit in the documentation about warning exactly what this means. The suggested API for 90% of users is definitely the run() method, enterWith() is a special-case API.

As stated in my comment above, this is an assumption that the tick is only related to a single thing. This is not something we can specify to people creating code clearly without a delineation mechanism; otherwise, we must state that per a given tick, only 1 thing can be safely related. The problem that I keep bringing up is not that you want to see a setTimeout related to a request but per my example, a setTimeout that is not related to the thing, but queued on the same tick, hence the naming in my example code. I understand that the documentation explains the contract to people using .enterWith but it does not provide means for people being observed by these mechanisms to delineate when things become unrelated.

If we are to state that assumption, we could try a tweak such that multiple .enterWith on the same tick causes a thrown error to prevent that? I personally think the API is still a problem, but at the minimum that would cause the creator of the .enterWith code to see that something is outside of the expectations of the .enterWith API.

Also, yes, we actually do have a guarantee that each tick has a single async resource, which is even exposed through the async_hooks.executionAsyncResource() API, which is exactly what AsyncLocalStorage is built on top of. Without that guarantee, async_hooks wouldn't even function properly, because it attributes the before/after around a tick to a given async resource by its id.

We have a single "incumbent" AsyncResource in WHATWG terms, but my concern is about the less concrete idea of simpler async resources (tasks?) / user land things created by your current tick. Perhaps we can adopt terminology to more clearly differentiate things.


If desirable we can setup a call next week sometime to directly discuss these 2 APIs and perhaps get a better understanding as I feel we are currently talking past each other. Such a discussion could be intended as a discussion about needs to get the API's solutions to problems out of experimental not necessarily about removal or persistence of the exact existing API.

Qard commented 3 years ago

The setTimeout that is queued on the same tick is related though, because the specific criteria we have to declare something as related is that it happens in the same tick. It is intentional that doing an enterWith() in an event emitter handler would make other event handlers run within that context, attributing any async activity within it to the same context as the rest of the tick. Things become unrelated when the tick ends, which is exactly what async_hooks already does. In fact, the behaviour of enterWith(...) is functionally identical to AsyncHooks.enable()--you can enable a hook set at any point in a sync tick and all following async activity initialized either until the end of that tick or until disable() is called, will trigger async_hooks lifecycle events. That's exactly what I need to be able to do with AsyncLocalStorage too.

As for arranging a call, if you think that is necessary to understanding the importance of this feature we could arrange to go over it in a future Diagnostics working group deep-dive.

gireeshpunathil commented 3 years ago

@Qard - here are some generic exit criteria for experimental APIs. This does not include any API specific constraints that the author / developer might have identified / have in mind:

pls ignore if some of the items are not applicable. Otherwise, stating and asserting the stand of each of the above, I guess will help us take this forward.

Shahor commented 3 years ago

Aloha 👋

First of all, I really hope this is the right place to chime in about this, I am very new to contributing to the node project so I hope I'm not just wasting your time 🙇

I started playing with AsyncLocalStorage today as I was in need of APM-like context tracking in my express app, and wanted to remove old custom patches that are now becoming an issue with packages upgrades.

Anyhow, I've been following this feature for months (and wanting to use something like this for years even) so it's way of working is not really new.

The thing is I believe the documentation is not clear enough on what the difference between .run and enterWith is exactly, and that lead me to do a quite long session of lib diving and hair pulling.

As I understand it now from everything I've read in this issue, both run and enterWith do the same thing with the exception that run creates a new separate AsyncResource to deal with the context.

Because of that, in an express world with body-parser and receiving POST requests, we are pretty much in this scenario:

const { PassThrough } = require("stream");
const { AsyncLocalStorage } = require("async_hooks");

const als = new AsyncLocalStorage();
const request = new PassThrough();

function callback() {
  console.log(als.getStore());
}

als.run({ id: "stream" }, function () {
  request.on("data", function (chunk) {
    // noop
  });
  request.on("end", function () {
    callback(); // store will be undefined
  });
});

request.push(Buffer.from("whatever"));
request.push(null);

(I don't know the internals of express so I might be wrong here, but it's an approximation that surfaces the problem).

Because the data in the request is received outside the als.run it is ran from a different context, existing prior to the als.run one and therefore we won't receive the store as we'd expect.

This version works though:

const { PassThrough } = require("stream");
const { AsyncLocalStorage } = require("async_hooks");

const request = new PassThrough();
const als = new AsyncLocalStorage();

function callback() {
  console.log(als.getStore());
}

als.enterWith({ id: "stream" });

request.on("data", function (chunk) {
  console.log(chunk.toString());
});
request.on("end", function () {
  console.log("end");
  callback(); // store will be exactly what we want
});

request.push(Buffer.from("whatever"));
request.push(null);

As the data parsing will happen in functions called from some context that is a direct descendant of the one in which the enterWith was called.

The problem here is that most of this code is hidden from the user (as it is done by the raw-body lib, used by body-parser), and I would think that the majority of node users would not understand why it would not work properly because it requires some knowledge of the internals and how this works.

Is there any way we can surface those better in the documentation to try and alleviate this misconception about those two methods and how they differ?

Qard commented 3 years ago

Event emitters aren't async, so no async resource is created when attaching event handlers. All it does is push the handler into an array. When you push data outside the run, it will run those handler functions directly in the context where the push happens, which is outside the scope, so there's no context to link. The correct thing to do would be to do the pushes inside the run handler.

There has been discussion in the past about possibly wrapping every event handler in an AsyncResource, but that is both costly and has had much debate about if is correct to bind to where the handler is attached or to where the emit happens, so no movement has really happened there.

Shahor commented 3 years ago

The correct thing to do would be to do the pushes inside the run handler.

yes, I understand that now.

Unfortunately that is something we don't have control over in an express setup like follows:

// yarn add express body-parser
const Express = require("express");
const { AsyncLocalStorage } = require("async_hooks");
const BodyParser = require("body-parser");

const PORT = 3000;
const als = new AsyncLocalStorage();
const app = Express();

app.use([
  function (request, response, next) {
    const store = { id: Math.random() };
    als.run(store, next);
  },
  BodyParser.json(),
]);

app.post("/", function (request, response, next) {
  const store = als.getStore() || {};

  response.json(store);
});

app.listen(PORT);
console.log(`Listening on port ${PORT}`);

image

As you can see, as soon as there is a payload to be parsed, als.run's context is lost, that is due to raw-body treating the request as a stream.

I don't understand the internals enough to know exactly why, although I suppose that the data pushing inside the request is handled outside the newly entered context, which enterWith would force regardless.

Qard commented 3 years ago

There's already a pull request to fix that in raw-body: https://github.com/stream-utils/raw-body/pull/71

Shahor commented 3 years ago

Thanks for your patience and explanations 👍 This makes a lot more sense (and is so valuable to have in core <3)

Qard commented 3 years ago

No worries. Sorry if any of this is unclear in the docs. I definitely feel the docs around all this could use some work.

Shahor commented 3 years ago

It's fine since the discussion is open :)

Shahor commented 3 years ago

Would it be possible to surface an als store through some events 🤔 ?

Considering the example in this comment

process.on('unhandledRejection', () => {
  als.getStore() // undefined here
})

This would always be undefined. I suppose there is no way to interact with this EventEmitter in such a way that it would be ran under the ALS ?

Qard commented 3 years ago

An unhandled rejection could come from anywhere, so no there's no way to bind that to an AsyncLocalStorage instance. The context in an event handler will be whatever the context was where the emit was called. In this case, it'd be in Node.js internals somewhere, not from the promise that the error originated from.

mmarchini commented 3 years ago

@Qard should we remove from diag-agenda until we have a proposal for exit criteria?

Qard commented 3 years ago

Sounds reasonable. :)

vdeturckheim commented 3 years ago

@gireeshpunathil thanks a lot for the list of exit criterea, here is my take on it for the follwoing scope:

I am restricting to this scope as I believe this is a viable minimal API for AsyncLocalStorage that has no strong controverial points - there might be discussions about the DX of this scope but AFAIK there are no strong opposition against these methods and this constructor. This scope is, IMO, the most likely to reach consensus.

user interfaces (flags, API semantics): are those finalized / ratified internally?

-> On the given scope, yes, the only debatable method is run and if we decide to make an alternative the mains canonical way of starting context spreading, there will always be a trivial way of building run over it.

platform coverage: is the capability available in all (identified) platforms?

-> I am not aware of any platform limitation for this scope

bugs: how many critical bugs outstanding? is the backlog ratified?

-> AFAIK there is no bug opened on the API per se. There is a know performance cost in promise-heavy environment but it should not be considered as a blocker: enabling AsyncLocalStorage is optional and the end user has the choice (directly or through packages they use) to pay that cost. However there are bugs linked to AsyncLocalStorage when ecosystem packages break context propagation (for instance here https://github.com/nodejs/help/issues/3186). The solution is actually to move the API to stable to push adoption/compatibility forward and get rid of these bugs.

worker support: will feature work seamless within worker threads? or special settings required, and documented?

-> I believe there is no impact here

test coverage: unit test and smoke test. are all the APIs covered? are there known flakes?

-> There are 15 test files for the AsyncLocalStorage API covering multiple use cases and situations. (All the ones which name start with async-local-stoagre here https://github.com/nodejs/node/tree/master/test/async-hooks)

serviceability: upon production issues emanating from this feature, do we know how to debug?

-> This one is probably the one where we probably can improve the most. I have spent a lot of time debugging context spreading issues and I am probably too used to this to be able to give a satisfactory response here. When we built this API, I had the feeling that it would mostly be used by APM/RASP providers. We have opened issues about the API in this repo and in nodejs/help that actualy prove me wrong on this (which actually makes me very happy). There is probably room for a debugging guide or at least a few debugging tips in the doc and I woul dbe happy to add them to the PR that would mark the scope as out of experimental.

documentation: are the capabilities documented?

-> There have been a few iterations on the doc and I believe it is clear enough as of right now.

evidence of adoption: production deployments, npm modules, framework stacks.

-> It is hard to find numbers for this but the DataDog APM for Node.js actually uses this API which means there must be a large amount of Node.js live application using it through this module.

dependencies on experimental / flaky upstream APIs (such as v8): are there any, and if so, do we know mitigation plans.

-> AFAIK @Qard is still working on JS-side promise hooks, which will affect all of this, but only as an implementation detail and we can move forward without blocking on it.

This is a very personal analysis which has 2 major biases:

Thanks again @gireeshpunathil for providing this great framework. Based on this, I would like to suggest the defined scope to be marked as out of experimental and I will soon open a doc PR in that direction. Let me know what you think :)

Aschen commented 3 years ago

We are using AsyncLocalStorage in Kuzzle since more than 7 months and we didn't encounter any issue or any bad API design flaws.

Despite we know our clients are using this feature in production since months, since we are developing a Node.js framework we can affirm that many other users are using AsyncLocalStorage without any inconvenient.

:+1: for @vdeturckheim proposal to make this API stable so other products may want to take advantage of this feature as well

bmeck commented 3 years ago

That scoping seems fine to me.

Qard commented 3 years ago

I agree with that scoping also. Nothing controversial in there to me, and all that has been battle-tested for months now in the Datadog APM agent. Any issues we've had have been with context loss due to AsyncResource not being marked stable yet and therefore some libraries declining to adopt it, so I suggest we mark both as stable for 16.x and start promoting adoption and migration from using async_hooks directly where possible.

mhdawson commented 3 years ago

+1 for making non-experimental. @vdeturckheim do you know when you might open the PR so I can keep an eye out for it ?

vdeturckheim commented 3 years ago

@mhdawson I have a project ending early next week, so I planned to work on this in the next couple weeks (there might be some big doc updates to do)

mcollina commented 3 years ago

+1 for making non-experimental.