open-telemetry / opentelemetry-specification

Specifications for OpenTelemetry
https://opentelemetry.io
Apache License 2.0
3.75k stars 889 forks source link

behavior of context and trace in the presence of timers #3144

Open blumamir opened 1 year ago

blumamir commented 1 year ago

Edit: it turns out I was wrong - nodejs and web ARE consistent across timers. This issue still discusses if the default behavior is the most useful one for normal users, and if it makes sense for the spec to state the behavior explicitly.

I recently started using OpenTelemetry for the browser, using the JS SDK.

In OTEL JS, there are separate context managers for browser and node, one uses nodejs async hooks ~which only preserves context across async operations~, and the second uses zone.js library ~which preserves context across async calls, timers, event handlers etc. My findings are that it creates a different behavior for the same JS code that runs in the browser vs node - single large trace vs multiple short unrealted fragmented traces~

In our application, the browser has some polling that fetches an API once every 30 seconds. The generated trace using the current context manager which propagate context across timers makes it so we group all these operations into one single infinite trace which can last for hours and contain many thousands of spans (across browser and the instrumented services). This behavior is not friendly to our tail sampler, and the huge output trace makes it difficult for human end users to extract value from such traces.

I searched the spec to try to understand what is the expected behavior for context managers but couldn't find any reference. Generally, my question is:

Giving a timer registration in the instrumented code, should operations in the timer callback be part of the invoker trace or should they start a new trace

I guess different users can have different needs depending on the use case, but:

Would love to hear more opinions on this issue Do you think the spec is the right place to define the expected behavior and give guidance on how to instrument in practice?

yurishkuro commented 1 year ago

Do you think the spec is the right place to define the expected behavior and give guidance on how to instrument in practice?

After working with tracing for almost 8yrs I don't think I can formulate a one-size-fits-all strategy of how to deal with deferred work. It's very much dependent on the specific scenario and how the application wants to define the "workflow".

blumamir commented 1 year ago

Do you think the spec is the right place to define the expected behavior and give guidance on how to instrument in practice?

After working with tracing for almost 8yrs I don't think I can formulate a one-size-fits-all strategy of how to deal with deferred work. It's very much dependent on the specific scenario and how the application wants to define the "workflow".

I agree. If I did not misunderstand - current SDK implementations are free to implement whatever behavior they desire, with no consistency no documentation, and no configuration options. These can lead to a bad user experience and infinite traces by default.

Tried to trace the reasoning behind the decisions in JS SDK, but it seems just random and not following any documented logic to justify the implementation. @vmarchaud - maybe you know why things were written as they are?

vmarchaud commented 1 year ago

If I did not misunderstand - current SDK implementations are free to implement whatever behavior they desire, with no consistency no documentation, and no configuration options. These can lead to a bad user experience and infinite traces by default.

I don't agree on this, i'm pretty sure that if you start a async loop from within a trace, most SDKs will tie each iteration to the parent trace resulting in infinite traces. I agree that you might not want that in your case but this is consistent with how context propagation works in other cases.

Tried to trace the reasoning behind the decisions in JS SDK, but it seems just random and not following any documented logic to justify the implementation. @vmarchaud - maybe you know why things were written as they are?

It isn't a decision for timer per se, more of a logical pattern across nodejs callback in general so it's not random. The implementation is mostly ported from opencensus which was then ported from google tracing agents which was inspired from what everyone was doing at the time.

Again i can understand that you might not want that in the browser but saying that its random isn't true, it's how everyone was tracing code since it was available in NodeJS (i believe around node 4).

blumamir commented 1 year ago

Thank you @vmarchaud for the input.

If I did not misunderstand - current SDK implementations are free to implement whatever behavior they desire, with no consistency no documentation, and no configuration options. These can lead to a bad user experience and infinite traces by default.

I don't agree on this, i'm pretty sure that if you start a async loop from within a trace, most SDKs will tie each iteration to the parent trace resulting in infinite traces.

What kind of loop are you referring to? for example, this loop will behave differently in node and browser:

import api from '@opentelemetry/api';

const tickTracer = api.trace.getTracer('timeout');
const tick = () => {
  // in zonejs (browser) - the active context here is `clickSpan`
  // in async hooks, the active context here is `ROOT_CONTEXT`

  const span = tickTracer.startSpan('interval timeout');
  // span will be root trace in nodejs, but will be child of clickSpan in browser
  // while the browser tab is open (could be days), more and more spans are added to the `click` trace

  api.context.with(api.trace.setSpan(api.context.active(), span), () => {
    // process timeout
  });
  span.end();
}

const clickSpan = api.trace.getTracer('user-interaction').startSpan('click');
api.context.with(api.trace.setSpan(api.context.active(), clickSpan), () => {
    setInterval(tick, 1000);
});
clickSpan.end();

I agree that you might not want that in your case but this is consistent with how context propagation works in other cases.

In this case, maybe the node implementation is not aligned and needs to propagate the context across timers to be consistent with other SDKs? I think addressing this in spec in a loose manner (SHOULD) will contribute to a consistent user experience and mental model for what traces represents.

Tried to trace the reasoning behind the decisions in JS SDK, but it seems just random and not following any documented logic to justify the implementation. @vmarchaud - maybe you know why things were written as they are?

It isn't a decision for timer per se, more of a logical pattern across nodejs callback in general so it's not random.

In the browser, timers callbacks preserve context because this is how zonejs operates by default. Other callbacks will not be picked up by zone and will thus not preserve context, so browser is already not consistent in this manner, which is what I am suggesting to fix.

The implementation is mostly ported from opencensus which was then ported from google tracing agents which was inspired from what everyone was doing at the time.

Thank you for the input on how things came together. What I meant to say is that the behavior of (at least 2) sdks varies in this aspect, leading to inconsistent behavior which users cannot anticipate.

Again i can understand that you might not want that in the browser but saying that its random isn't true, it's how everyone was tracing code since it was available in NodeJS (i believe around node 4).

I was wondering if you know from past discussions if this implementation is intentional. are there any reasons to connect timers in browser and not connect them in node?

blumamir commented 1 year ago

By the way, looks like splunk already encountered a similar issue and are offering custom context manger for the browser to overcome this issue in their otel distro

Splunk RUM Agent includes a custom context manager that connects parent traces with traces that happen in:

  • setTimeout (with less than 34ms timeout)
  • setImmediate
  • requestAnimationFrame
  • Promise.then / catch / finally
  • MutationObserver on textNode
  • MessagePort
  • Hash-based routers

These also cover most common methods frameworks use to delay re-renderers when values change, allowing requests done when a component is first rendered to be linked with the user interaction that caused the component to be added to the page.

blumamir commented 1 year ago

Do you think the spec is the right place to define the expected behavior and give guidance on how to instrument in practice?

After working with tracing for almost 8yrs I don't think I can formulate a one-size-fits-all strategy of how to deal with deferred work. It's very much dependent on the specific scenario and how the application wants to define the "workflow".

I agree. At the end of the line, one has to install a context manager and it will be implemented one way or another. So my question is more of - should there be a decent default? or is it left for the SDK to decide based on it's knowledge of the runtime

vmarchaud commented 1 year ago

What kind of loop are you referring to? for example, this loop will behave differently in node and browser: I was wondering if you know from past discussions if this implementation is intentional. are there any reasons to connect timers in browser and not connect them in node?

That's not how it should, setTimeout continue the context for exemple so setInterval should too. What nodejs version are you using to test this ? For me that's a bug.

In the browser, timers callbacks preserve context because this is how zonejs operates by default. Other callbacks will not be picked up by zone and will thus not preserve context, so browser is already not consistent in this manner, which is what I am suggesting to fix.

Adding it in the spec will just make the browser sdk not in spec. We can't track other callbacks in the browser because there are no API to do so, it's a known issue and sadly a enormous task that @legendecas has started (https://github.com/legendecas/proposal-async-context).

I agree. At the end of the line, one has to install a context manager and it will be implemented one way or another. So my question is more of - should there be a decent default? or is it left for the SDK to decide based on it's knowledge of the runtime

Adding on my previous comment, we are constrained by the platform each sdk target, i don't think it's reasonable to expect a standard behavior across all of them (node, browser, lambdas and in the future deno, workers like cloudflare or even bun). The problem with Javascript is the fragmented runtime landscape and its getting worse.

However i'm not saying it's impossible but considering known resources (both in the community or companies around the space) it's complicated right now. For example it took Qard ~13 months to land implementing promises hook into NodeJS: https://github.com/nodejs/node/pull/36394 + https://chromium-review.googlesource.com/c/v8/v8/+/2759188) which was i believe sponsored by Datadog. You can see the similar kind of work for Deno in https://github.com/denoland/deno/issues/7010

Flarna commented 1 year ago

Besides the fragmentation within JS the spec would have to consider other techs like Java. As far as I know Java uses thread local on default for context propagation which works nice for synchronous calls but may fail for async, thread pool based frameworks.

Limiting discussion to timer APIs is also just a small part of the topic. In JS if you promisify them special handling for timer APIs is likely lost as promise context propagation takes over. And there are other functionalities which might prolong a trace to a potential unrelated chain of operations.

In the end the data collecting side should match to the backend. It's of no help that OTel instrumentations correctly link spans for days or weeks if the backend (or the user on the backend) only wants to correlated for < 2 hours. Therefore this is something for the tuning area not to specify once for all.

mateuszrzeszutek commented 1 year ago

Besides the fragmentation within JS the spec would have to consider other techs like Java. As far as I know Java uses thread local on default for context propagation which works nice for synchronous calls but may fail for async, thread pool based frameworks.

FYI the Java instrumentation agent contains several instrumentations that carry the current context across threads - but only for one-off async tasks. Repeatable scheduled operations (e.g. ScheduledExecutorService's scheduleAtFixedRate and scheduleWithFixedDelay) do not inherit the context from the thread that submits them.

blumamir commented 1 year ago

That's not how it should, setTimeout continue the context for exemple so setInterval should too. What nodejs version are you using to test this ? For me that's a bug.

@vmarchaud , I checked again in node, and it seems you are right - the context is indeed passed across timers invocations. My issue was different. Sorry about that. So looks like the behaviors are consistent which is great.

I guess for me and for other cases (java distribution and splunk distribution of the browser SDK), it makes sense to start a new trace for timers that are set to happen in the future (setTimeout(callback, >0), setInterval()), in oppose to things that are scheduled to happen right away (set Immediate (), setTimeoute(callback, 0)).

Adding it in the spec will just make the browser sdk not in spec. We can't track other callbacks in the browser because there are no API to do so

The spec should not define required behaviors that are not feasible to implement. It can describe the preferred behavior as a SHOULD requirement which allows SDKs to diverge if needed. However, I do think that in js it is possible to achieve various behaviors with instrumentation of the timer functions / libaries or registering to zonejs/asynchooks hooks to alter the behavior depending on the kind of trigger.

It could be amazing to have some text in the spec to promote consistency and help users know what to expect. It could also be great if the spec explicitly allows implementations to be configured to support various useful cases which users can set to their needs. No matter what, eventually the same spans are generated in all cases, but how to connect them together into traces and where the boundaries are set is a subjective matter, and users might want it one way or another.

vmarchaud commented 1 year ago

The spec should not define required behaviors that are not feasible to implement. It can describe the preferred behavior as a SHOULD requirement which allows SDKs to diverge if needed

It could be amazing to have some text in the spec to promote consistency and help users know what to expect

I don't see how both statement can be true at the same time, either it's consistent or it's not.

No matter what, eventually the same spans are generated in all cases, but how to connect them together into traces and where the boundaries are set is a subjective matter, and users might want it one way or another.

I understand the use case but i'm pretty sure it's not possible using AsyncLocalStorage in node so you must use AsyncHooks which isn't great. I think you can just implement your own context manager to solve this.

blumamir commented 1 year ago

I don't see how both statement can be true at the same time, either it's consistent or it's not.

I was Imagining something like "consistent if possible", to describe the desired expected behavior if it is possible to implement, so at least the common experience is as consistent as possible.

I understand the use case but i'm pretty sure it's not possible using AsyncLocalStorage in node so you must use AsyncHooks which isn't great. I think you can just implement your own context manager to solve this.

In JS one can patch the setIntermmediate/setTimeout functions to achieve this behavior even with AsyncLocalStorage, right? (haven't actually tested it, but it sounds possible).

vmarchaud commented 1 year ago

In JS one can patch the setIntermmediate/setTimeout functions to achieve this behavior even with AsyncLocalStorage, right? (haven't actually tested it, but it sounds possible).

Yeah sorry i wasn't clear, i meant you can't do this just with ALS so you must patch timers manually which requires having to write your own context manager