Adding a means to denote preference for Link vs Span in given Context

hibachrach commented 5 months ago

First time contributor here--please let me know if there's anything I need to correct procedurally!

What are you trying to achieve?

I'm hoping to extend the API description for Context as it pertains to Traces (or perhaps change the API description for Traces?)

Here's the problem I'm trying to solve:

Consider a user flow in which a file conversion service accepts an HTTP request. The user's intent is to upload a file to convert it from one file type to another (e.g. .pdf -> .png). For infrastructure performance reasons, this work is enqueued in a messaging system/background job processing system (e.g. Sidekiq in the Ruby ecosystem). As part of handling the same request, a similar message/background job is sent to the same system. This flow is depicted below:

graph TD
    A[Client] -->|Requests| B[Request Handler]
    B -->|Enqueues| C[Background Job for File Conversion]
    B -.->|Enqueues| D[Background Job for Analytics]
    style D stroke-dasharray: 5 5

From a user's perspective, only the solid-line nodes in the above graph are important. The dashed lines are merely an implementation detail. Similarly, only the Spans related to the solid-line nodes are, to me as the engineer monitoring this system, significant. If the analytics-related message is not complete for another hour, that does not matter to me very much.

This maps neatly to the notion of child Spans & Links: for a step in a flow where it is all really part of one high-level request, one would hope to have that represented as a child Span. It is truly just part of the overall Trace (or, at least, what I would expect a Trace to capture). On the other hand, the analytics message is relatively detached. While a Link could point me to its source, it is not something that is part of the user journey.

What did you expect to see?

Ideally, this is something that you'd be able to hint at via Context. Here's what the above would look like via pseudocode:

function requestHandler():
  Context.current.setPropagationStyleHintTo('Child'):
    BackgroundJobSystem.enqueue(FileConversionTask)
  BackgroundJobSystem.enqueue(AnalyticsTask)

This way, the Otel instrumentation for BackgroundJobSystem would know how to relate the next Span it creates to the current Span. This hint would reset within further nested spans (e.g. any tasks that FileConversionTask may enqueue would, by default, be connected via Links.

Some open questions that I don't have the answer to just yet:

What makes sense as the default propagation style? My guess is this would be defined by the instrumentation library. E.g. Sidekiq's uses Links by default.
~Should an instrumentation library be permitted to disregard this hint?~
- I would think yes: in the Sidekiq example, there would necessarily be the creation of both the PRODUCER and CONSUMER Spans and you'd want it to apply to both. So the instrumentation library needs to be given some flexibility in interpretation. Also, making it a requirement would necessarily be a breaking change I would think?

Additional context.

I was directed here from https://github.com/open-telemetry/opentelemetry-ruby-contrib/issues/991 as a contributor thought this may make sense more generally.

dmathieu commented 5 months ago

Something similar is already being done by instrumentation libraries. For example, the WithPublicEndpoint option in Go: https://github.com/open-telemetry/opentelemetry-go-contrib/blob/a91e60beace7c1d96302fe359050d705c4723e72/instrumentation/net/http/otelhttp/handler.go#L140-L146

Ruby has this proposal open for the Rack instrumentation: https://github.com/open-telemetry/opentelemetry-ruby-contrib/pull/837

It seems your pseudo-code proposal would be the same as:

link = OpenTelemetry::Trace::Link.new(OpenTelemetry::Trace.current_span.context)
span = tracer.start_root_span(
  "FileConversion",
  with_parent: OpenTelemetry::Context.empty,
  links: [link]
)

It's a bit more verbose, but nothing a simple helper method can't resolve. Does it really need a spec change?

hibachrach commented 5 months ago

Hmm, perhaps I'm misunderstanding, but I don't think your code snippet isn't equivalent to what I'm describing. Your code allows one to sever the parent-child relationship between Spans that is present by default, but it doesn't allow for the opposite (converting what would be a Link by default into a parent-child relationship).

trask commented 5 months ago

cc @open-telemetry/semconv-messaging-approvers

pyohannes commented 5 months ago

@hibachrach If I understand correctly, you enqueue messages (or background jobs), and at enqueue time you want to specify whether the span for dequeuing the message (or running the background job) is created as a child or a link of the current span.

This effectively requires two parts:

A standardized way for propagating this decision or hint. For W3C, this might be part of the tracestate header.
On instrumentation side, producers need to populate this hint in the outgoing context, consumers need to extract and honor it.

This is definitely a valid use case, however, I'm wondering whether the need is widespread enough to merit a specification change and implementations in all languages. This issue is a great place to collect use cases and gauge interest.

Currently and without this capability, this needs to be solved on instrumentation side. In your case, the (sidekiq) instrumentation library could provide a configuration option to specify for which jobs you want to have parent/child relationships and for which you want to have links. I know of some instrumentation libraries (for example AWS Lambda for .NET) where such an option exists on a global level and allows switching between parent/child and link relationships.

lmolkova commented 5 months ago

Both static and context-based configuration for instrumentation are problematic

Static configuration is not enough since some jobs are parented, others are linked.
Context-based - there could be other spans nested under Context.current.setPropagationStyleHintTo('Child') which would be affected whether application author wants it or not

This sounds like a visualization concern rather than instrumentation/data collectopm issue.

@hibachrach could you please elaborate on

From a user's perspective, only the solid-line nodes in the above graph are important. The dashed lines are merely an implementation detail. Similarly, only the Spans related to the solid-line nodes are, to me as the engineer monitoring this system, significant. If the analytics-related message is not complete for another hour, that does not matter to me very much.

Could you please help me understand why is it important to use links (and not parent-child) for some of the background jobs:

is it a visual noise problem?
is it related to billing?
does it skew some additional analysis like e2e latency?
anything else?

hibachrach commented 5 months ago

@pyohannes

Currently and without this capability, this needs to be solved on instrumentation side. In your case, the (sidekiq) instrumentation library could provide a configuration option to specify for which jobs you want to have parent/child relationships and for which you want to have links

Yes, this was my first thought as well--see https://github.com/open-telemetry/opentelemetry-ruby-contrib/issues/991

@lmolkova

Both static and context-based configuration for instrumentation are problematic

Static configuration is not enough since some jobs are parented, others are linked.

Context-based - there could be other spans nested under Context.current.setPropagationStyleHintTo('Child') which would be affected whether application author wants it or not

So context based solution would not reset it for the whole subtree, but merely for the immediate child. At the discretion of the instrumentation library author for that child node, it could continue propagating. Otherwise, it would reset.

This sounds like a visualization concern rather than instrumentation/data collectopm issue.

[...]

Could you please help me understand why is it important to use links (and not parent-child) for some of the background jobs:

is it a visual noise problem?

is it related to billing?

does it skew some additional analysis like e2e latency?

anything else?

I would say it's all of those items you listed:

It does add a lot of noise to a trace, though I know that wouldn't be a big issue with our tracing SaaS product of choice (Honeycomb)
AFAIK, there's not a way generally to ensure both ends of a link get chosen if using sampling. This effectively makes links useless for us. This is the primary reason I think this makes sense to tackle at the instrumentation/collection level.
As of today, we aren't able to do analysis on total trace duration (merely span duration) but my hope is that we'd be able to in the future. And yes, when that day comes, having "extra" subtrees in the trace would hamper that analysis.

dmathieu commented 5 months ago

So context based solution would not reset it for the whole subtree, but merely for the immediate child. At the discretion of the instrumentation library author for that child node, it could continue propagating. Otherwise, it would reset.

What if you want the setting to keep propagating, but the instrumentation author decided otherwise?

Yes, this was my first thought as well--see https://github.com/open-telemetry/opentelemetry-ruby-contrib/issues/991

This issue is very relevant, and maybe it does entitle a specification change. But does it require an API change? A specification change could be something that asks (official) instrumentation authors that their libraries may provide an option to propagate the parent span as a link rather than the default parent/child relationship.

cc @arielvalentin

lmolkova commented 5 months ago

thanks for the details @hibachrach

I do agree that all of these problems are good ones to solve, but I wonder if the proposed solution actually solves them in general case.

First, application need to customize parent/links relationships for specific spans in the system. It would improve the situation, in some case, but

visual noise can happen with parents or links
inconsistent sampling for links would force people to pick parent-child relationship and would increase the noise and skew e2e analysis
e2e latency would be a problem either way since only the specific application knows where the end is and which operations are fire-and-forget.

I'd suggest to tackle each of the problems individually and in a generic way (ask backends to improve links support/visualization, improve sampling with links, come up with ways to identify how to measure e2e latency, etc).

One more thing:

In many cases having links is not a choice, but a necessity - e.g. when pulling message from a queue, you don't know the context of the message(s) you're about to pull and can't use parent-child relationship. For the systems that don't have this problem, e.g. in-memory queues, parent-child may be a better default.

svrnm commented 1 month ago

@hibachrach is this still something you are interested in? please take a look at @lmolkova's comment

hibachrach commented 1 month ago

I'm interested in the general problem but not married to any particular solution--if it would help to close this issue to focus on different manners of addressing it, feel free to do so.

Thanks for your checking in 😄

open-telemetry / opentelemetry-specification

Adding a means to denote preference for Link vs Span in given Context #4054