Support an Envoy x-request-id propagator

open-telemetry / opentelemetry-specification

Specifications for OpenTelemetry

https://opentelemetry.io

Apache License 2.0

3.75k stars 889 forks source link

Support an Envoy x-request-id propagator #2912

Closed jamielennox closed 4 months ago

jamielennox commented 2 years ago

We run a large number of services using istio and envoy as a service mesh and at the moment are trying to use opentelemetry for trace collection to tempo.

Currently we are trying to achieve this at the platform level rather than require too much configuration from the applications so are exporting w3c spans from envoy to otel via the opencensus handler (there is a open issue in envoy to use otlp directly).

Separately to whatever trace headers you send envoy requires you to forward the x-request-id header . If you do not things like getting a stable trace sampling across services will not be possible.

It would be great for opentelemetry to have a way to create a propagator for x-request-id such that from the OTEL_PROPAGATORS or similar configuration we could instruct applications that we must forward this header to get successful trace data.

Whilst it is arguably not a trace header i do believe it fits under the header of context propagation and the growth of envoy makes it reasonable to include across languages.

jamielennox commented 2 years ago

Note, i'd also be happy with a generic header propagator that i could configure with headers i care about in the standard environment variable/config way. That doesn't seem the direction we are taking though.

OTEL_PROGPAGTOR_GENERIC_HEADERS=x-request-id,request-id,correlation-id

carlosalberto commented 2 years ago

Hey @jamielennox

Overall this sounds about right, and it should be possible to add it to the OTEL_PROPAGATORS env var once an implementation exists somewhere (contrib repo maybe, as Envoy as a well known OSS component). We need however contributors to make this happen ;)

akira-kurogane commented 2 years ago

Wait, what about span id, parent span context, sampling decision headers?

If you say 'Here is another trace id context propagation method' a lot of people will assume that it provides what the others do. (Others = zipkin "b3" style, W3C trace context, older Jaeger style, older Opentracing.)

Then after going with it they will have no controlled sampling behaviour and incorrect parent-child relationships, and they will have a super-frustrating debugging it.

In the Istio environment my apps are in I can see it adds x-request-id, but that's only for Istio's purposes as far as I can tell. It also adds x-b3-traceid, x-b3-sampled, x-b3-spanid, x-b3-parentspanid headers. I don't know if that's optional by config of Istio as I'm not the Istio admin, but it is present in my case.

It's only the zipkin b3 stuff that gives me hope I can get span parent-child relationships correctly traced.

jamielennox commented 1 year ago

A few rebuttals 😄:

For me a propagator is about context propagation which is not strictly the same as tracing. Envoy logs all access logging against this request-id so even ignoring the span concept of tracing i want this header to be propagated to correlate access and application logs across distributed systems.
I don't think we can be responsible for people not understanding what the different span types are. Propagators are something you can optionally configure and this wouldn't necessarily be enabled by default. For me what makes this (and a lot of opentelemetry) important is that i can coordinate between the application and the platform to get the right outcome for the whole system without asking every team to build this logic. As a platform owner i need this header for the system to work.
For sampling your assumption is that it's an otel collector doing that work, at least in my current setup it is not. I'm using istio configuration to set the sample rate because it is already a sidecar on my application, and from a platform level i can very easily adjust this value without restaring or modifying the application (something i hope that opamp or similar might allow from otel). Unfortunately because envoy is agnostic to the header format it uses x-request-id to perform a stable sampling. So basically unless i can forward x-request-id i have to redo the architecture.

Essentially though, this would never be usable as a replacement for span headers, very useful in addition.

akira-kurogane commented 1 year ago

[We're not] responsible for people not understanding what the [x and y and z] ..

No. Human intepretation about a tool is crucial factor in that tools' success-or-fail outcome.

If you mean that x-request-id will be used as a traceid

(This is how I originally interpreted your proposal, because it's easy to wish that OT just used "x-request-id" in place of "x-b3-traceid", if you're in an Istio environment.)

Can you take responsibility for making it reasonably easy for users to understand that just this one context propagation methods is traceid-only? Unlike the traceid and spans and parent-span examples that the learning resources all show.

It will be tricky to explain. Spans records will continue to be generated by OT tools/agents, each one with some new, random spanid. So users will see spans in their tracing UI etc.. But they will often be misleading because the parent-child info is missing.

If you mean that x-request-id won't be used as a traceid:

It will be too hard for a user to understand that some propagation method of tracing data, other than the one called "baggage", is not for tracing that OT tools do. The way propagation is explained in so many OT learning resources a tracing purpose is going to be the default assumption. The four required propagators of OT tools are: W3C trace context, B3 (tracing), Jaeger (tracing) and W3C baggage.

There's a big responsibility to do documentation and education work if this is changed.

To me it sounds like your proposal is 'baggage propagation of a user-supplied matchlist of headers, without nesting inside a "baggage" header', or 'matchlist config to select arbitrary headers for propagation as hack for external systems' . But then I wonder, why not this way in the first place? Why did OT working groups decide to nest baggage key-value pairs? I assume there's a good reason.

jamielennox commented 1 year ago

So, x-request-id is not a span/tracing header in the tracing system, as much as I wanted envoy to sort this out for me it cannot be used as such and you need to propagate additional headers to get spans (istio defaults to b3). Traditionally I'd define it as a correlation header as (amongst other things) it can be used for linking log messages. However as OpenTelemetry is linking logs with traces though spans should I call it a trace header now? Terminology is hard.

So my question comes down to scope of the propagators.

If the only valid use case of the propagators is to support spans and baggage as defined by w3c, jaegar, skywalker, etc then you're correct this is different.
If we consider propagators for context propagation and baggage that can be used for logging and other correlation, then I think x-request-id is valid. I would like to be able to use x-request-id propagation in systems that have "tracing" entirely disabled.

I'm not sure about defining it as user-defined baggage, but i'm still figuring that piece out. To me baggage is different in that the examples i've seen are things like user-id or tenant-id. Things that are very useful to propagate and that you likely want in logs/metrics but are more application specific and more dynamic (in terms of the keys that are available). Versus more platform headers like the w3c/b3 or x-request-id headers that are read by a service mesh or otel collector.

I can agree that maybe a generic forwarder is appropriate, this was my follow up in comment 2 where i'd be ok with a generic propagator where i could manually specify headers. I'm also a fan of clarity for users so i was worried that might be too vague and easy to abuse.

akira-kurogane commented 1 year ago

If the only valid use case of the propagators is to support spans and baggage as defined by w3c, jaegar, skywalker, etc then you're correct this is different.

I wish OT propagation was just what you'd expect from the general definition of word. However tracing-with-spans and baggage are the use cases being shown in OT learning resources, and the lack of options to do otherwise suggests it's intentional. That's the suggestion that existing users will have received already, and new users of the foreseeable future will continue to get too.

OK, let's rewind:

In comment # 1 you proposed modifying OTEL_PROPAGATORS to accept a new option so "x-request-id" would be propagated. OTEL_PROPAGATORS takes a list of groups ("b3" means the x-b3-* headers, "tracecontext" means "traceparent" + "tracestatus" headers, etc.) so I presume "x-request-id" or maybe "envoy" or "istio" would be the propagation group name you would add in a patch, and it would propagate the single-item list of "x-request-id".
In the short comment # 2 you made the alternative proposal for OTEL_PROPAGATOR_GENERIC_HEADERS. It would be a new option; takes the exact headers rather than a short name for a set of them like OTEL_PROPAGATORS does.
Then Carlos commented approval for the idea of changing OTEL_PROPAGATORS

My previous comments are from what I imagined users doing, and then being confused about, if the # 1 proposal was released.

I wish Carlos had commented approval for # 2 instead ... any input now @carlosalberto ?

For OTEL_PROPAGATOR_GENERIC_HEADERS I'd strongly suggest "forced" or "custom" should be in the option name too. So the name alone projects it is a non-default thing, not the way you would enable things you already heard about in other parts of the documentation.

Also a 'forced' or 'custom' name would hopefully make people realize its unpredictable that what will happen if your distributed requests pass through non-OT components. For your use case it will, hopefully, get things just right for Grafana Tempo, but passing through arbitrary headers must have potential for unwanted overriding, flooding, etc. Or worse, it becomes black magic - it makes something work, someone puts it in stackoverflow as a solution, it gets cargo-cult-copied until the end of time - and that blocks people from understanding how distributed tracing things really work.

jamielennox commented 1 year ago

So having seen this implemented in a couple of languages i see that the expectation of propagation being limited to tracing with spans is an expectation in the code as well. Realistically beyond that you can see a strong assumption that w3c headers is what propagation is for. I disagree with this and think x-request-id is a good example of a useful and valid propagator but it makes the question a bit bigger than just the training materials.

In golang almost all implementations are using a SpanContext for propagating with very w3c things like string length and hex requirements, leading to implementations like aws having to deconstruct and reconstruct their headers for no perceivable reason/benefit. In golang the interface only specifes a context, so implementations such as opencensus can bypass that and it's pretty easy to do a propagator for x-request-id.

Similarly in java the textmappropagator has a generic context parameter that tries to make you use the opinionated SpanContext but will let you store any key/value (though this time opencensus does convert to a SpanContext).

In dotnet this is a bit more of a pain, i think because of the Activity system. The TextMapPropagator dictates a PropagationContext which supports the ActivityContext (which looks like the prototype for SpanContext in other languages and again is w3c specific) and a Baggage. In this case we end up having to use baggage explicitly for key/val storage, which in practice is fine but i don't know how this would interact with the actual "baggage" propagator yet.

In short, it works in the languages i've tried, but it feels like opentelemetry is strongly biased to w3c and the concept of generic or multiple span propagation is not as easy as i would like it to be. This would be even more complicated with a generic header propagator.

As a side note, the libraries are resistant to adding custom propagators easily. The registry for OTEL_PROPAGATORS in most languages is hardcoded to the known propagators and it is an exception to specify something in this list that is not available. This can be taken up seperately but i think conveys that library authors don't want people to use propagators outside of the core lib (hence the need to build in). It's a shame, as it works really well in most cases.

WojciechKuk commented 1 year ago

can't envoy conform to W3C? maybe it's worth adding traceparent support in envoy and istio (in place of x-request-id, which was introduced a few years ago due to lack of better standards at the time)

tonglil commented 1 year ago

FWIW you can write your own propagator(s) and use them in addition to the standard ones. The nice thing about OTel is that it's able to extract (parse) and inject (transmit) multiple formats and headers.

For example, you can modify the baggage propagator into your own propagator: https://github.com/open-telemetry/opentelemetry-go/blob/v1.13.0/propagation/baggage.go#L29

I created a datadog trace propagator because one doesn't exist: https://github.com/tonglil/opentelemetry-go-datadog-propagator/blob/main/propagator.go

    propagator := propagation.NewCompositeTextMapPropagator(
        propagation.Baggage{},
        propagation.TraceContext{},
        b3.New(b3.WithInjectEncoding(b3.B3MultipleHeader|b3.B3SingleHeader)),
        ot.OT{},
        jaeger.Jaeger{},
        xray.Propagator{},
        datadog.Propagator{},
        // your own x-request-id propagator here
    )

    otel.SetTextMapPropagator(propagator)

I agree with @WojciechKuk that istio should add an option to conform to w3c, although even not necessary if they had an option to switch to b3 headers for correlation (of requests to logs, and maybe traces).

Something to be aware of with x-request-id is that istio can change the id depending on specific actions, so it shouldn't be used as a "trace" id (at least naively without some parsing): https://www.envoyproxy.io/docs/envoy/latest/api-v3/extensions/request_id/uuid/v3/uuid.proto#envoy-v3-api-msg-extensions-request-id-uuid-v3-uuidrequestidconfig

pavolloffay commented 1 year ago

FYI: request to support tracecontext in envoy https://github.com/envoyproxy/envoy/issues/30052

johnzheng1975 commented 1 year ago

  propagator := propagation.NewCompositeTextMapPropagator(
      propagation.Baggage{},
      propagation.TraceContext{},
      b3.New(b3.WithInjectEncoding(b3.B3MultipleHeader|b3.B3SingleHeader)),
      ot.OT{},
      jaeger.Jaeger{},
      xray.Propagator{},
      datadog.Propagator{},
      // your own x-request-id propagator here
  )

  otel.SetTextMapPropagator(propagator)

For x-request-id propagator, I have to implement Extract, Inject by myself, instead of put a simple string "x-request-id", right?

austinlparker commented 7 months ago

On review, we're interested in how relevant this particular issue is given that Envoy now supports OpenTelemetry and W3C Trace Context.

svrnm commented 4 months ago

@jamielennox I am closing this issue now, if you think this is still relevant, feel free to re-open it or create a new issue with an updated ask.

rkrzewski commented 2 months ago

I'm looking into introducing distributed tracing to a Java application running on Kubernetes cluster where Istio and Jeager are already deployed. I came across this issue after exploratory searches for Open Telemetry + Istio.

As I understand the situation (please bear with me, I'm just getting my feet wet with this subject):

Envoy uses custom x-request-id header for internal request correlation. They don't consider it a distributed trace id, but require the workloads to propagate it from input to output. Envoy supports distributed tracing using pluggable "tracers" (including Zipkin, DataDog and others) that use their protocol-specific headers.

https://www.envoyproxy.io/docs/envoy/latest/intro/arch_overview/observability/tracing#arch-overview-tracing-context-propagation

Whichever tracing provider is being used, the service should propagate the x-request-id to enable logging across the invoked services to be correlated.

@basvanbeek reaction to proposal replacing W3C tracecontext instead of x-request-id is a "not going to happen" because of backwards compatibility:

https://github.com/envoyproxy/envoy/issues/30052#issuecomment-1831641644

x-request-id, and to a lesser extend x-correlation-id, have been around for a long time and are typically used to propagate, identify, and correlate requests also disconnected from and predating distributed tracing (...) And some ecosystems actively operate and make decisions on these values. Yes it is a mess but we can't be constantly breaking stuff for site owners and making it worse.

@austinlparker said above

Envoy now supports OpenTelemetry and W3C Trace Context.

Which is true, with respect to distributed tracing, but to the best of my understanding still requires the workloads to propagate opaque x-request-id header none the less.

I think some action on OpenTelemetry side on this is warranted, given that the situation on Envoys side is extremely unlikely to change.

Regarding the proposals:

@akira-kurogane appears to be strongly opposed to introducing a new mechanism for propagating generic, opaque headers (orthogonal from span context / baggage propagation) because users could misunderstand it and misconfigure their systems with potential performance and security implications.

(final paragraphs of https://github.com/open-telemetry/opentelemetry-specification/issues/2912#issuecomment-1326960469)

Dedicated propagator envoy handling opaque x-request-id header was also rejected by @akira-kurogane because it could be confusing to users, expecting propagators to handle tracing / context propagation exclusively.

(second part of https://github.com/open-telemetry/opentelemetry-specification/issues/2912#issuecomment-1324623015)

I don't think it is an insurmountable problem, please consider the following:

Note! envoy propagator handles only x-request-id header used by Envoy for internal request correlation. If you want to use distributed traces, you must use another propagator in addition to envoy handling the actual trace context propagation that matches the configuration on Envoy's side. For example: use envoy,tracecontex,baggage when using opentelemetry tracer in Envoy, or envoy,b3multi when using zipkin tracer.

(I'm not sure the configuration suggestions above are correct, because I'm still trying to wrap my head around this, but you get the gist)

I think that adding a dedicated propagator is the preferable course of action because it's easier to document and harder to misuse.

I just took a look at java AWS X-Ray propagator source ant it seems that implementing envoy propagator would be trivial: extract x-request-id header, store it in Context under EnvoyRequestId key or similar, don't touch the Span and Baggage, inject the header back on the other side, done.

I could try making a PR to opentelemetry-java-contrib to that effect but please let me know if there's a chance to get it merged.

akira-kurogane commented 2 months ago

I'll pull out of this discussion. I did think the proposal was against user expecation, but it was so long ago that I made this post, and since I last used opentelemetry, I wouldn't know if conventions are changed.