open-telemetry / semantic-conventions

Defines standards for generating consistent, accessible telemetry across a variety of domains

AWS X-Ray environment span link #230

Open rackerWard opened 1 year ago

rackerWard commented 1 year ago

I am wondering why the "AWS X-Ray Environment Span Link" section in https://github.com/open-telemetry/semantic-conventions/blob/main/docs/faas/aws-lambda.md states that the X-Ray trace context should be a span link instead of the span's parent. If the X-Ray context is not valid, the span would not have a parent. The event, if it has a trace context, would become a span link. It sounds like this was written to prevent the tracing backend from receiving traces with unknown parent references, because AWS basically does not have an OTEL exporter for X-Ray. It seems to be written to support a tracing backend that cannot obtain the X-Ray data.

I'd like the specification to describe the intent of the span structure it is meant to create. To me, the current wording is coupled to the way these systems are implemented today. Would it be written differently if the X-Ray-only tracing data were available?

A problem with this section is that a span link for X-Ray does not work in tracing backends that do not support span links. For example, Datadog does not support span links between different traces. However, Datadog is able to import X-Ray traces. Importing the X-Ray traces to complement the data published through OTEL creates a complete picture. This does not require publishing all traces to X-Ray; X-Ray is only used to collect the trace data that AWS generates implicitly. Because the X-Ray-only data is imported, it is better to use the X-Ray context as the span parent.

It also isn't clear how to insert yourself into a distributed tracing stack. How should a library operate when it is being added to an existing distributed trace? Should a library maintain the existing parent/child relationship or branch and reference the other trace?

AWS already has distributed tracing in place, and AWS's X-Ray distributed trace already describes a relationship.
There is a difference between using AWS X-Ray to view tracing data and understanding that distributed tracing in AWS is tightly coupled to X-Ray. AWS creates trace span relationships that are unfortunately only available in X-Ray. If it were possible to redirect X-Ray's traces to a collector, we could have more complete traces in other systems. The traces AWS creates in X-Ray have more information than what is available to the Lambda invocation.

I think the convention needs to be clearer about how X-Ray is critical in getting more complete distributed traces. Getting a complete picture of the AWS environment does not require publishing all tracing data to X-Ray. It requires having the implicitly published X-Ray trace data.

AWS needs to provide an OTEL exporter for X-Ray so that this data can be distributed to the necessary systems without building special side-loaders.

There are signs of how the trace should be created. How the trace is constructed affects how it can be accessed in the backend tracing system. For example, whether a backend supports span links affects how it builds a service map. A system that does not support span links will not create complete maps of the services. Systems that support "from" span links create a map but may not make it easy to navigate forward between traces.

I have created a specific span architecture because of backend tracing software. I don't know if there are names for kinds of span structures, but I have used one that I call projection spans in order to help a tracing system create maps of all the systems involved in a trace when traces use a new trace ID and link pattern. The projection span's parent is the event's trace context, and the span links to the span it is a projection of in the lambda's execution trace. The lambda's execution span links to the propagated trace context. The projection span duration is the same as the span it is a projection of. I have seen a similar idea in AWS X-Ray within SQS Lambda trigger traces.
The projection span has helped a system that does not support span links document service interactions spanning multiple traces. It also helps navigate among traces in systems that support "from" span links.
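The projection span structure described above can be sketched in a few lines. This is only an illustration of the relationships as described in this comment; the `Span` class, the `build_spans` helper, and the span names are hypothetical and do not come from any SDK.

```python
import secrets
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

Context = Tuple[str, str]  # (trace_id, span_id), both hex strings


@dataclass
class Span:
    name: str
    trace_id: str
    span_id: str
    parent_span_id: Optional[str]
    start: float
    end: float
    links: List[Context] = field(default_factory=list)


def build_spans(event_ctx: Context, start: float, end: float):
    """Return (execution_span, projection_span) for one invocation."""
    # Execution span: root of a new trace, linked back to the propagated context.
    execution = Span(
        name="execution",
        trace_id=secrets.token_hex(16),
        span_id=secrets.token_hex(8),
        parent_span_id=None,
        start=start,
        end=end,
        links=[event_ctx],
    )
    # Projection span: lives in the caller's trace as a child of the event
    # context, copies the execution span's timing, and links to it.
    projection = Span(
        name="execution (projection)",
        trace_id=event_ctx[0],
        span_id=secrets.token_hex(8),
        parent_span_id=event_ctx[1],
        start=start,
        end=end,
        links=[(execution.trace_id, execution.span_id)],
    )
    return execution, projection
```

A backend without span link support still sees the projection span inside the caller's trace, so the service interaction shows up on its map; a backend with link support can follow the link to the execution trace.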

It is easier to implement OTEL by going all in on X-Ray as the propagator, then linking the context in events as links.

rackerWard commented 1 year ago

AWS's documentation on converting AWS Lambda Telemetry API events to OpenTelemetry spans https://docs.aws.amazon.com/lambda/latest/dg/telemetry-otel-spans.html states that the OpenTelemetry spans are based on the X-Ray trace context.

The span's trace ID, parent ID, and span ID come from the AWS X-Ray context.
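As a rough illustration of how a trace ID and parent ID can be derived from an X-Ray context, here is a minimal sketch that parses a tracing header in the documented `Root=...;Parent=...;Sampled=...` form. The header value and the `parse_xray_header` helper are made up for this example; this is not the code AWS or any OpenTelemetry propagator uses.

```python
import re

# Example value in the X-Ray tracing header format (the IDs are made up).
HEADER = "Root=1-5759e988-bd862e3fe1be46a994272793;Parent=53995c3f42cd8ad8;Sampled=1"

_ROOT_RE = re.compile(r"^1-([0-9a-f]{8})-([0-9a-f]{24})$")
_PARENT_RE = re.compile(r"^[0-9a-f]{16}$")


def parse_xray_header(header):
    """Return (trace_id, parent_span_id, sampled), or None if the header is invalid."""
    fields = {}
    for part in header.split(";"):
        key, sep, value = part.strip().partition("=")
        if sep:
            fields[key] = value

    root = _ROOT_RE.match(fields.get("Root", ""))
    parent = fields.get("Parent", "")
    if root is None or _PARENT_RE.match(parent) is None:
        return None

    # The 8-hex-digit timestamp and the 24-hex-digit unique part together
    # form a 32-hex-digit trace ID, the length OpenTelemetry expects.
    trace_id = root.group(1) + root.group(2)
    sampled = fields.get("Sampled") == "1"
    return trace_id, parent, sampled
```

For the example header, the function returns the trace ID `5759e988bd862e3fe1be46a994272793`, the parent ID `53995c3f42cd8ad8`, and a sampled flag of `True`. Unknown fields are ignored and a malformed `Root` or `Parent` yields `None`.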

rackerWard commented 1 year ago

In aws-lambda.md, the SQS Event section states "The parent MUST be the SERVER span corresponding to the function invocation."
Does this mean that the first span created during an SQS Event invocation must have a parent created from the X-Ray context?

rackerWard commented 1 year ago

This document is also related because it takes the nature of X-Ray into account and references the SQS section of aws-lambda.md: https://github.com/open-telemetry/semantic-conventions/blob/main/supplementary-guidelines/compatibility/aws.md#context-propagation

rackerWard commented 1 year ago

The SQS (Lambda tracing active) section https://github.com/open-telemetry/semantic-conventions/blob/main/docs/faas/aws-lambda.md#sqs-lambda-tracing-active describes the ProcBatch span as having a parent that references the X-Ray context.

Oberon00 commented 1 year ago

> I think the convention needs to be clearer about how X-Ray is critical in getting more complete distributed traces.

I think it's quite the opposite. OpenTelemetry is driven by users and also vendors of different tracing systems and backends. The current convention of using X-Ray only as a link by default, not as the parent, comes from the practical experience that an X-Ray parent on a Lambda invocation otherwise breaks the distributed trace. For example, if X-Ray is disabled, the context will have an unset sampling flag, effectively disabling tracing completely even if you meant to use a different non-X-Ray system.

However, it would totally make sense for any AWS Lambda instrumentation to offer an (off-by-default) configuration option to use the X-Ray environment variable as the preferred parent.
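To make the trade-off concrete, here is a minimal sketch of what such an option could look like, based on my reading of this thread rather than on any existing instrumentation. The `prefer_xray_parent` flag, `RemoteContext`, `SpanPlan`, and both functions are hypothetical names. The second function mimics a parent-based sampler to show how an unsampled X-Ray parent switches tracing off.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class RemoteContext:
    trace_id: str
    span_id: str
    sampled: bool


@dataclass(frozen=True)
class SpanPlan:
    parent: Optional[RemoteContext]
    links: List[RemoteContext]


def plan_invocation_span(xray, event, prefer_xray_parent=False):
    """Decide the parent and links for the Lambda invocation span.

    xray:  context parsed from the X-Ray environment variable, or None
    event: context extracted from the triggering event, or None
    prefer_xray_parent: hypothetical opt-in flag, off by default
    """
    if prefer_xray_parent and xray is not None:
        # Opt-in: the X-Ray context is the parent, the event context a link.
        links = [event] if event is not None else []
        return SpanPlan(parent=xray, links=links)
    # Default: the X-Ray context is only a link; the event context, if any,
    # is the parent.
    links = [xray] if xray is not None else []
    return SpanPlan(parent=event, links=links)


def parent_based_decision(parent, root_decision=True):
    """Follow the parent's sampled flag; decide independently for root spans."""
    return parent.sampled if parent is not None else root_decision
```

With X-Ray disabled, the X-Ray context carries `sampled=False`. Under the default plan it is only a link, so the sampling decision follows the event context or the root decision. With `prefer_xray_parent=True`, the same context becomes the parent and a parent-based sampler drops the span.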

Oberon00 commented 1 year ago

> It also isn't clear how to insert yourself into a distributed tracing stack. How should a library operate when it is being added to an existing distributed trace? Should a library maintain the existing parent/child relationship or branch and reference the other trace?

A library would typically start a new span with the "current" span as parent and then, during the operation traced by the span, set that new span as current. That way, a tree is naturally formed.

rackerWard commented 1 year ago

> I think it's quite the opposite. OpenTelemetry is driven by users and also vendors of different tracing systems and backends. The current convention of using X-Ray only as a link by default, not as the parent, comes from the practical experience that an X-Ray parent on a Lambda invocation otherwise breaks the distributed trace. For example, if X-Ray is disabled, the context will have an unset sampling flag, effectively disabling tracing completely even if you meant to use a different non-X-Ray system.

> However, it would totally make sense for any AWS Lambda instrumentation to offer an (off-by-default) configuration option to use the X-Ray environment variable as the preferred parent.

A Lambda function implementation should be able to use any framework and backend it likes. Just because X-Ray is enabled for a Lambda does not mean the instrumentation must publish to X-Ray. An OpenTelemetry trace with a parent created from an X-Ray context does not require X-Ray to be the backend system.

A backend system that is missing the spans published by AWS Lambda will have broken trees for some Lambda trigger types, for example Lambda invoke. Without the X-Ray spans, the backend system will be missing the span that links the client’s invoke request to the Lambda’s invoke span. AWS Lambda adds the client’s request as the parent of the Lambda Context span, and the Lambda Context span is published to X-Ray.

An implementation that wants traces but gets an unsampled X-Ray context is a Lambda that is called by a trigger without X-Ray propagation support and has its tracing configuration set to passthrough.

An implementation that is tracing while X-Ray is not sampling is a Lambda that propagates the trace through the event payload. This is only possible with some of the triggers.

An implementation that is tracing with a trigger that only supports X-Ray propagation can’t propagate the trace header using W3C. It is using the X-Ray trace spans, but not necessarily publishing all traces to X-Ray.

An “off-by-default” configuration fits an implementation that uses a Lambda trigger without X-Ray propagation support and propagates the trace in the event. This Lambda invocation still has the Lambda context, Lambda function, initialization, invoke, and overhead spans generated by the AWS Lambda service.

A discussion about X-Ray and OpenTelemetry is complicated because "X-Ray" can mean three things: X-Ray as the tracing backend, "X-Ray" as the trace context propagation format, and the trace data that AWS services (not user code) generate and publish to X-Ray. A user may have a backend of their choice that is not X-Ray. However, they will still have to use "X-Ray" trace propagation and access X-Ray to get the traces generated by AWS services.

AWS NEEDS to improve its distributed tracing experience. It should either get all services to support X-Ray propagation, support OTEL trace propagation, have AWS services export X-Ray spans to collectors instead of publishing directly to X-Ray, or give X-Ray an exporter.

joaopgrassi commented 5 months ago

This was closed by mistake by the stale bot. Re-opening