open-telemetry / opentelemetry-js-contrib

OpenTelemetry instrumentation for JavaScript modules
https://opentelemetry.io
Apache License 2.0
700 stars 516 forks source link

SQS traceparent not used when running lambda #1430

Closed ImDevinC closed 1 year ago

ImDevinC commented 1 year ago

What version of OpenTelemetry are you using?

{
...
"dependencies": {
"@opentelemetry/api": "^1.4.1",
"@opentelemetry/auto-instrumentations-node": "^0.36.4",
"@opentelemetry/exporter-trace-otlp-grpc": "^0.35.1",
"@opentelemetry/resource-detector-aws": "^1.2.2",
"@opentelemetry/resources": "^1.10.0",
"@opentelemetry/sdk-trace-base": "^1.10.0",
"@opentelemetry/sdk-trace-node": "^1.10.0"
}
...

What version of Node are you using?

16

What did you do?

I'm running two lambdas in AWS that communicate through an SQS queue. Both lambdas are invoked with NODE_OPTIONS: --require tracing.js to use the same tracing instrumentation that loads ahead of the actual lambda code. Both functions properly create their own unique traces, and the SQS message that gets published does have a MessageAttribute of traceparent=<traceid>. However the consumer lambda (consumer.ts) does not pick up this traceparent when initializing and therefore doesn't attribute the trace properly to the producer.

tracing.ts

import { diag, DiagConsoleLogger, DiagLogLevel } from "@opentelemetry/api";
import { getNodeAutoInstrumentations } from "@opentelemetry/auto-instrumentations-node";
import { OTLPTraceExporter } from "@opentelemetry/exporter-trace-otlp-grpc";
import { registerInstrumentations } from "@opentelemetry/instrumentation";
import { awsLambdaDetector } from "@opentelemetry/resource-detector-aws";
import {
  detectResourcesSync,
  envDetector,
  processDetector,
} from "@opentelemetry/resources";
import { BatchSpanProcessor } from "@opentelemetry/sdk-trace-base";
import { NodeTracerProvider } from "@opentelemetry/sdk-trace-node";

const initTracing = () => {
  diag.setLogger(new DiagConsoleLogger(), DiagLogLevel.DEBUG);
  const resource = detectResourcesSync({
    detectors: [awsLambdaDetector, envDetector, processDetector],
  });

  const provider = new NodeTracerProvider({
    resource: resource,
  });

  const exporter = new OTLPTraceExporter({
    url: 'localhost:4317',
  });

  provider.addSpanProcessor(new BatchSpanProcessor(exporter));

  registerInstrumentations({
    tracerProvider: provider,
    instrumentations: [
      getNodeAutoInstrumentations({
        "@opentelemetry/instrumentation-aws-lambda": {
          disableAwsContextPropagation: true,
        },
        "@opentelemetry/instrumentation-fs": {
          enabled: false, // This is very noisy, and at least on lambda isn't very helpful
        },
      }),
    ],
  });

  provider.register();
};

initTracing();

producer.ts

import { SpanStatusCode, trace } from "@opentelemetry/api";
import { APIGatewayEvent, APIGatewayProxyCallback, Context } from "aws-lambda";
import { SQS } from "aws-sdk";

const tracer = trace.getTracerProvider().getTracer("sample-apm-service");

const postMessage = (body: string) => {
  const sqs = new SQS();
  const req: SQS.SendMessageRequest = {
    QueueUrl: process.env.SQS_URL!,
    MessageBody: body,
  };
  return new Promise<number>((resolve, reject) => {
    const resp = sqs.sendMessage(req, (res) => {
      resolve(res?.statusCode ?? 0);
    });
    resp.on("error", (err) => {
      reject(err);
    });
  });
};

const doWork = async (accountId: string) => {
  return tracer.startActiveSpan("doWork", async (span) => {
    span.setAttribute("accountId", accountId);
    await new Promise((resolve) => setTimeout(resolve, 5000));
    try {
      failedSpan();
    } catch (error) {
      span.recordException(error);
      span.setStatus({
        code: SpanStatusCode.ERROR,
      });
    }
    span.end();
  });
};

const failedSpan = () => {
  const span = tracer.startSpan("failedSpan");
  try {
    throw new Error("expected failure");
  } catch (error) {
    span.recordException(error);
    span.setStatus({
      code: SpanStatusCode.ERROR,
    });
    span.end();
    throw error;
  }
  span.end();
};

export const handler = async (
  event: APIGatewayEvent,
  context: Context,
  callback: APIGatewayProxyCallback
) => {
  const accountId = "unknown";
  try {
    await doWork(accountId);
    const result = await postMessage(accountId);
    console.log(`StatusCode: ${result}`);
    return {
      statusCode: result,
    };
  } catch (error) {
    console.log(error);
    return {
      statusCode: 400,
      body: error.message,
    };
  }
}

consumer.ts

import { trace } from "@opentelemetry/api";
import { Context, SQSEvent } from "aws-lambda";

const tracer = trace.getTracerProvider().getTracer("message-consumer");

const doWork = async (accountId: string) => {
  const span = tracer.startSpan("dowork");
  console.log("doing event", accountId);
  span.setAttribute("accountId", accountId);
  await new Promise((resolve) => setTimeout(resolve, 5000));
  span.end();
};

export const handler = async (event: SQSEvent, context: Context) => {
  for (const record of event.Records) {
    await tracer.startActiveSpan("handler",  async (span) => {
        await doWork(record.body);
        span.end();
      }
    );
  }
};

What did you expect to see?

I would expect to see both the consumer and parent trace properly linked together.

What did you see instead?

This produces two distinct traces that are not linked together.

dyladan commented 1 year ago

@willarmiros can you look at this since you're the component owner? I think this is more of a feature request to support trace headers in sqs message attributes than it is a bug.

ImDevinC commented 1 year ago

Thanks @dyladan. I filed as a bug because a trace header is already added to sqs message attributes automatically, but maybe I'm misunderstanding it's purpose

dyladan commented 1 year ago

Added by the sqs instrumentation or by sqs itself? I think the messaging group typically recomends using links instead of parent span

ImDevinC commented 1 year ago

I'm 99% sure it's the sqs instrumentation adding the context, as I'm not specifying any message attributes when I build the message in producer.ts

  const req: SQS.SendMessageRequest = {
    QueueUrl: process.env.SQS_URL!,
    MessageBody: body,
  };
  return new Promise<number>((resolve, reject) => {
    const resp = sqs.sendMessage(req, (res) => {
      resolve(res?.statusCode ?? 0);
    });
willarmiros commented 1 year ago

I think the messaging group typically recommends using links instead of parent span

This is correct, since it is possible for several SQS messages to be batched into a single invocation of a Lambda function, the correct behavior per the spec is to start a new trace in the Lambda function (or use the Lambda's X-Ray trace context since it is always a new trace) and link it to all upstream SQS Message trace(s).

I'm 99% sure it's the sqs instrumentation adding the context

I believe both the Lambda AND SQS instrumentations need to be updated to comply with this behavior. Lambda is a special case because the handler is provided the SQS Messages directly in its payload, whereas other applications that are pulling from SQS directly use the SQS client. So both the Lambda handler and SQS Client are possible entry points for this instrumentation.

FWIW, we are planning on implementing this behavior in the coming months.

ImDevinC commented 1 year ago

Thanks, the linking is what I was missing. Appreciate all the insight, and glad to hear that autoinstrumentation of this is on the roadmap. I'll close this since you solved my issue