pragmaticivan / nestjs-otel

OpenTelemetry (Tracing + Metrics) module for Nest framework (node.js) 🔭
Apache License 2.0

Distributed tracing correlation issue. #474

ebadfd closed 5 months ago

ebadfd commented 7 months ago

Hey, I have an issue quite similar to #266. My other services send the traceparent header and I'm using W3CTraceContextPropagator, but the spans do not seem to be correlated with the trace ID from the traceparent header. Instead, the initial trace and the spans are created as new traces.

I have attached the tracer.ts file

import { BatchSpanProcessor } from '@opentelemetry/sdk-trace-base';
import { NodeSDK } from '@opentelemetry/sdk-node';
import { AsyncLocalStorageContextManager } from '@opentelemetry/context-async-hooks';
import * as process from 'process';
import { OTLPTraceExporter } from '@opentelemetry/exporter-trace-otlp-http';
import {
  CompositePropagator,
  W3CTraceContextPropagator,
  W3CBaggagePropagator,
} from '@opentelemetry/core';
import { getNodeAutoInstrumentations } from '@opentelemetry/auto-instrumentations-node';
import {
  SemanticResourceAttributes,
  TelemetrySdkLanguageValues,
} from '@opentelemetry/semantic-conventions';
import { Resource } from '@opentelemetry/resources';

// import { PrometheusExporter } from '@opentelemetry/exporter-prometheus';

//const metricReader = new PrometheusExporter({
//  port: 8081,
//});

const traceExporter = new OTLPTraceExporter({
  url: `http://otel-collector.tracing.svc:4318/v1/traces`,
});

const serviceName = 'api';

const spanProcessor = new BatchSpanProcessor(traceExporter);

const otelSDK = new NodeSDK({
  //  metricReader,
  resource: new Resource({
    [SemanticResourceAttributes.SERVICE_NAME]: serviceName,
    [SemanticResourceAttributes.TELEMETRY_SDK_LANGUAGE]: TelemetrySdkLanguageValues.NODEJS,
  }),
  spanProcessor: spanProcessor,
  contextManager: new AsyncLocalStorageContextManager(),
  instrumentations: [
    getNodeAutoInstrumentations({
      '@opentelemetry/instrumentation-fs': { enabled: false },
      '@opentelemetry/instrumentation-dns': { enabled: false },
      '@opentelemetry/instrumentation-net': { enabled: false },
      '@opentelemetry/instrumentation-winston': {
        enabled: true,
        logHook: (span, record) => {
          record['resource.service.name'] = serviceName;
        },
      },
      '@opentelemetry/instrumentation-http': { enabled: true },
      '@opentelemetry/instrumentation-express': { enabled: true },
      '@opentelemetry/instrumentation-aws-sdk': {
        enabled: true,
      },
      '@opentelemetry/instrumentation-redis-4': { enabled: true },
      '@opentelemetry/instrumentation-mongoose': { enabled: true },
    }),
  ],
  textMapPropagator: new CompositePropagator({
    propagators: [new W3CTraceContextPropagator(), new W3CBaggagePropagator()],
  }),
});

export default otelSDK;
// You can also use the shutdown method to gracefully shut down the SDK before process shutdown
// or on some operating system signal.
process.on('SIGTERM', () => {
  otelSDK
    .shutdown()
    .then(
      () => console.log('SDK shut down successfully'),
      (err) => console.log('Error shutting down SDK', err)
    )
    .finally(() => process.exit(0));
});

This is how the traces look at the moment.

Screenshot 2024-04-07 at 00 51 11

Screenshot 2024-04-07 at 00 52 57

pragmaticivan commented 5 months ago

Hi there, could you ensure both services are propagating the same way? I have a POC here https://github.com/pragmaticivan/nestjs-otel-prom-grafana-tempo/blob/main/services/actor/src/instrumentation.ts#L32-L41 which has 2 services and they indeed propagate properly.

pragmaticivan commented 5 months ago

I see from the screenshot you are using Istio.

For that, you also need B3 propagation; check the link above.

Also check this blog post: https://www.aspecto.io/blog/opentelemetry-and-istio-everything-you-need-to-know/

(Note that OpenTelemetry uses, by default, the W3C context propagation specification, while Istio uses the B3 context propagation specification – this can be modified).
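To check which format actually reaches the service, it can help to log the incoming header names before changing any propagator config. The following is a minimal sketch of such a check. `detectPropagationFormats` is a hypothetical helper (not part of OpenTelemetry, Istio, or nestjs-otel), and the sample header values are made up. It only looks at the standard header names: `traceparent` for W3C, `x-b3-traceid` / `x-b3-spanid` for B3 multi-header, and `b3` for B3 single-header.

```typescript
// Hypothetical debugging helper; not part of OpenTelemetry, Istio, or nestjs-otel.
type HeaderMap = Record<string, string | string[] | undefined>;
type PropagationFormat = 'w3c' | 'b3-multi' | 'b3-single';

function detectPropagationFormats(headers: HeaderMap): PropagationFormat[] {
  // HTTP header names are case-insensitive, so normalise before comparing.
  const names = new Set(Object.keys(headers).map((name) => name.toLowerCase()));
  const found: PropagationFormat[] = [];
  if (names.has('traceparent')) found.push('w3c');
  if (names.has('x-b3-traceid') && names.has('x-b3-spanid')) found.push('b3-multi');
  if (names.has('b3')) found.push('b3-single');
  return found;
}

// Sample headers as a B3-only proxy might send them (values are illustrative).
const fromSidecar: HeaderMap = {
  'x-b3-traceid': '0af7651916cd43dd8448eb211c80319c',
  'x-b3-spanid': 'b7ad6b7169203331',
  'x-b3-sampled': '1',
};

console.log(detectPropagationFormats(fromSidecar)); // [ 'b3-multi' ]
```

If the result does not include `w3c`, a propagator list containing only the W3C propagators has no trace context to extract, and the SDK will start a new trace.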

ebadfd commented 4 months ago

Hi @pragmaticivan, sorry about the late reply. I have tested this out, and it still looks like the Istio trace ID is not being assigned as the parent trace.

items to be sent [
  Span {
    attributes: { thisAttribute: 'this is a value set manually' },
    links: [],
    events: [ [Object] ],
    _droppedAttributesCount: 0,
    _droppedEventsCount: 0,
    _droppedLinksCount: 0,
    status: { code: 0 },
    endTime: [ 1718866376, 993068479 ],
    _ended: true,
    _duration: [ 0, 68479 ],
    name: 'healthCheck',
    _spanContext: {
      traceId: 'a813adbbf88fc19c7a359a55d270aca1',
      spanId: 'eb0f53ee9ed3ce9d',
      traceFlags: 1,
      traceState: undefined
    },
    parentSpanId: undefined,
    kind: 0,
    _performanceStartTime: 24690.612481,
    _performanceOffset: -0.0185546875,
    _startTimeProvided: false,
    startTime: [ 1718866376, 993000000 ],
    resource: Resource {
      _attributes: [Object],
      asyncAttributesPending: false,
      _syncAttributes: [Object],
      _asyncAttributesPromise: [Promise]
    },
    instrumentationLibrary: { name: 'basic', version: undefined, schemaUrl: undefined },
    _spanLimits: {
      attributeValueLengthLimit: Infinity,
      attributeCountLimit: 128,
      linkCountLimit: 128,
      eventCountLimit: 128,
      attributePerEventCountLimit: 128,
      attributePerLinkCountLimit: 128
    },
    _spanProcessor: MultiSpanProcessor { _spanProcessors: [Array] },
    _attributeValueLengthLimit: Infinity
  }
]

This is a sample span. Below is the request this service received:

 curl -H 'traceparent: 00-0af7651916cd43dd8448eb211c80319c-b7ad6b7169203331-01' http://localhost:4003/otel-test
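For reference, the mismatch is visible in the two snippets above: the exported span has traceId `a813adbbf88fc19c7a359a55d270aca1` and no parentSpanId, while the request carried trace ID `0af7651916cd43dd8448eb211c80319c`. A rough sketch of that comparison follows. `parseTraceParent` and `sharesTrace` are hypothetical helpers written for this illustration, not OpenTelemetry APIs, and the parser only handles the basic `version-traceid-parentid-flags` layout.

```typescript
// Hypothetical helpers for illustration; not part of OpenTelemetry or nestjs-otel.
interface TraceParent {
  version: string;
  traceId: string;
  parentSpanId: string;
  sampled: boolean;
}

// W3C Trace Context layout: version-traceid-parentid-flags, lowercase hex.
const TRACEPARENT_RE = /^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$/;

function parseTraceParent(header: string): TraceParent | undefined {
  const match = TRACEPARENT_RE.exec(header.trim());
  if (!match) return undefined;
  const [, version, traceId, parentSpanId, flags] = match;
  // All-zero trace or span IDs are invalid.
  if (/^0+$/.test(traceId) || /^0+$/.test(parentSpanId)) return undefined;
  // The lowest bit of the flags byte is the "sampled" flag.
  return { version, traceId, parentSpanId, sampled: (parseInt(flags, 16) & 0x01) === 1 };
}

// True when the span belongs to the trace announced by the incoming header.
function sharesTrace(header: string, spanTraceId: string): boolean {
  const parent = parseTraceParent(header);
  return parent !== undefined && parent.traceId === spanTraceId;
}

// Values copied from the curl request and the exported span in this thread.
const incoming = '00-0af7651916cd43dd8448eb211c80319c-b7ad6b7169203331-01';
const exportedTraceId = 'a813adbbf88fc19c7a359a55d270aca1';

console.log(sharesTrace(incoming, exportedTraceId)); // false
```

Once correlation works, the trace IDs should match, and the server span created for the request should report `b7ad6b7169203331` as its parentSpanId.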

I have also updated my config. This is how it looks now:

import { BatchSpanProcessor } from '@opentelemetry/sdk-trace-base';
import { NodeSDK } from '@opentelemetry/sdk-node';
import * as process from 'process';
import { OTLPTraceExporter } from '@opentelemetry/exporter-trace-otlp-http';
import { getNodeAutoInstrumentations } from '@opentelemetry/auto-instrumentations-node';
import {
  SemanticResourceAttributes,
  TelemetrySdkLanguageValues,
} from '@opentelemetry/semantic-conventions';
import { Resource } from '@opentelemetry/resources';
import { UndiciInstrumentation } from '@opentelemetry/instrumentation-undici';
import {
  CompositePropagator,
  W3CTraceContextPropagator,
  W3CBaggagePropagator,
} from '@opentelemetry/core';
import { AsyncLocalStorageContextManager } from '@opentelemetry/context-async-hooks';

const traceExporter = new OTLPTraceExporter({
  //url: `http://otel-collector.tracing.svc:4318/v1/traces`,
  url: 'http://localhost:4318/v1/traces',
});

const serviceName = 'slocoach-api';

const spanProcessor = new BatchSpanProcessor(traceExporter);

const otelSDK = new NodeSDK({
  spanProcessor: spanProcessor,
  contextManager: new AsyncLocalStorageContextManager(),
  instrumentations: [getNodeAutoInstrumentations(), new UndiciInstrumentation()],
  textMapPropagator: new CompositePropagator({
    propagators: [new W3CTraceContextPropagator(), new W3CBaggagePropagator()],
  }),
});

export default otelSDK;
// You can also use the shutdown method to gracefully shut down the SDK before process shutdown
// or on some operating system signal.
process.on('SIGTERM', () => {
  otelSDK
    .shutdown()
    .then(
      () => console.log('SDK shut down successfully'),
      (err) => console.log('Error shutting down SDK', err)
    )
    .finally(() => process.exit(0));
});

pragmaticivan commented 4 months ago

@z9fr

textMapPropagator: new CompositePropagator({
    propagators: [new W3CTraceContextPropagator(), new W3CBaggagePropagator()],
  }),

^ This is the problem: you also need B3, which Istio requires in order to propagate the trace context.

That's what you want instead: https://github.com/pragmaticivan/nestjs-otel-prom-grafana-tempo/blob/main/services/movie/src/instrumentation.ts#L32-L41
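For readers who cannot open the link, here is a minimal sketch of a propagator list with B3 added. It assumes the `@opentelemetry/propagator-b3` package is installed and is not a copy of the linked file, which may list additional propagators. The rest of the NodeSDK options from the config above are left out for brevity.

```typescript
import { NodeSDK } from '@opentelemetry/sdk-node';
import {
  CompositePropagator,
  W3CTraceContextPropagator,
  W3CBaggagePropagator,
} from '@opentelemetry/core';
// Assumption: @opentelemetry/propagator-b3 has been added as a dependency.
import { B3Propagator, B3InjectEncoding } from '@opentelemetry/propagator-b3';

const otelSDK = new NodeSDK({
  // spanProcessor, contextManager, and instrumentations omitted; keep them as before.
  textMapPropagator: new CompositePropagator({
    propagators: [
      new W3CTraceContextPropagator(),
      new W3CBaggagePropagator(),
      // Default B3 propagator: injects the single `b3` header.
      new B3Propagator(),
      // Second instance: injects the multi-header form (`x-b3-traceid`, `x-b3-spanid`, ...).
      new B3Propagator({ injectEncoding: B3InjectEncoding.MULTI_HEADER }),
    ],
  }),
});

export default otelSDK;
```

Listing both encodings is a cautious choice for when it is unclear which B3 form the mesh expects on outgoing requests; if only one form is in use, a single B3Propagator instance may be enough.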

ebadfd commented 4 months ago

Hey @pragmaticivan, thank you so much for the help. Yes, this seems to be working.