open-telemetry / opentelemetry-js

OpenTelemetry JavaScript Client
https://opentelemetry.io
Apache License 2.0

BatchSpanProcessor doesn't re-enable tracing after flushing #4922

Open alexandros-megas opened 1 month ago

alexandros-megas commented 1 month ago

What happened?

Steps to Reproduce

I'm instrumenting a Next.js Node app, and in it I'm using the BatchSpanProcessor, since the docs seemed to recommend it for performance reasons.

We're using Node's NODE_OPTIONS require hook to ensure that this module is executed before anything else in Next.js runs. Our Helm chart includes the following in the env section:

NODE_OPTIONS: --require ./server/instrumentation.js

In one of my page's getServerSideProps function, I'm wrapping the data fetching logic in a span like the following:

  return getTracer().startActiveSpan('ExplorePage/getServerSideProps', async (span) => {
    span.setAttributes({ path: ctx.resolvedUrl });

    // fetch a bunch of data...
    const data = await fetchData();

    const result = {
      props: {
        data,
        traceProps: {
          spanId: span.spanContext().spanId,
          traceId: span.spanContext().traceId,
        },
      },
    };
    span.end();
    return result;
  });
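As an aside on the snippet above: because `span.end()` is only called on the happy path, a rejection from `fetchData()` would leave the span un-ended and it would never be queued for export. A sketch of the same shape with `finally` (the stubs below stand in for the issue's `getTracer`/`fetchData`, which are not shown in the thread):

```javascript
// Stubs standing in for the app's real helpers (hypothetical, for
// illustration only).
const fakeSpan = { ended: false, end() { this.ended = true; } };
const getTracer = () => ({
  // Mirrors the tracer API shape used above: run the callback with a span.
  startActiveSpan: (_name, fn) => fn(fakeSpan),
});
const fetchData = async () => { throw new Error('boom'); };

getTracer()
  .startActiveSpan('ExplorePage/getServerSideProps', async (span) => {
    try {
      return { props: { data: await fetchData() } };
    } finally {
      // Runs on both success and failure, so the span is always ended
      // and eligible for batching/export.
      span.end();
    }
  })
  .catch(() => console.log('span ended despite error:', fakeSpan.ended)); // true
```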

Expected Result

I would expect the span processor to batch up some spans, and then send them to the exporter.

Actual Result

Seems like nothing is happening. No spans are sent to our collector endpoint.

Additional Details

I set the OTEL diag level to ALL so that I can see everything happening, and I noticed this line:

Instrumentation suppressed, returning Noop Span

After doing some investigation, I noticed that BatchSpanProcessor suppresses tracing (presumably) so that its own requests don't emit additional spans, but I don't see anywhere that it gets re-enabled at the end. Shouldn't it be re-enabling tracing so that other spans work correctly?

OpenTelemetry Setup Code

const util = require('util');
const { diag, DiagLogLevel, DiagConsoleLogger } = require('@opentelemetry/api');
const process = require('process');
const opentelemetry = require('@opentelemetry/sdk-node');
const { NodeTracerProvider, BatchSpanProcessor } = require('@opentelemetry/sdk-trace-node');
const { SEMRESATTRS_SERVICE_NAME, SEMRESATTRS_SERVICE_VERSION } = require('@opentelemetry/semantic-conventions');
const { Resource } = require('@opentelemetry/resources');
const { HttpInstrumentation } = require('@opentelemetry/instrumentation-http');
const { ExpressInstrumentation } = require('@opentelemetry/instrumentation-express');
const { registerInstrumentations } = require('@opentelemetry/instrumentation');
const { OTLPTraceExporter } = require('@opentelemetry/exporter-trace-otlp-http');
const { publicRuntimeConfig: configuration } = require('./configuration');

const traceEndpoint = process.env.OTEL_TRACE_EXPORTER_ENDPOINT;

/**
 * Set default options for util.inspect, so that console logging of objects
 * prints on a single line, and doesn't truncate anything.
 */
util.inspect.defaultOptions = {
  ...(util.inspect.defaultOptions || {}),
  breakLength: Number.POSITIVE_INFINITY,
  compact: true,
};

if (!traceEndpoint) {
  console.info('OTEL_TRACE_EXPORTER_ENDPOINT not set. Skipping OTEL SDK registration');
}
else {
  console.info('Registering OTEL SDK...');

  diag.setLogger(new DiagConsoleLogger(), DiagLogLevel.DEBUG);

  registerInstrumentations({
    instrumentations: [
      new HttpInstrumentation(),
      new ExpressInstrumentation(),
    ],
  });

  const traceExporter = new OTLPTraceExporter({
    url: traceEndpoint,
    headers: {},
  });
  console.info(`OTLPTraceExporter created with url ${traceEndpoint}`);

  const resource = Resource.default().merge(
    new Resource({
      [SEMRESATTRS_SERVICE_NAME]: configuration.NAME,
      [SEMRESATTRS_SERVICE_VERSION]: configuration.VERSION,
    }),
  );

  // Create and register an SDK
  const sdk = new opentelemetry.NodeSDK({
    serviceName: resource.attributes[SEMRESATTRS_SERVICE_NAME],
    autoDetectResources: true,
    resource,
    traceExporter,
  });

  // Create a provider for activating and tracking spans
  const tracerProvider = new NodeTracerProvider({
    resource,
  });

  const spanProcessor = new BatchSpanProcessor(traceExporter);

  tracerProvider.addSpanProcessor(spanProcessor);

  sdk.start();

  // Handle shutdown
  const shutdown = () => {
    return sdk
      .shutdown().then(
        () => console.log('SDK shut down successfully'),
        (err) => console.log('Error shutting down SDK', err),
      ).finally(() => process.exit(0));
  };
  process.on('exit', shutdown);
  process.on('SIGINT', shutdown);
  process.on('SIGTERM', shutdown);

  console.info('OTEL SDK Ready.');
}

package.json

"@opentelemetry/api": "^1.9.0",
"@opentelemetry/exporter-metrics-otlp-http": "^0.52.1",
"@opentelemetry/exporter-trace-otlp-http": "^0.52.1",
"@opentelemetry/instrumentation": "^0.52.1",
"@opentelemetry/instrumentation-express": "^0.41.0",
"@opentelemetry/instrumentation-http": "^0.52.1",
"@opentelemetry/resources": "^1.25.1",
"@opentelemetry/sdk-node": "^0.52.1",
"@opentelemetry/sdk-trace-base": "^1.25.1",
"@opentelemetry/sdk-trace-node": "^1.25.1",
"@opentelemetry/semantic-conventions": "^1.25.1",

Relevant log output

$ next start -p 3002
Registering OTEL SDK...
@opentelemetry/api: Registered a global for diag v1.9.0.
OTLPTraceExporter created with url http://otel-collector:14268/api/traces?format=jaeger.thrift
@opentelemetry/api: Registered a global for trace v1.9.0.
@opentelemetry/api: Registered a global for context v1.9.0.
@opentelemetry/api: Registered a global for propagation v1.9.0.
OTEL SDK Ready.
@opentelemetry/instrumentation-http Applying instrumentation patch for nodejs core module on require hook { module: 'http' }
@opentelemetry/instrumentation-http Applying instrumentation patch for nodejs core module on require hook { module: 'https' }
Accessing resource attributes before async attributes settled
HostDetector found resource. Resource { _attributes: { 'host.name': 'clarifai-web-2952-746648b8cf-x5gqh', 'host.arch': 'amd64' }, asyncAttributesPending: false, _syncAttributes: {}, _asyncAttributesPromise: Promise { { 'host.name': 'clarifai-web-2952-746648b8cf-x5gqh', 'host.arch': 'amd64' } } }
ProcessDetector found resource. Resource { _attributes: { 'process.pid': 29, 'process.executable.name': '/usr/local/bin/node', 'process.executable.path': '/usr/local/bin/node', 'process.command_args': [ '/usr/local/bin/node', '/app/node_modules/.bin/next', 'start', '-p', '3002' ], 'process.runtime.version': '20.16.0', 'process.runtime.name': 'nodejs', 'process.runtime.description': 'Node.js', 'process.command': '/app/node_modules/.bin/next', 'process.owner': 'root' }, asyncAttributesPending: false, _syncAttributes: {}, _asyncAttributesPromise: Promise { { 'process.pid': 29, 'process.executable.name': '/usr/local/bin/node', 'process.executable.path': '/usr/local/bin/node', 'process.command_args': [Array], 'process.runtime.version': '20.16.0', 'process.runtime.name': 'nodejs', 'process.runtime.description': 'Node.js', 'process.command': '/app/node_modules/.bin/next', 'process.owner': 'root' } } }
EnvDetector found resource. Resource { _attributes: {}, asyncAttributesPending: false, _syncAttributes: {}, _asyncAttributesPromise: Promise { {} } }
ready - started server on 0.0.0.0:3002, url: http://localhost:3002
error reading machine id: Error: ENOENT: no such file or directory, open '/etc/machine-id'
error reading machine id: Error: ENOENT: no such file or directory, open '/var/lib/dbus/machine-id'
@opentelemetry/instrumentation-http http instrumentation incomingRequest
@opentelemetry/instrumentation-http http instrumentation incomingRequest
@opentelemetry/instrumentation-http http instrumentation incomingRequest
items to be sent [ Span { attributes: { 'http.url': 'http://10.1.42.90:3002/explore', 'http.host': '10.1.42.90:3002', 'net.host.name': '10.1.42.90', 'http.method': 'GET', 'http.scheme': 'http', 'http.target': '/explore', 'http.user_agent': 'ELB-HealthChecker/2.0', 'http.flavor': '1.1', 'net.transport': 'ip_tcp', 'http.status_code': 200, 'http.status_text': '' }, links: [], events: [], _droppedAttributesCount: 0, _droppedEventsCount: 0, _droppedLinksCount: 0, status: { code: 0 }, endTime: [ 1723654454, 455285129 ], _ended: true, _duration: [ 0, 9285129 ], name: 'GET', _spanContext: { traceId: 'e6efceeef07848fe435b3a3592832e43', spanId: 'dcb8621e4d4afe31', traceFlags: 1, traceState: undefined }, parentSpanId: undefined, kind: 1, _performanceStartTime: 9397.318576, _performanceOffset: -0.1025390625, _startTimeProvided: false, startTime: [ 1723654454, 446000000 ], resource: Resource { _attributes: [Object], asyncAttributesPending: false, _syncAttributes: [Object], _asyncAttributesPromise: [Promise] }, instrumentationLibrary: { name: '@opentelemetry/instrumentation-http', version: '0.52.1', schemaUrl: undefined }, _spanLimits: { attributeValueLengthLimit: Infinity, attributeCountLimit: 128, linkCountLimit: 128, eventCountLimit: 128, attributePerEventCountLimit: 128, attributePerLinkCountLimit: 128 }, _attributeValueLengthLimit: Infinity, _spanProcessor: MultiSpanProcessor { _spanProcessors: [Array] } }, Span { attributes: { 'http.url': 'http://10.1.42.90:3002/explore', 'http.host': '10.1.42.90:3002', 'net.host.name': '10.1.42.90', 'http.method': 'GET', 'http.scheme': 'http', 'http.target': '/explore', 'http.user_agent': 'ELB-HealthChecker/2.0', 'http.flavor': '1.1', 'net.transport': 'ip_tcp', 'http.status_code': 200, 'http.status_text': '' }, links: [], events: [], _droppedAttributesCount: 0, _droppedEventsCount: 0, _droppedLinksCount: 0, status: { code: 0 }, endTime: [ 1723654461, 147016249 ], _ended: true, _duration: [ 6, 694016249 ], name: 'GET', _spanContext: { 
traceId: '2d4d29a1b4da6f496e2f95ef79522347', spanId: 'c0a848223d4d26de', traceFlags: 1, traceState: undefined }, parentSpanId: undefined, kind: 1, _performanceStartTime: 9404.422502, _performanceOffset: -0.20654296875, _startTimeProvided: false, startTime: [ 1723654454, 453000000 ], resource: Resource { _attributes: [Object], asyncAttributesPending: false, _syncAttributes: [Object], _asyncAttributesPromise: [Promise] }, instrumentationLibrary: { name: '@opentelemetry/instrumentation-http', version: '0.52.1', schemaUrl: undefined }, _spanLimits: { attributeValueLengthLimit: Infinity, attributeCountLimit: 128, linkCountLimit: 128, eventCountLimit: 128, attributePerEventCountLimit: 128, attributePerLinkCountLimit: 128 }, _attributeValueLengthLimit: Infinity, _spanProcessor: MultiSpanProcessor { _spanProcessors: [Array] } } ]
Instrumentation suppressed, returning Noop Span
@opentelemetry/instrumentation-http http instrumentation outgoingRequest
@opentelemetry/instrumentation-http http.ClientRequest return request
@opentelemetry/instrumentation-http outgoingRequest on response()
@opentelemetry/instrumentation-http outgoingRequest on end()
@opentelemetry/instrumentation-http outgoingRequest on request close()
Bad Request
@opentelemetry/instrumentation-http http instrumentation incomingRequest
@opentelemetry/instrumentation-http http.ClientRequest return request
@opentelemetry/instrumentation-http https instrumentation outgoingRequest
@opentelemetry/instrumentation-http http instrumentation incomingRequest
@opentelemetry/instrumentation-http http.ClientRequest return request
@opentelemetry/instrumentation-http https instrumentation outgoingRequest
@opentelemetry/instrumentation-http http instrumentation incomingRequest
@opentelemetry/instrumentation-http https instrumentation outgoingRequest
@opentelemetry/instrumentation-http http.ClientRequest return request
@opentelemetry/instrumentation-http http instrumentation incomingRequest
@opentelemetry/instrumentation-http http.ClientRequest return request
@opentelemetry/instrumentation-http https instrumentation outgoingRequest
@opentelemetry/instrumentation-http http instrumentation incomingRequest
@opentelemetry/instrumentation-http http.ClientRequest return request
@opentelemetry/instrumentation-http https instrumentation outgoingRequest
@opentelemetry/instrumentation-http http instrumentation incomingRequest
@opentelemetry/instrumentation-http http.ClientRequest return request
@opentelemetry/instrumentation-http https instrumentation outgoingRequest
@opentelemetry/instrumentation-http http instrumentation incomingRequest
@opentelemetry/instrumentation-http https instrumentation outgoingRequest
@opentelemetry/instrumentation-http http.ClientRequest return request
@opentelemetry/instrumentation-http outgoingRequest on response()
@opentelemetry/instrumentation-http outgoingRequest on end()
@opentelemetry/instrumentation-http http instrumentation incomingRequest
@opentelemetry/instrumentation-http https instrumentation outgoingRequest
# ... and so on
alexandros-megas commented 1 month ago

One small note regarding the logs, some lines might be out of order, since they are collected by our cluster and aggregated in our k8s opensearch operator for searching/etc.

trentm commented 1 month ago

but I don't see anywhere that it gets re-enabled at the end. Shouldn't it be re-enabling tracing so that other spans work correctly?

The suppressTracing call only impacts the async task for the batched OTLP export request. It doesn't have a global impact on the whole process. This is very likely not the cause of the issue you are hitting.
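To illustrate why no "re-enable" step is needed: suppression lives on an immutable context object, not in global state, so it only affects callbacks run *with* that context. The following is a dependency-free sketch of that mechanism (the real implementation is in @opentelemetry/api and @opentelemetry/core; the names below just mirror it):

```javascript
// Minimal model of context-scoped suppression. Each setValue returns a
// NEW context; the previous one is untouched.
const SUPPRESS_KEY = Symbol('suppress_instrumentation');

function makeContext(entries = new Map()) {
  return {
    getValue: (key) => entries.get(key),
    setValue: (key, value) => makeContext(new Map(entries).set(key, value)),
  };
}

let activeContext = makeContext();

// Run fn with ctx active, then restore the previous context.
function contextWith(ctx, fn) {
  const previous = activeContext;
  activeContext = ctx;
  try { return fn(); } finally { activeContext = previous; }
}

const suppressTracing = (ctx) => ctx.setValue(SUPPRESS_KEY, true);
const isSuppressed = () => activeContext.getValue(SUPPRESS_KEY) === true;

// The exporter's send runs inside the suppressed context...
contextWith(suppressTracing(activeContext), () => {
  console.log('inside export task:', isSuppressed()); // true
});
// ...but application code outside that context is unaffected:
console.log('outside export task:', isSuppressed()); // false
```

Since the suppressed context is discarded when the export task finishes, there is nothing to "re-enable" afterwards.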

trentm commented 1 month ago
items to be sent [ Span { attributes: { 'http.url': 'http://10.1.42.90:3002/explore', 

That items to be sent ... log line is from OTLPExporterBase at least attempting to send some tracing data on. Can you confirm you are getting some spans sent to your backend?

And then if so, is the issue that you don't get any spans sent after that initial send?

I'm not sure what that Bad Request log line is: whether it is from the OTel libs or your own code. It could be fine, I'm just not sure.


In one of my page's getServerSideProps function, I'm wrapping the data fetching logic in a span like the following:

  return getTracer()

On a hunch, could you console.log that tracer? E.g. use something like:

const tracer = getTracer()
console.log('XXX the tracer', tracer);
return tracer.startActiveSpan(...

That might help clarify if there is an actual tracer there, or a NoopTracer because of some surprise.


Next.js often bundles up the code being executed. OpenTelemetry support for bundled code ranges from extremely limited to totally broken. For example, the core node http module will be instrumented, but not any other packages (like a database client or whatever). That might not matter to you.


Next.js has native OpenTelemetry support: https://nextjs.org/docs/app/building-your-application/optimizing/open-telemetry Are you enabling any of that? Their registerOTel() will register a global tracing provider that would potentially conflict with a separate one that you have installed.

alexandros-megas commented 1 month ago

Hi Trent, thanks for your response. To answer some of your questions/points:

  1. I can confirm we are receiving NO traces in our backend.
  2. We did not use @vercel/otel because in the self hosting section of their documentation, they seem to recommend creating a custom configuration module anyway.
  3. I will insert some logging around the tracer that is returned from our getTracer() utility, and post back my results.

Thanks!

alexandros-megas commented 1 month ago

When I log the tracer, I see the following output. It does look like it's a real tracer that isn't a Noop tracer.

Acquired tracer: <ref *1> Tracer {
  _tracerProvider: NodeTracerProvider {
    _registeredSpanProcessors: [ [BatchSpanProcessor] ],
    _tracers: Map(2) {
      '@opentelemetry/instrumentation-http@0.52.1:' => [Tracer],
      'clarifai-web@0.1.0:' => [Circular *1]
    },
    resource: Resource {
      _attributes: [Object],
      asyncAttributesPending: false,
      _syncAttributes: [Object],
      _asyncAttributesPromise: [Promise]
    },
    _config: {
      sampler: [ParentBasedSampler],
      forceFlushTimeoutMillis: 30000,
      generalLimits: [Object],
      spanLimits: [Object],
      serviceName: 'clarifai-web',
      autoDetectResources: true,
      resource: [Resource],
      traceExporter: [OTLPTraceExporter]
    },
    activeSpanProcessor: MultiSpanProcessor {
      _spanProcessors: [Array]
    }
  },
  _sampler: ParentBasedSampler {
    _root: AlwaysOnSampler {},
    _remoteParentSampled: AlwaysOnSampler {},
    _remoteParentNotSampled: AlwaysOffSampler {},
    _localParentSampled: AlwaysOnSampler {},
    _localParentNotSampled: AlwaysOffSampler {}
  },
  _generalLimits: {
    attributeValueLengthLimit: Infinity,
    attributeCountLimit: 128
  },
  _spanLimits: {
    attributeValueLengthLimit: Infinity,
    attributeCountLimit: 128,
    linkCountLimit: 128,
    eventCountLimit: 128,
    attributePerEventCountLimit: 128,
    attributePerLinkCountLimit: 128
  },
  _idGenerator: RandomIdGenerator {
    generateTraceId: [Function: generateId],
    generateSpanId: [Function: generateId]
  },
  resource: Resource {
    _attributes: {
      'service.name': 'clarifai-web',
      'telemetry.sdk.language': 'nodejs',
      'telemetry.sdk.name': 'opentelemetry',
      'telemetry.sdk.version': '1.25.1',
      'service.version': '0.1.0',
      'process.pid': 29,
      'process.executable.name': '/usr/local/bin/node',
      'process.executable.path': '/usr/local/bin/node',
      'process.command_args': [Array],
      'process.runtime.version': '20.16.0',
      'process.runtime.name': 'nodejs',
      'process.runtime.description': 'Node.js',
      'process.command': '/app/node_modules/.bin/next',
      'process.owner': 'root',
      'host.name': 'clarifai-web-2952-6747bd448-fvtbv',
      'host.arch': 'amd64'
    },
    asyncAttributesPending: false,
    _syncAttributes: {
      'service.name': 'clarifai-web',
      'telemetry.sdk.language': 'nodejs',
      'telemetry.sdk.name': 'opentelemetry',
      'telemetry.sdk.version': '1.25.1',
      'service.version': '0.1.0'
    },
    _asyncAttributesPromise: Promise { [Object] }
  },
  instrumentationLibrary: {
    name: 'clarifai-web',
    version: '0.1.0',
    schemaUrl: undefined
  }
}

I did my best to manually format it, because it was all printed on a single line.

alexandros-megas commented 1 month ago

Side note, I also tried swapping in the SimpleSpanProcessor in place of the BatchSpanProcessor, with the same result, so I guess that's +1 on the BSP not being the root cause of the issue.

trentm commented 3 weeks ago

I just noticed this in your original log:

OTLPTraceExporter created with url http://otel-collector:14268/api/traces?format=jaeger.thrift

Is that right? The jaeger and thrift references seem wrong. Do you possibly have the OTEL_TRACE_EXPORTER_ENDPOINT env var pointing to a collector endpoint that expects something other than OTLP?

Port 14268 is a typical Jaeger port. My guess is you are sending OTLP/HTTP+JSON to Jaeger and it doesn't like it. That might explain the "Bad Request" log output -- though it is unfortunate that whatever code is logging that does not give some context.
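For reference, an OTLP/HTTP traces endpoint conventionally uses port 4318 and the `/v1/traces` path, while `14268` + `/api/traces` is Jaeger's Thrift collector endpoint, which is exactly what the log line above shows. A tiny dependency-free sanity check along those lines (a heuristic sketch, not an official validator):

```javascript
// Heuristic: does this URL look like an OTLP/HTTP traces endpoint?
// 4318 + /v1/traces is the OTLP/HTTP convention; 14268 + /api/traces
// belongs to Jaeger's Thrift collector.
function looksLikeOtlpHttpTraces(endpoint) {
  const url = new URL(endpoint);
  return url.pathname === '/v1/traces' && url.port !== '14268';
}

// The endpoint from the issue's log output:
console.log(looksLikeOtlpHttpTraces(
  'http://otel-collector:14268/api/traces?format=jaeger.thrift',
)); // false
// What an OTLP/HTTP endpoint would typically look like:
console.log(looksLikeOtlpHttpTraces('http://otel-collector:4318/v1/traces')); // true
```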