open-telemetry / opentelemetry-js

OpenTelemetry JavaScript Client
https://opentelemetry.io
Apache License 2.0
2.77k stars 810 forks source link

DEADLINE_EXCEEDED errors in GCP CF #3602

Closed AkselAllas closed 1 year ago

AkselAllas commented 1 year ago

What happened?

I try to have a GCP Cloudfunction which flushes telemetry at end of it's code. I think some GRPC call is still being tried to be sent, but GCP removes network and we get DEADLINE_EXCEEDED

Steps to Reproduce

  try {
    functionCodeHere()
  } finally {
    await spanProcessor.forceFlush()
  }

Expected Result

We just get our spans

Actual Result

We get spans, but also:

Service request { resourceSpans: [ { resource: [Object], scopeSpans: [Array], schemaUrl: undefined } ] }

and

{"stack":"Error: 4 DEADLINE_EXCEEDED: Deadline exceeded\n at callErrorFromStatus (/layers/google.nodejs.yarn/yarn_modules/node_modules/@grpc/grpc-js/src/call.ts:81:17)
...

Additional Details

OpenTelemetry Setup Code

Pretty much this https://github.com/AkselAllas/repro-otel-bug But also add

    "@opentelemetry/instrumentation-grpc": "0.34.0",

and instrument it

package.json

No response

Relevant log output

{"stack":"Error: 4 DEADLINE_EXCEEDED: Deadline exceeded\n at callErrorFromStatus (/layers/google.nodejs.yarn/yarn_modules/node_modules/@grpc/grpc-js/src/call.ts:81:17)\n at Object.onReceiveStatus (/layers/google.nodejs.yarn/yarn_modules/node_modules/@grpc/grpc-js/src/client.ts:356:55)\n at Object.onReceiveStatus (/layers/google.nodejs.yarn/yarn_modules/node_modules/@grpc/grpc-js/src/client-interceptors.ts:455:34)\n at Object.onReceiveStatus (/layers/google.nodejs.yarn/yarn_modules/node_modules/@grpc/grpc-js/src/client-interceptors.ts:417:48)\n at /layers/google.nodejs.yarn/yarn_modules/node_modules/@grpc/grpc-js/src/resolving-call.ts:110:24\n at processTicksAndRejections (node:internal/process/task_queues:78:11)\n at runNextTicks (node:internal/process/task_queues:65:3)\n at processTimers (node:internal/timers:499:9)\nfor call at\n at ServiceClientImpl.makeUnaryRequest (/layers/google.nodejs.yarn/yarn_modules/node_modules/@grpc/grpc-js/src/client.ts:326:30)\n at ServiceClientImpl.<anonymous> (/layers/google.nodejs.yarn/yarn_modules/node_modules/@grpc/grpc-js/src/make-client.ts:189:15)\n at OTLPTraceExporter.send [as _send] (/layers/google.nodejs.yarn/yarn_modules/node_modules/@opentelemetry/otlp-grpc-exporter-base/src/util.ts:92:35)\n at /layers/google.nodejs.yarn/yarn_modules/node_modules/@opentelemetry/otlp-grpc-exporter-base/src/OTLPGRPCExporterNodeBase.ts:65:12\n at new Promise (<anonymous>)\n at OTLPTraceExporter._sendPromise (/layers/google.nodejs.yarn/yarn_modules/node_modules/@opentelemetry/otlp-grpc-exporter-base/src/OTLPGRPCExporterNodeBase.ts:64:21)\n at OTLPTraceExporter.send (/layers/google.nodejs.yarn/yarn_modules/node_modules/@opentelemetry/otlp-grpc-exporter-base/src/OTLPGRPCExporterNodeBase.ts:107:12)\n at /layers/google.nodejs.yarn/yarn_modules/node_modules/@opentelemetry/otlp-exporter-base/src/OTLPExporterBase.ts:99:14\n at new Promise (<anonymous>)\n at OTLPTraceExporter._export (/layers/google.nodejs.yarn/yarn_modules/node_modules/@opentelemetry/otlp-exporter-base/src/OTLPExporterBase.ts:96:12)","message":"4 DEADLINE_EXCEEDED: Deadline exceeded","code":"4","details":"Deadline exceeded","metadata":"[object Object]","name":"Error"}
AkselAllas commented 1 year ago

This issue is pretty much https://github.com/open-telemetry/opentelemetry-js/issues/1739

I guess I have to figure out what spans are being attempted to export.

AkselAllas commented 1 year ago

This is the span it's trying to export.

{
traceId: xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx',
parentId: 'xxxxxxxxxxxxxxxxxxxx',
name: 'grpc.google.pubsub.v1.Publisher/Publish',
id: '72cc057f5cac5a4a',
kind: 2,
timestamp: 1676030779770028,
duration: 74799611,
attributes: {
'rpc.system': 'grpc',
'rpc.method': 'Publish',
'rpc.service': 'google.pubsub.v1.Publisher',
'rpc.grpc.status_code': '4',
'grpc.error_name': 'Error',
'grpc.error_message': '4 DEADLINE_EXCEEDED: Deadline exceeded'
},
status: { code: 2 },
events: [],
links: []
}

Related error and probably related issue

Exception from a finished function: Error: Total timeout of API google.pubsub.v1.Publisher exceeded 60000 milliseconds before any response was received.

So looks like it's pub-sub deadline exceeding, then GRPC instrumentation wanting to export it as a span 🤔 Working as intended? 🤔

AkselAllas commented 1 year ago

Is there a way of suppressing these errors? 🤔

aabmass commented 1 year ago

I try to have a GCP Cloudfunction which flushes telemetry at end of it's code.

Are you using Cloud Functions gen 1 or 2? Cloud Functions gen 2 is built on Cloud Run which supports graceful shutdown by capturing SIGTERM (example code). I'd recommend using a SIGTERM handler to call TracerProvider.shutdown() and otherwise not flush telemetry at the end of your code.

try {
    functionCodeHere()
  } finally {
    await spanProcessor.forceFlush()
  }

If you take the approach of calling TracerProvider.shutdown() once, all new traces will be dropped after that point (code). I believe that would fix your issue of spans being sent later than you intend?

We get spans, but also:

Service request { resourceSpans: [ { resource: [Object], scopeSpans: [Array], schemaUrl: undefined } ] }

and

{"stack":"Error: 4 DEADLINE_EXCEEDED: Deadline exceeded\n at callErrorFromStatus (/layers/google.nodejs.yarn/yarn_modules/node_modules/@grpc/grpc-js/src/call.ts:81:17)
...

I believe what's happening here is

  1. Call forceFlush(), which causes any already ended spans to be flushed
  2. Something attempts to use pubsub in the background. IIRC the PubSub NodeJS client will batch spans by default, which is probably what's happening here. This is configurable (docs).
  3. The PubSub call uses gRPC; since gRPC it is instrumented this creates another span.
  4. Independently PubSub request fails because, as you mentioned, the runtime has shutdown your network access or throttled your code.

So looks like it's pub-sub deadline exceeding, then GRPC instrumentation wanting to export it as a span 🤔 Working as intended? 🤔

Yes I think everything is WAI. To summarize, I'd recommend calling TracerProvider.shutdown() on SIGTERM and also tuning your PubSub batching depending on what your goal is here. You may also be able to flush pubsub messages in SIGTERM handler or at the end of your Cloud Function.

aabmass commented 1 year ago

Is there a way of suppressing these errors? 🤔

I'm not sure what you mean by suppressing. Are these just logs or runtime thrown exceptions? You can configure logs emitted from OpenTelemetry with the diag API https://opentelemetry.io/docs/instrumentation/js/getting-started/nodejs/#setup.

dyladan commented 1 year ago

Changing this from bug to question since it doesn't seem like we're directly causing this with a bug but it might be a usage issue.

aabmass commented 1 year ago

@AkselAllas are we OK to close this issue now?

AkselAllas commented 1 year ago

Looks like. I will still try to confirm that I can remove it via pubsub config changes.

Happened on CF v1.

AkselAllas commented 2 months ago

Stumbled on this issue again:

Root cause for these errors was both spanProcessor and metricExporter having 5sec or 15sec periodic export windows. Problem is when Cloud Function (Or also CPU detachable Cloud Run) detaches network and after that code still runs export function.

To fix this, we detect whether our code is running in possible detachable container and if it is, then we set export periods from 5/15 sec to 2147483,647 sec.

AkselAllas commented 2 months ago

@dyladan @aabmass If possible, ideally this info could be written to all sorts documentation spots and examples.