temporalio / sdk-typescript

Temporal TypeScript SDK
Other
545 stars 109 forks source link

[Feature Request] Add OTEL tracing information as attribute on log messages emitted using Workflow Logger #1425

Closed lucaskatayama closed 2 weeks ago

lucaskatayama commented 6 months ago

What are you really trying to do?

Get OpenTelemetry trace_id and span_id fields on Workflow logs.

Describe the bug

The tracing fields from OpenTelemetry (trace_id and span_id) is not appearing on Workflow logs specifically.

It works on Client and Activities.

Calling worflow.log.error:

{"label":"workflow","level":"error","message":"hello","namespace":"default","runId":"add5d842-ca36-4fee-8e51-624d7b2bb58c","sdkComponent":"workflow","taskQueue":"interceptors-opentelemetry-example","timestamp":1716328238734,"workflowId":"otel-example-0","workflowType":"example"}

Calling activity.log.error

{"activityId":"1","activityType":"greet","attempt":1,"isLocal":false,"label":"activity","level":"error","message":"alskjdbnkajsdbnkjandsjk","namespace":"default","sdkComponent":"activity","span_id":"10036d093b04a315","taskQueue":"interceptors-opentelemetry-example","taskToken":"CiRjYmI4ODI1My0wOWMyLTRiMzMtOWI3YS05MTQ5ODFhMzdjM2USDm90ZWwtZXhhbXBsZS0wGiRhZGQ1ZDg0Mi1jYTM2LTRmZWUtOGU1MS02MjRkN2IyYmI1OGMgBSgBMgExQgVncmVldEoICAEQxo5AGAE=","timestamp":1716328238737,"trace_flags":"01","trace_id":"f2731bb716c062624923e55b29f8fcec","workflowId":"otel-example-0","workflowRunId":"add5d842-ca36-4fee-8e51-624d7b2bb58c","workflowType":"example"}

Which gives "span_id":"10036d093b04a315" and "trace_id":"f2731bb716c062624923e55b29f8fcec".

Minimal Reproduction

  1. From interceptors-opentelemetry sample: https://github.com/temporalio/samples-typescript/tree/main/interceptors-opentelemetry
  2. Enable OpenTelemetry Auto Instrumentation for JS for winston (https://opentelemetry.io/docs/languages/js/automatic/)
export OTEL_NODE_RESOURCE_DETECTORS="env,host,os,serviceinstance"
export OTEL_SERVICE_NAME="client"
export NODE_OPTIONS="--require @opentelemetry/auto-instrumentations-node/register"
export OTEL_NODE_ENABLED_INSTRUMENTATIONS="winston,http"
  1. Enable custom-logger using https://github.com/temporalio/samples-typescript/tree/main/custom-logger
  2. Log on Workflow
import * as workflow from "@temporalio/workflow"

// A workflow that simply calls an activity
export async function example(name: string): Promise<string> {
  workflow.log.error("hello");
  return await greet(name);
}
  1. Log on Activity
    
    import * as activity from "@temporalio/activity"

export async function greet(name: string): Promise { activity.log.error("alskjdbnkajsdbnkjandsjk") return Hello, ${name}!; }


7. Call client to execute workflow (`npm run workflow`)

### Environment/Versions

<!-- Please complete the following information where relevant. -->

- OS and processor: M2 Mac
- Temporal Version: 

❯ temporal --version temporal version 0.12.0 (server 1.23.0) (ui 2.26.2)

- Are you using Docker or Kubernetes or building Temporal from source? No

### Additional context

- The trace_id and span_id are generated but they doesn't appear on logs and are being reported to newrelic correctly, because they appear on distributed tracing.

From `ConsoleSpanExporter`

{ resource: { attributes: { 'service.name': 'interceptors-sample-worker' } }, traceId: 'd498bbf00762a58fc78c022804644789', <<<<<<<<<< parentId: '38b7d260146c73e6', <<<<<<<< traceState: undefined, name: 'RunWorkflow:example', id: 'aaa81665fd7d1198', kind: 0, timestamp: 1716328235467000, duration: 46000, attributes: { ... { resource: { attributes: { 'service.name': 'interceptors-sample-worker' } }, traceId: 'd498bbf00762a58fc78c022804644789', parentId: 'aaa81665fd7d1198', traceState: undefined, name: 'StartActivity:greet', id: 'aec76af986522d29', kind: 0,



- I think is a problem with `winston` + `opentelemetry` on workflow context only, because it only occurs on Workflows... everywhere else is working.
- I put the code here https://github.com/lucaskatayama/temporal-otel
mjameswh commented 6 months ago

This is not a bug. It just can't work that way.

In the TS SDK, Workflows execute in a sandboxed environment. Among other things, that implies that Workflow Code can't do any I/O, and therefore, can't interact with a regular Node.js logger.

Instead, log messages emitted using the Workflow context logger are funneled through a Workflow Sink to the Runtime logger. Now, Sinks are an advanced feature of the TS SDK, but essentially, that means that messages are queued internally in the Workflow context for the duration of a Workflow Activation; at the end of an Activation, queued Sink Calls are passed from the Workflow Worker Thread, through a V8 MessagePort, back to the main SDK thread, where they get processed (in this case, passing them to the Runtime logger, which in your case would be an instance of Winston's Logger class).

Note that, at the time that the call is actually passed to the Winston's Logger, execution is no longer contained by the OpenTelemetry call scope, and therefore, the OpenTelemetry Auto Instrumentation can't attach these calls to the parent context.

Note also that the @opentelemetry/auto-instrumentations-node plugin can't reach into the Workflow sandbox, or get loaded there, as that plugin would bring in dependencies on various non-deterministic-safe modules.

The solution

What you can do instead is to register a WorkflowOutboundInterceptor on your Workflow Worker that intercepts the getLogAttributes function and injects the trace_id and span_id attributes from there. I think the following should work (untested):

export class OpenTelemetryTraceInLogsWorkflowOutboundInterceptor implements WorkflowOutboundCallsInterceptor {
  public getLogAttributes(
    input: GetLogAttributesInput,
    next: Next<WorkflowOutboundCallsInterceptor, 'getLogAttributes'>
  ): Record<string, unknown> {
    const span = otel.trace.getSpan(otel.context.active());
    const spanContext = span?.spanContext();
    if (spanContext && otel.isSpanContextValid(spanContext)) {
      return next({
        trace_id: spanContext.traceId,
        span_id: spanContext.spanId,
        trace_flags: `0${spanContext.traceFlags.toString(16)}`,
        ...input,
      });
    } else {
      return next(input);
    }
  }
}

I certainly agree that it would make sense to be done out of the box by our @temporalio/interceptors-opentelemetry Workflow Interceptor. I'll therefore keep this ticket open as a feature request.

lucaskatayama commented 6 months ago

This is not a bug. It just can't work that way.

In the TS SDK, Workflows execute in a sandboxed environment. Among other things, that implies that Workflow Code can't do any I/O, and therefore, can't interact with a regular Node.js logger.

Instead, log messages emitted using the Workflow context logger are funneled through a Workflow Sink to the Runtime logger. Now, Sinks are an advanced feature of the TS SDK, but essentially, that means that messages are queued internally in the Workflow context for the duration of a Workflow Activation; at the end of an Activation, queued Sink Calls are passed from the Workflow Worker Thread, through a V8 MessagePort, back to the main SDK thread, where they get processed (in this case, passing them to the Runtime logger, which in your case would be an instance of Winston's Logger class).

Note that, at the time that the call is actually passed to the Winston's Logger, execution is no longer contained by the OpenTelemetry call scope, and therefore, the OpenTelemetry Auto Instrumentation can't attach these calls to the parent context.

Note also that the @opentelemetry/auto-instrumentations-node plugin can't reach into the Workflow sandbox, or get loaded there, as that plugin would bring in dependencies on various non-deterministic-safe modules.

The solution

What you can do instead is to register a WorkflowOutboundInterceptor on your Workflow Worker that intercepts the getLogAttributes function and injects the trace_id and span_id attributes from there. I think the following should work (untested):

export class OpenTelemetryTraceInLogsWorkflowOutboundInterceptor implements WorkflowOutboundCallsInterceptor {
  public getLogAttributes(
    input: GetLogAttributesInput,
    next: Next<WorkflowOutboundCallsInterceptor, 'getLogAttributes'>
  ): Record<string, unknown> {
    const span = otel.trace.getSpan(otel.context.active());
    const spanContext = span?.spanContext();
    if (spanContext && otel.isSpanContextValid(spanContext)) {
      return next({
        input,
        trace_id: spanContext.traceId,
        span_id: spanContext.spanId,
        trace_flags: `0${spanContext.traceFlags.toString(16)}`,
      });
    } else {
      return next(input);
    }
  }
}

I certainly agree that it would make sense to be done out of the box by our @temporalio/interceptors-opentelemetry Workflow Interceptor. I'll therefore keep this ticket open as a feature request.

Yep... That WORKED @mjameswh !

I think that piece of code could be in @temporalio/interceptors-opentelemetry ... Or at least documented ...

Thanks!