temporalio / features

Behavior and history compatibility testing for Temporal SDKs
14 stars 17 forks source link

[Bug] Incomplete Span Reporting: Only RunActivity Spans Sent When Workflow Triggered via scheduler #422

Open alonwengierko opened 9 months ago

alonwengierko commented 9 months ago

What are you really trying to do?

I'm aiming to establish comprehensive tracing observability for temporal workflows initiated via scheduling.

Describe the bug

When the worfklow is triggered using create_schedule, only RunActivity spans are captured. However when it is triggered manually using start_workflow or execute_workflow, all spans are captured.

Tracing capture using jaeger for both manual and scheduled workflow executions. It shows workflow triggered manually with 5 spans, and the workflows triggered using schedule with 1 span.

Screenshot 2024-02-22 at 10 20 45

The screenshot below show that the only span capture for the scheduled workflows is RunActivity

Screenshot 2024-02-22 at 10 19 02

However for the workflows triggered manually using starter.py, we can see 5 spans captured. It is seen 5 spans: StartWorkflow, RunWorkflow, StartActivity, RunActivity and CompleteWorkflow

Screenshot 2024-02-22 at 10 18 54

Minimal Reproduction

  1. Clone project https://github.com/temporalio/samples-python/tree/main/open_telemetry
  2. Run starter.py (execute workflow at demand)
  3. Run scheduler.py (execute workflow by scheduler) zip file attached to this issue
  4. Observe the difference of captured spans. The spans generated by the workflow triggered manually is 5, while for the workflow triggered using create_schedule is 1. It should be the same for both.

Environment/Versions

Additional context

scheduler.py.zip

cretz commented 9 months ago

Is this a duplicate of #394?

alonwengierko commented 9 months ago

Is this a duplicate of #394?

@cretz, it appears there's a distinction between the issues. The bug highlighted in this particular matter concerns the omission of certain spans when a workflow is executed solely by Schedule. Notably, spans such as StartActivity, StartWorkflow, RunWorkflow, StartChildWorkflows, among others, are not being captured, with only the RunActivity span being recorded. On the other hand, #394 suggests that the Schedule creation should indeed generate a tracing span.

cretz commented 9 months ago

I see. There is some overlap because the overarching issue is that the workflows need an outer span from the client side to have a parent (whether that's schedule create or workflow start). The question becomes, how do you do that? You don't have an interceptor to apply per client-workflow-start for a schedule, it's started server side, and one large outer span for a schedule would get huge and live forever.

Simply put, spans are applied SDK side, and there is no SDK side interception for starting workflows from a schedule. The server does not start spans for a user. Open to suggestions here.

alonwengierko commented 9 months ago

I see. There is some overlap because the overarching issue is that the workflows need an outer span from the client side to have a parent (whether that's schedule create or workflow start). The question becomes, how do you do that? You don't have an interceptor to apply per client-workflow-start for a schedule, it's started server side, and one large outer span for a schedule would get huge and live forever.

Simply put, spans are applied SDK side, and there is no SDK side interception for starting workflows from a schedule. The server does not start spans for a user. Open to suggestions here.

@cretz I believe the outer span should represent the workflow start. I completely agree with your point that having the schedule creation as the outer span, with all workflows starting inside, would be not be the best. While I might not have all the intricate details, I firmly believe it should function similarly to when 'start_workflow' is invoked. In my perspective, when the schedule triggers a workflow execution, it should initiate a trace beginning from the workflow start. I'm uncertain why, currently, only 'RunActivity' spans are being generated.

cretz commented 9 months ago

The problem is that for schedules the server invokes start workflow and not your client. Therefore the server can't possibly know what OTel server you've setup with your SDK to send spans to (or that you're not using OpenTracing or DataDog or whatever other tracer). But we are considering options on possibly creating an outer span when the workflow is first run (knowing that it may have duplicates in rare initial-task-failed situations). We will leave this open while we figure out a solution here.

LuckySting commented 6 days ago

We fixed that problem using our implementation of TracingWorkflowInboundInterceptor. We generate a trace_id based on WorkflowRunId and set this context globally. Here is an implementation of our trace generator:

class IdempotentIdGenerator(IdGenerator):
    """TID generator for TracerProvider which repeatedly generates IDs based on idempotency_key."""

    def __init__(self, idempotency_key: str) -> None:
        self._rand = random.Random(idempotency_key)  # noqa: S311

    def generate_span_id(self) -> int:
        span_id = self._rand.getrandbits(64)
        while span_id == trace.INVALID_SPAN_ID:
            span_id = self._rand.getrandbits(64)
        return span_id

    def generate_trace_id(self) -> int:
        trace_id = self._rand.getrandbits(128)
        while trace_id == trace.INVALID_TRACE_ID:
            trace_id = self._rand.getrandbits(128)
        return trace_id