microsoft / durabletask-go

The Durable Task Framework is a lightweight, embeddable engine for writing durable, fault-tolerant business logic (orchestrations) as ordinary code.
Apache License 2.0
178 stars 25 forks source link

Support for injection of otel spans into tasks for enhancement #31

Open cchanley2003 opened 9 months ago

cchanley2003 commented 9 months ago

Dapr workflows looks like it uses durable tasks. In order to support this request: https://github.com/dapr/dapr/issues/6906 it looks like we would need to support the ability to inject spans into tasks to allow client to add onto spans and add attributes to existing spans.

First wanted to confirm that what I being requested isn't currently supported and also confirm that you would accept a PR for this enhancement?

Initial though was to put span into: https://github.com/microsoft/durabletask-python/blob/4046191d28a6f0a7779d829e5aec88a06493f3d8/durabletask/task.py#L339 to allow clients to enhance spans...

Maybe this is not the right repo to start this process. If so can somebody point me in the right direction?

cgillum commented 8 months ago

Hi @cchanley2003. While I think this request may require work in other repos as well (i.e., the SDK repos), this is a good repo for getting things going, especially since your request relates to Dapr.

I responded just now to https://github.com/dapr/dapr/issues/6906. Please take a look at that when you get a chance.

I'm interpreting your request to be surfacing OTel span information into the activity task so that you're able to correlate your own custom logs or attributes with the end-to-end distributed trace. Currently, each SDK should have access to the parent trace context, but I don't think we're currently making that context available to the user code. If I understand the ask correctly, I think this is definitely something we can do.

cchanley2003 commented 8 months ago

Yes. In an ideal world we surface the span at the WorkflowActivityContext in dapr, which implies adding it to the ActivityContext in durable tasks.

So if I have a activity function definition:

def model_properties(ctx: WorkflowActivityContext, input):

From the ctx I could access the otel span. It looks like some work has been started in this direction with the proto buf definitions to add trace info, so the work seems like it would just be getting that added to the streams... but that was my initial glance.

cchanley2003 commented 8 months ago

So I have been looking more into this. One thing I have noticed is that the go version points to a distributed tracing branch of the proto-bufs file. The python sdk points to the main branch. It looks like those version are incompatible. For instance:

message ExecutionStartedEvent {
...
 TraceContext parentTraceContext = 7;
 ..

in the branch. vs

google.protobuf.StringValue correlationData = 7;

in the main which is what the python sdk points at. Maybe this is a better discussion for the python sdk project? I want to leverage the trace context as a jumping of point.

Also is this the best place to have this discussion? Let me know.

cgillum commented 8 months ago

Oh, interesting. I'll need to take a closer look at this to figure out what happened. I believe that the top version is the "correct" one, but I'm not sure how for things have gotten out of sync.

cgillum commented 8 months ago

@cchanley2003 thanks for pointing out this misalignment with the protobufs. I went ahead and corrected this as part of https://github.com/microsoft/durabletask-go/pull/43. Unfortunately, it's unlikely that the Dapr runtime will be updated with this new build until 1.13 (unless somehow we managed to bundle it in as part of a hotfix).

cchanley2003 commented 7 months ago

Thanks for updating that. Looks like things are in sync now. Without a reference implementation I am a little confused about the intent of the parentTrace in the ExecutionStartedEvent. Can you comment on the design intent there? Here was my initial thought of an implementation but I want to be sure I am swimming upstream. I am going to focus on the Activity use case. I imagine something similar for orchestration.

  1. Backend provides a parentTraceContext as part of ActivityRequest
  2. client sdk executor receives ActivityRequest and starts a new span off of parentTrace. It injects span into ActivityContext
  3. Client code uses span to do what it will (sub spans, attributes)
  4. client sdk activity executor closes span

This implies that the client needs to have the ability to wire in a backend for otel, which is doesn't exist atm.

As I mentioned before in Discord I am willing to submit PRs to make this happen. I am also willing to do this across all the sdks, but I want to be sure we are in accord on design. None of what I have mentioned uses the ExecutionStarted events so I feel I might be off base on approach.

Thanks for your time and appreciate any feedback.

cgillum commented 7 months ago
  1. Backend provides a parentTraceContext as part of ActivityRequest

I haven't yet done the investigation to know if this step is required, but it might be. Alternatively, it's possible that the parent trace context is already flowing as part of the gRPC request since (I think) the context we use to send the gRPC request already has some trace context information associated with it. This part needs testing/experimentation.

The rest of the steps look correct to me.

Adding @kaibocai who has also expressed interest in working on this problem but let us know @cchanley2003 if you'd prefer to contribute it. :)

kaibocai commented 7 months ago

Hi @cchanley2003 ,

I think you are right on ActivityRequest it needs to have parentTraceContext in order to provide the ability for activities to add their spans. Currently, the code has parentTraceContext ExecutionStartedEvent, I think that's because we want to provide the same ability for workflows. But definitely, parentTraceContext needs to be added to ActivityRequest as well.

kaibocai commented 7 months ago

Regards to context propagation through Grpc, the recommend way from OTEL is to have trace context encoded into the metadata of Grpc message instead of the message itself.

OTEL already has SDK that helps to simplify context propagation through Grpc, still researching in this direction. Java instrumentation library: https://github.com/open-telemetry/opentelemetry-java-instrumentation/tree/main/instrumentation/grpc-1.6/library#library-instrumentation-for-grpc-160 Go should have similar support, will continue researching.

@cgillum

cchanley2003 commented 7 months ago

Thanks for the replies @kaibocai. So yeah I saw the ExecutionStartedEvent. My focus is on the the python SDK (though I am willing to do fixes in other sdks) and it didn't look like that Event was used at all so I was uncertain on where it fit in with propagating the traces (or even the intent there). So it would seem that the first step would be the add the traceContext to the ActivityRequest in the .protos? I can (or somebody) start with a PR for that?

kaibocai commented 7 months ago

Hi @cchanley2003 , I am new to OTEL, so maybe I am wrong about the following statement. Based on my research, I found the suggested way to propagate context through grpc is not to encode the trace context through the Grpc message, instead it's recommended to propagate through the metadata of grpc message, and it seems OTEL already has libraries helping to do this work.

For instance, the Python library can be found here https://github.com/open-telemetry/opentelemetry-python-contrib/tree/main/instrumentation/opentelemetry-instrumentation-grpc, have you ever considered this direction, so that no updates on the protobuf is needed. I believe it will simplify the implementation.

Also, it would be great if you could help with Python SDK for this support I can help as well! Thank you!

cchanley2003 commented 7 months ago

Yes, that makes sense. Just saw your earlier post. So no need to update the proto.

kaibocai commented 6 months ago

I did more research and tested a bit on my local, it seems the instrumentation libraries cannot meet our requirements here. The instrumentation libraries will simplify tracing the Grpc call, but it doesn't provide an easy way to pass through trace context (no easy access to the trace context inert to the grpc call metadata). So I think the original direction (encode trace context into Grpc message) is more suitable here.

shivamkm07 commented 5 months ago

I did more research and tested a bit on my local, it seems the instrumentation libraries cannot meet our requirements here. The instrumentation libraries will simplify tracing the Grpc call, but it doesn't provide an easy way to pass through trace context (no easy access to the trace context inert to the grpc call metadata). So I think the original direction (encode trace context into Grpc message) is more suitable here.

Agree @kaibocai. I looked into it as well. We can easily propagate context for a normal grpc request(non-streaming) using otel interceptors(e.g. here) but for streaming requests, context is propagated at the start only i.e. when the stream is created. Once the stream creation is done, then since no context is passed during stream.send() method, and because we can't update stream context, the newer context(updated after stream creation) is never passed to client/server.

Here in DTF-Go as well at the time of stream creation, there is nothing in context(since no orchestration/acitivity are created yet). We have the updated context(with all span info) when activity/orchestration workitem needs to be sent to the app but this context is not passed while sending the workItem. So best way to do it is to encode trace context in the message itself.

pralonga commented 3 months ago

Any update on the issue and the related PRs ? Having the parent span in the client activities would be nice.

I am using dapr-workflow with the dotnet SDK, but most of the work done in the PRs should resolve the issue there too.