Open jlegrone opened 2 years ago
The problem here is that the interceptor only intercepts literal calls to activity.GetLogger and workflow.GetLogger, not every logger use. You can provide your own logger at the client level to intercept logs originating outside those explicit user calls.
But you may need the context. We can do what we did with data converters. Any data converter that implements the following interface:
type ContextAware interface {
	WithWorkflowContext(ctx Context) converter.DataConverter
	WithContext(ctx context.Context) converter.DataConverter
}
Will have those functions called before the converter is used. So maybe we can make a:
type ContextAwareLogger interface {
	WithWorkflowContext(ctx Context) log.Logger
	WithContext(ctx context.Context) log.Logger
}
And invoke those, when the logger implements them, everywhere we have a context. The overhead should be negligible. Thoughts?
A ContextAwareLogger interface sounds like it could be useful... but when the SDK already has a workflow or activity context available to pass into the client's default logger, why not call workflow.GetLogger or activity.GetLogger instead so that the interceptor chain is invoked as well?
My assumption was that GetLogger was invoked only as a result of a literal GetLogger call, not before every logger use to make sure it has the latest context info. However, it might make sense to do just that, as long as we are clear that, unlike other outbound interceptors, this one is not 1:1 with user calls: it is invoked more frequently, so implementations need to remain performant.
After some thought, I am wary of repeatedly calling GetLogger internally, rather than keeping its original purpose of only being invoked when the user invokes it. I am worried about going from people understanding that it is called once per user call to dozens of internal calls, since the context can change many times throughout the life of a workflow/activity and one may want such context updates before each log statement.
I am back to considering the ContextAwareLogger approach.
I'd love to revisit this issue. We're looking for ways to assess the scope of impact for incidents caused by workflow code panics and other non-versioned or non-deterministic changes.
A common failure mode we've observed is that a small number of top-level requests will generate a large number of error logs from the Go SDK across a broad set of child workflows. We're looking for a way to quickly identify how many root/top-level parent workflows are impacted by aggregating across this set of logs, but are limited by the fact that we have no way of injecting custom request-scoped attributes into Go SDK logs.
For example, aggregating workflow error logs by dd.trace_id would allow us to identify how many top-level requests are impacted and help operators make more informed decisions about whether these top-level workflows should be terminated vs. rolling out a new worker version.
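As a rough illustration of that aggregation, the sketch below groups error logs by trace ID to count affected top-level requests. The logEntry type and its fields are hypothetical; in practice this grouping would be a query on the observability platform rather than Go code.

```go
package main

import "fmt"

// logEntry is a hypothetical, simplified view of one structured SDK error log.
type logEntry struct {
	WorkflowID string
	TraceID    string // value of dd.trace_id; empty when the field is missing
}

// countByTrace returns the number of error logs attributed to each trace ID.
// Entries without a trace ID cannot be tied to a request and are skipped.
func countByTrace(entries []logEntry) map[string]int {
	counts := make(map[string]int)
	for _, e := range entries {
		if e.TraceID == "" {
			continue
		}
		counts[e.TraceID]++
	}
	return counts
}

func main() {
	entries := []logEntry{
		{WorkflowID: "child-1", TraceID: "t1"},
		{WorkflowID: "child-2", TraceID: "t1"},
		{WorkflowID: "child-3", TraceID: "t2"},
		{WorkflowID: "child-4"}, // SDK log without the injected field
	}
	counts := countByTrace(entries)
	fmt.Println("top-level requests affected:", len(counts))
}
```

The last entry shows the current limitation: without the injected field, an SDK log cannot be attributed to any top-level request.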
Is your feature request related to a problem? Please describe.
Interceptors are already able to modify log fields emitted by workflow and activity code. But any lower-level logs emitted by the Go SDK don't respect the logger returned by the interceptor chain. These include debug level logs like "ExecuteActivity" as well as warning or error level logs like "Task processing failed with error".
Using the logger returned by the interceptor chain would allow users to filter all logs related to their workflow executions by domain-specific fields, create more sophisticated monitors or metric generation pipelines, and correlate logs to traces on observability platforms where that is supported.
Describe the solution you'd like
Evaluate the worker's interceptor chain to get a workflow or activity logger whenever a low-level log needs to be emitted by the Go SDK.
Describe alternatives you've considered
It might be possible to at least handle all debug level logs like "ExecuteActivity" via a regular interceptor that is registered by default.
It would be interesting to look into implementing low level logs like "Task processing failed with error" via an interceptor as well, perhaps by extending the workflow/activity interceptor interfaces to include a method for recording SDK errors. This would bring some additional benefits since interceptors would not only be able to log those errors, but emit custom metrics or error trace spans as well.
Additional context
I wrote an e2e test to demonstrate. After running the test, observe that the trace_id log field is not included in the "ExecuteActivity" log line.