Open brettmc opened 5 months ago
I wonder how consistent the implementations of this retry are
It looks like not at all consistent, and I just found https://github.com/open-telemetry/opentelemetry-specification/issues/3639 which describes it quite well:
The implementation of the exponential backoff strategy is not specified
It seems like the root of the issue then is that we cannot configure retry consistently across SIGs if we haven't agreed on the inputs. I can think of a couple of possible next steps:
go back to the spec and try to come up with a specification for how retry should be configured, which everybody can agree on
This sounds like the best path forward, with the other option being the fallback. @brettmc would you mind raising the spec issue, with
FYI, this is something related to https://github.com/open-telemetry/opentelemetry-specification/issues/4083
To add to @codeboten analysis of existing implementations, here is java's set of options
These options mirror gRPC's built in retry mechanism, since gRPC was very influential in the development of OTLP and it seemed reasonable that OTLP clients using the gRPC version of the protocol would utilize the gRPC clients' built in mechanism.
One concrete thing we can / should do while we sort out how these options are configurable across languages is simply give the ability to enable / disable retry. Regardless of the options, every langauge should support some retry mechanism since the spec requires it in clear terms:
Transient errors MUST be handled with a retry strategy. This retry strategy MUST implement an exponential back-off with jitter to avoid overwhelming the destination until the network is restored or the destination has recovered.
We could do this by providing a top level retry_enabled
property for the OTLP exporter type:
otlp:
endpoint: http://localhost:4318
retry_enabled: true
# add in the future
# retry_options:
Later when we add specific options we could add a separate retry_options
type.
Or we could define a new retry
type, with only a enabled
property to start, with the intent to expand later when we decide on the options:
otlp:
endpoint: http://localhost:4318
retry:
enabled: true
# add additional options in the future
Or we could define a new retry type, with only a enabled property to start, with the intent to expand later when we decide on the options
I like this one, combining all retry-related options into one type which we can expand on.
I like this one, combining all retry-related options into one type which we can expand on.
👍 This sounds good to me. Do you have interest in updating this PR to that effect? Or did you want to wait until the specific options are worked out at the spec lever (which could be some time).
Do you have interest in updating this PR to that effect?
Updated to only implement retry.enabled (boolean, default true). The rest of the configuration will need to wait until there is agreement on what inputs are required.
@brettmc it doesn't seem like there's consensus to add the corresponding option in the spec (spec PR #4148). If you know of use cases where PHP users want to disable retry, please comment on that PR.
Else, it seems like we are blocked on this until specification issues are resolved.
https://opentelemetry.io/docs/specs/otlp/#otlphttp-throttling and https://opentelemetry.io/docs/specs/otlp/#otlpgrpc-throttling describe how a client SHOULD implement an exponential backoff strategy (with jitter) in case of retryable export failures. The inputs to this strategy are usually the initial delay, and max attempts. Also added to zipkin exporter, since it could also be implemented there.