open-telemetry / opentelemetry-configuration

JSON Schema definitions for OpenTelemetry file configuration
Apache License 2.0
41 stars 17 forks source link

add exporter retry configuration #97

Open brettmc opened 5 months ago

brettmc commented 5 months ago

https://opentelemetry.io/docs/specs/otlp/#otlphttp-throttling and https://opentelemetry.io/docs/specs/otlp/#otlpgrpc-throttling describe how a client SHOULD implement an exponential backoff strategy (with jitter) in case of retryable export failures. The inputs to this strategy are usually the initial delay, and max attempts. Also added to zipkin exporter, since it could also be implemented there.

brettmc commented 4 months ago

I wonder how consistent the implementations of this retry are

It looks like not at all consistent, and I just found https://github.com/open-telemetry/opentelemetry-specification/issues/3639 which describes it quite well:

The implementation of the exponential backoff strategy is not specified

It seems like the root of the issue then is that we cannot configure retry consistently across SIGs if we haven't agreed on the inputs. I can think of a couple of possible next steps:

svrnm commented 4 months ago

go back to the spec and try to come up with a specification for how retry should be configured, which everybody can agree on

This sounds like the best path forward, with the other option being the fallback. @brettmc would you mind raising the spec issue, with

FYI, this is something related to https://github.com/open-telemetry/opentelemetry-specification/issues/4083

jack-berg commented 4 months ago

To add to @codeboten analysis of existing implementations, here is java's set of options

These options mirror gRPC's built in retry mechanism, since gRPC was very influential in the development of OTLP and it seemed reasonable that OTLP clients using the gRPC version of the protocol would utilize the gRPC clients' built in mechanism.

One concrete thing we can / should do while we sort out how these options are configurable across languages is simply give the ability to enable / disable retry. Regardless of the options, every langauge should support some retry mechanism since the spec requires it in clear terms:

Transient errors MUST be handled with a retry strategy. This retry strategy MUST implement an exponential back-off with jitter to avoid overwhelming the destination until the network is restored or the destination has recovered.

We could do this by providing a top level retry_enabled property for the OTLP exporter type:

otlp:
  endpoint: http://localhost:4318
  retry_enabled: true
  # add in the future
  # retry_options:

Later when we add specific options we could add a separate retry_options type.

Or we could define a new retry type, with only a enabled property to start, with the intent to expand later when we decide on the options:

otlp:
  endpoint: http://localhost:4318
  retry:
    enabled: true
    # add additional options in the future
brettmc commented 4 months ago

Or we could define a new retry type, with only a enabled property to start, with the intent to expand later when we decide on the options

I like this one, combining all retry-related options into one type which we can expand on.

jack-berg commented 4 months ago

I like this one, combining all retry-related options into one type which we can expand on.

👍 This sounds good to me. Do you have interest in updating this PR to that effect? Or did you want to wait until the specific options are worked out at the spec lever (which could be some time).

brettmc commented 4 months ago

Do you have interest in updating this PR to that effect?

Updated to only implement retry.enabled (boolean, default true). The rest of the configuration will need to wait until there is agreement on what inputs are required.

jack-berg commented 3 months ago

@brettmc it doesn't seem like there's consensus to add the corresponding option in the spec (spec PR #4148). If you know of use cases where PHP users want to disable retry, please comment on that PR.

Else, it seems like we are blocked on this until specification issues are resolved.