Clarify and improve the OTLP exporter requirements of a retry strategy

alanwest commented 1 year ago

The OTLP exporter specification lacks clarity with regards to its retry strategy. I have reviewed a number of implementations across languages as well as the collector. From what I have seen, implementations deviate in various ways from one another.

The following statement is about as clear as the spec gets:

Transient errors MUST be handled with a retry strategy.

Beyond this, the spec descends into weak statements that lend themselves to differing interpretations and it even contradicts itself in a number of spots. Below is a list of areas where I believe improvements are necessary.

The retry strategy of the OTLP exporter is not specified

This has posed a number of issues:

SDKs differ in their implementation of their retry strategy.
Different vendors have expressed varying requirements with regards to retry.

I work for a vendor and the first point concerns me the most. When speaking with my end users, I cannot be confident that when my company's OTLP endpoint signals a request can be retried that this signal will be handled uniformly irrespective of language SDK used.

The second point is primarily what has blocked the .NET SIG from implementing any retry strategy at all. It may be necessary for multiple strategies to be specified allowing end users to opt in to the strategy that best meets their needs.

Transient error needs to be clearly defined

The spec lacks a clear definition for what is considered a transient error and therefore what MUST be retried by the OTLP exporter.

Response codes

The spec suggests that transient errors are defined, in part, by the response codes returned by the server.

Specifically with regards to HTTP status codes the exporter spec contradicts the protocol spec.

For example, my conclusion from the following statement is that an exporter MUST retry HTTP 408 and all 5xx response codes.

For OTLP/HTTP, the errors 408 (Request Timeout) and 5xx (Server Errors) are defined as transient

However, I have doubts this interpretation was intended as the statement is contradicted in at least two ways by the protocol spec which:

Contains a table of a very limited set of retryable 4xx/5xx HTTP response codes.
Downgrades the normative language defining a client's behavior from a MUST retry to a SHOULD retry.

Beyond response codes

The spec for OTLP/HTTP contains two more scenarios where a client should (must?) retry. It is unclear whether these scenarios are considered transient errors. If they are considered transient errors, then it seems these statements need to be MUST rather than SHOULD.

The spec for OTLP/gRPC does not contain similar language. Should it?

Exponential backoff w/ jitter

There are multiple statements in the spec that indicate a client needs to employ an exponential backoff w/ jitter when retrying. I have two observations with regards to this requirement.

The requirement from the exporter spec is stronger than that of the protocol spec

For example, in the following two statements, one uses MUST and the other SHOULD.

Is it intentional that the exporter spec has a stronger requirement than the protocol spec?

The implementation of the exponential backoff strategy is not specified

Perhaps leaving the implementation unspecified was intentional. However, for the benefit of consistency across SDKs, I think it would beneficial to specify an implementation. I do not think this needs to be complicated. The gRPC retry spec offers some simple prior art.

When should the OTLP exporter stop retrying?

It is not clearly specified if and when the OTLP exporter should give up retrying.

It might be assumed that the exporter should continue to retry up until its configured timeout. However, if this is the case, then the spec needs to indicate this.

Another option could be to specify a maximum number of retry attempts. The Java implementation has done this, for example.

alanwest commented 1 year ago

There is a fair amount to unpack here, and I believe this issue can be handled iteratively. I am happy step in and drive the improvements, however I plan to attend the next specification meeting to get a gauge for whether the community shares the concerns I present here.

jack-berg commented 1 year ago

Thanks for summarizing Alan. I'm in favor of fixing these inconsistencies and will help where I can.

Given that the protocol document is stable and also contains apparent contradictions, I'm wondering what our options are. For example, if the spec says "MUST do foo" and "MUST do bar", and foo and bar are mutually exclusive statements, then any change could be interpreted as a breaking change. But if foo and bar truly cannot coexist, then I think we have no choice but to break the stalemate and choose one as correct.

Maybe we can do an exercise to analyze the history of the relevant documents, and summarize the spirit of the spec with respect to retries in plain terms. Then, assuming we can all agree on the spirit of spec, we can rip out all the statements and create a new section dedicated to retries which fixes all the issues. In the case of contradictions, we can only choose one of the two statements, and we can eliminate the other as a bug fix so that its not a breaking change. It clearly was nobody's intention to create a specification which contradicts itself - this case just seems like an accidental consequence of developing the document iteratively by several authors.

brettmc commented 4 months ago

As a SIG maintainer, I'm also interested in the "retry implementation is not specified" section here. Various SDKs have implemented an exponential backoff retry, but in different ways.

We have users asking how they can configure retry policy (eg https://github.com/open-telemetry/opentelemetry-php/issues/1317). We could do it using language-specific env vars (OTEL_PHP_xxx for us), but that doesn't help with file-based configuration, which is meant to work with all languages.

jack-berg commented 3 months ago

We discussed issues with the OTLP retry spec broadly in the 8/7/14 and 8/14/24 TC meetings. I wrote a document summarizing a number of somewhat overlapping OTLP retry issues, and sketching out some proposals on how to fix them.

With respect to this specific issue, there was consensus that we should clean up all spec contradictions, independent of progress on other retry related issues.

Its worth noting that some of the things described in the issue are disagreements between the OTLP SDK exporter specification and the OTLP specification. It should be possible for the OTLP SDK exporter spec to extend language in the OTLP specification with stronger normative language, but it should not be possible for the OTLP SDK exporter spec to have weaker normative language that the OTLP specification. For example, its acceptable for the OTLP specification to state "transient errors SHOULD be retried", and for the OTLP SDK specification to strengthen that to "transient errors MUST be retried". Still, we should evaluate each of these disagreements and decide if they are allowed and desirable.

Breaking down the different points in the issue:

The retry strategy of the OTLP exporter is not specified There are issues for additional flexibility around retry (#3645, #3876) and around clarifying the specifics of the exponential backoff algorithm (#4138). I think we should use this issue to focus specification around the spec inconsistencies / contradictions.
Transient error needs to be clearly defined Let's fix this with a clear, consistent set of the retryable status codes, and a clear consistent definition of whether various types of connection issues are considered transient retryable errors.
Exponential backoff w/ jitter As discussed above, it may be ok / desirable for the OTLP SDK exporter spec to have stronger normative language than the OTLP specification. Let's use #4138 to track improvements around the specification of the algorithm.
When should the OTLP exporter stop retrying? We have #2346 requesting clarification on this specific issue.

luqasz commented 2 months ago

I've encountered timeouts in AWS lambda. When exporter can't reach collector, lambda will go into infinite retries (even when function has completed) causing it to timeout and pay for wasted time. IMO retry policy should be configurable. E.G. stop retries after X retries.

open-telemetry / opentelemetry-specification