open-telemetry / opentelemetry-specification

Specifications for OpenTelemetry
https://opentelemetry.io
Apache License 2.0

Retryable HTTP Statuses should be configurable in OTLP clients #3876

Open haus opened 6 months ago

haus commented 6 months ago

The OTLP spec lists 502, 503 and 504 as the only retryable 5xx status codes. However, some servers (and some CDNs) return a 500 as a generic "something went wrong serving that request", even though it isn't the most appropriate status code. For cases where the remote server is known to return a 500 for retryable conditions, it would be useful if the set of retryable HTTP statuses could be extended or configured to include it. That would help prevent data from being lost under these conditions.
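To make the request concrete, here is a minimal sketch of a retry check with an extensible status set. The default set mirrors the retryable codes the OTLP/HTTP spec lists (429 alongside 502, 503 and 504). The names `DEFAULT_RETRYABLE_STATUS_CODES`, `is_retryable` and `extra_codes` are hypothetical and do not come from any existing SDK.

```python
from typing import Iterable

# Retryable HTTP status codes listed by the OTLP/HTTP spec.
DEFAULT_RETRYABLE_STATUS_CODES = frozenset({429, 502, 503, 504})


def is_retryable(status_code: int, extra_codes: Iterable[int] = ()) -> bool:
    """Return True if an export that got this HTTP status should be retried.

    `extra_codes` is a hypothetical user-supplied extension, e.g. [500]
    for a server or CDN known to return 500 on transient failures.
    """
    return (
        status_code in DEFAULT_RETRYABLE_STATUS_CODES
        or status_code in set(extra_codes)
    )
```

With the defaults, `is_retryable(500)` is false and the batch is dropped. With `is_retryable(500, extra_codes=[500])` it would be retried instead.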

jack-berg commented 6 months ago

This seems like a good thing to do at first glance, since an OTLP receiver doesn't always have control of the HTTP status codes returned. However, in practice, solving this issue will be tricky because, as of today, there is no specification for common retry configuration options. It appears that solving this issue would need to be part of a larger effort to normalize OTLP retry configuration across SDKs. Marking this as "triage accepted", but whoever takes this on should be conscious of the bigger picture.

Some related issues: #3314, #3639, #1742, #1528

jack-berg commented 2 weeks ago

We discussed issues with the OTLP retry spec broadly in the 8/7/24 and 8/14/24 TC meetings. I wrote a document summarizing a number of somewhat overlapping OTLP retry issues and sketching out some proposals on how to fix them.

For this specific issue, there was apparent consensus that retryable status codes should be configurable. In my previous comment I mentioned that this would be tricky because there is no specification for common retry configuration options, and that's true, but we should try to separate the issues. The lack of specification around the retry exponential backoff algorithm has led to diverging stable implementations, as outlined in #4138, but all implementations should be roughly aligned on the set of status codes which are retryable. It seems plausible, and maybe even straightforward, to introduce an option in the OTLP SDK exporter specification which makes the retryable status codes configurable.
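As one possible shape for such an option, here is a rough sketch that reads the retryable status codes from a comma-separated setting. Everything in it is an assumption for illustration: the variable name `EXAMPLE_OTLP_RETRYABLE_STATUS_CODES` is made up and is not part of any specification, and the choice to have the configured list replace the defaults (rather than extend them) is only one of the options a spec change would need to decide.

```python
import os
from typing import FrozenSet, Mapping

# Retryable HTTP status codes listed by the OTLP/HTTP spec.
DEFAULT_RETRYABLE_STATUS_CODES = frozenset({429, 502, 503, 504})

# Hypothetical setting name, invented for this sketch only.
ENV_VAR = "EXAMPLE_OTLP_RETRYABLE_STATUS_CODES"


def load_retryable_status_codes(
    environ: Mapping[str, str] = os.environ,
) -> FrozenSet[int]:
    """Parse a comma-separated list of HTTP status codes.

    Falls back to the defaults when the setting is unset or empty.
    Assumption: a configured list replaces the defaults entirely.
    """
    raw = environ.get(ENV_VAR, "").strip()
    if not raw:
        return DEFAULT_RETRYABLE_STATUS_CODES

    codes = set()
    for token in raw.split(","):
        token = token.strip()
        if not token:
            continue  # tolerate stray commas such as "429,,503"
        code = int(token)  # raises ValueError for non-numeric input
        if not 100 <= code <= 599:
            raise ValueError(f"not an HTTP status code: {code}")
        codes.add(code)
    return frozenset(codes)
```

For example, setting the variable to `429,500,502,503,504` would add 500 to the set the exporter treats as retryable, while leaving it unset keeps today's behavior.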