open-telemetry / opentelemetry-collector-contrib

Contrib repository for the OpenTelemetry Collector
https://opentelemetry.io
Apache License 2.0
3k stars 2.32k forks source link

[exporter/prometheusremotewrite] Prometheus remote write exporter retry on HTTP 429 #31032

Open Bruno-DaSilva opened 8 months ago

Bruno-DaSilva commented 8 months ago

Component(s)

exporter/prometheusremotewrite

Is your feature request related to a problem? Please describe.

Prometheus remote write exporter does not currently retry on HTTP 429s TooManyRequests. Many remote write destinations implement 429 and it's not great if we're dropping metrics because we're being temporarily rate limited.

For context, the remotewrite spec says:

Prometheus Remote Write compatible senders MUST retry write requests on HTTP 5xx responses and MUST use a backoff algorithm to prevent overwhelming the server. They MUST NOT retry write requests on HTTP 2xx and 4xx responses other than 429. They MAY retry on HTTP 429 responses, which could result in senders "falling behind" if the server cannot keep up. This is done to ensure data is not lost when there are server side errors, and progress is made when there are client side errors.

Describe the solution you'd like

The prometheus reference implementation here retries on 429: https://github.com/prometheus/prometheus/blob/main/storage/remote/client.go#L239-L242 It also uses the 'retry after duration' header to modulate the retry backoff -- not sure how best to handle that in the otel implementation.

The current implementation retries on 500, so I figure we could just add a 429 check to this same if statement, at minimum. https://github.com/open-telemetry/opentelemetry-collector-contrib/blob/main/exporter/prometheusremotewriteexporter/exporter.go#L270-L273

Describe alternatives you've considered

No response

Additional context

No response

github-actions[bot] commented 8 months ago

Pinging code owners:

See Adding Labels via Comments if you do not have permissions to add labels yourself.

Bruno-DaSilva commented 8 months ago

xref https://github.com/open-telemetry/opentelemetry-collector-contrib/issues/20304 where 5xx retrying was implemented.

crobert-1 commented 8 months ago

From the spec this looks like a valid request to me. The code owners may have more thoughts though as they have more context than I do.

jmichalek132 commented 6 months ago

Hi, could I get this assigned please, working a draft.

crobert-1 commented 6 months ago

Thanks @jmichalek132!

jmichalek132 commented 5 months ago

Related, open issue on the backoff library to implement support for retry after https://github.com/cenkalti/backoff/issues/134.

github-actions[bot] commented 3 months ago

This issue has been inactive for 60 days. It will be closed in 60 days if there is no activity. To ping code owners by adding a component label, see Adding Labels via Comments, or if you are unsure of which component this issue relates to, please ping @open-telemetry/collector-contrib-triagers. If this issue is still relevant, please ping the code owners or leave a comment explaining why it is still relevant. Otherwise, please close it.

Pinging code owners:

See Adding Labels via Comments if you do not have permissions to add labels yourself.

jmichalek132 commented 1 month ago

not stale