grafana / tempo

Grafana Tempo is a high volume, minimal dependency distributed tracing backend.
https://grafana.com/oss/tempo/
GNU Affero General Public License v3.0

Use a non-retryable HTTP status code for unrecoverable rate limit errors #3831

Open swar8080 opened 4 months ago

swar8080 commented 4 months ago

Describe the bug

Tempo seems to return 429 status codes for requests that will never succeed on retry, for example when the request payload exceeds the maximum size. Some clients retry on 429, so this can cause in-memory export queues to fill up when all retries are exhausted. Specifically, we are pushing spans to Grafana Cloud Tempo using OTLP gRPC from an OpenTelemetry Collector, which retries on 429s.

Expected behavior

When retrying the same request will always fail, consider using a different status code, such as 413 Payload Too Large.
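To illustrate the distinction being asked for, here is a minimal sketch in Go. It is not Tempo's actual code: `statusForPush` and its parameters are hypothetical names, and treating a batch larger than the limiter's burst as permanently failing is an assumption about how the rate limiter behaves.

```go
package main

import (
	"fmt"
	"net/http"
)

// statusForPush is a hypothetical helper (not part of Tempo) that picks an
// HTTP status for a rejected push request.
//
//	batchBytes  - size of the incoming request
//	maxMsgBytes - maximum accepted message size
//	burstBytes  - assumed largest amount the rate limiter can admit at once
//	rateLimited - whether the rate limiter rejected the request
func statusForPush(batchBytes, maxMsgBytes, burstBytes int, rateLimited bool) int {
	switch {
	case batchBytes > maxMsgBytes:
		// The same payload will always be too large: not retryable.
		return http.StatusRequestEntityTooLarge
	case rateLimited && batchBytes > burstBytes:
		// The batch alone exceeds what the limiter can ever admit: not retryable.
		return http.StatusRequestEntityTooLarge
	case rateLimited:
		// Temporary overload: the same request may succeed later.
		return http.StatusTooManyRequests
	default:
		return http.StatusOK
	}
}

func main() {
	// Payload one byte over the maximum message size.
	fmt.Println(statusForPush(15728641, 15728640, 19999980, false))
	// Batch within the message size but larger than the assumed burst.
	fmt.Println(statusForPush(10000000, 15728640, 5000000, true))
	// Small batch rejected only because of momentary rate limiting.
	fmt.Println(statusForPush(1000, 15728640, 19999980, true))
}
```

With a mapping like this, a client that retries only on 429 would drop the first two requests immediately instead of holding them in its queue until retries run out.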

Environment: OpenTelemetry Collector 0.96.0 -> Grafana Cloud Tempo

Additional Context

These are the errors we hit from the OpenTelemetry Collector:

2024-06-26T20:09:22.513Z error exporterhelper/queue_sender.go:97 Exporting failed. Dropping data. {"kind": "exporter", "data_type": "traces", "name": "otlphttp", "error": "no more retries left: Throttle (0s), error: error exporting items, request to https://otlp-gateway-prod-us-central-0.grafana.net/otlp/v1/traces responded with HTTP Status Code 429, Message=grpc: received message after decompression larger than max (15728641 vs. 15728640), Details=[]", "dropped_items": 14694}

2024-06-26T20:09:40.329Z error exporterhelper/queue_sender.go:97 Exporting failed. Dropping data. {"kind": "exporter", "data_type": "traces", "name": "otlphttp", "error": "no more retries left: Throttle (0s), error: error exporting items, request to https://otlp-gateway-prod-us-central-0.grafana.net/otlp/v1/traces responded with HTTP Status Code 429, Message=RATE_LIMITED: ingestion rate limit (local: 666666 bytes, global: 19999980 bytes) exceeded while adding 25277368 bytes for user 111021, Details=[]", "dropped_items": 18869}

github-actions[bot] commented 2 months ago

This issue has been automatically marked as stale because it has not had any activity in the past 60 days. The next time this stale check runs, the stale label will be removed if there is new activity. The issue will be closed after 15 days if there is no new activity. Please apply keepalive label to exempt this Issue.