Open joe-elliott opened 8 months ago
A PR is up to fix the issue in the OTEL collector: https://github.com/open-telemetry/opentelemetry-collector/pull/8080
This pretty severely affects the error rate and overall SLO for traces. In addition, clients do not retry 500 errors, which will lead to lost data. This should probably be a higher-priority bug.
Linking https://github.com/grafana/tempo/issues/3831, since some rate limits should maybe use a different 4xx status code when retrying the request won't help.
Describe the bug
Currently Tempo returns 500 from the OTLPHTTP endpoint in cases where it should return 429 (for example, a ResourceExhausted rate-limit error), because of the way errors are handled.
If the linked issue is resolved, then ResourceExhausted should correctly return 429s. If it's not resolved, then we would need to implement our own HTTP server to do this correctly.