Closed KarstenSchnitter closed 3 days ago
@dlvenable please let me know, if you consider the requests as retryable in the described scenarios. If so, I can try to provide a fix, that adds the missing RetryInfo
.
@KarstenSchnitter , For both a full buffer and the circuit breaker, we can tell the client that the request can be retried.
What I'm unsure of is what time we can give back to the client. One option would be to use the request timeout. Perhaps there is something we can get from the buffer to hint at the time. But, I'm not sure.
To my understanding, to provide a RetryInfo
, the GrpcRequestExceptionHandler needs to implement GoogleGrpcExceptionHandlerFunction. When providing com.google.rpc.Status
the RetryInfo
can be added as Details
. I am unsure, if there is an alternative approach using armeria response headers.
For the retry delay I would suggest not to try determining buffer or circuit breaker stats. There is the option to keep some state in the exception handler to remember the last time, a RetryInfo
was generated. When the next request fails during the returned delay period + some grace time, I suggest to double the delay reported in the next response. This will generate an exponential back-off. If the delay reaches some threshold, the RetryInfo
could be dropped entirely. In that case a metric for "rejected" events might be helpful.
@KarstenSchnitter , That approach sounds reasonable. And we could allow for configuration of the initial retry and the maximum.
Is your feature request related to a problem? Please describe. Ingestion of events via Opentelemetry over GRPC can fail due to a full buffer or an open CircuitBreaker. In that case, an error response is sent. When testing with the OpenTelemetry Collector, this leads to dropped data by the collector. Reason for this is, that apparently the OTLP/gRPC Throttling specification is not used properly. The DataPrepper gRPC response needs to contain a proper
RetryInfo
.Describe the solution you'd like DataPrepper should implement proper OTLP/gRPC throttling, when the ingress buffers are full or the circuit breaker is open.
Describe alternatives you've considered (Optional) DataPrepper can document, that on full buffers or open circuit breakers, requests are rejected as non-retryable. This can happen for both cases or separately just for one, e.g. open circuit breaker.
Additional context Currently on an open curcuit breaker, the error message from the CircuitBreakingBuffer is contained in the DataPrepper response. This is managed by the GrpcRequestExceptionHandler, that generates a
RESOURCE_EXHAUSTED
response status. This is only retryable, if aRetryInfo
is present according to https://opentelemetry.io/docs/specs/otlp/#failures. This is currently not the case, hence the OpenTelemetry collector drops the events and does no retries.