opensearch-project / data-prepper

OpenSearch Data Prepper is a component of the OpenSearch project that accepts, filters, transforms, enriches, and routes data at scale.
https://opensearch.org/docs/latest/clients/data-prepper/index/
Apache License 2.0
262 stars 202 forks source link

Improve or Clarify Backpressure on OTel-Grpc-Sources #4119

Closed KarstenSchnitter closed 3 days ago

KarstenSchnitter commented 9 months ago

Is your feature request related to a problem? Please describe. Ingestion of events via Opentelemetry over GRPC can fail due to a full buffer or an open CircuitBreaker. In that case, an error response is sent. When testing with the OpenTelemetry Collector, this leads to dropped data by the collector. Reason for this is, that apparently the OTLP/gRPC Throttling specification is not used properly. The DataPrepper gRPC response needs to contain a proper RetryInfo.

Describe the solution you'd like DataPrepper should implement proper OTLP/gRPC throttling, when the ingress buffers are full or the circuit breaker is open.

Describe alternatives you've considered (Optional) DataPrepper can document, that on full buffers or open circuit breakers, requests are rejected as non-retryable. This can happen for both cases or separately just for one, e.g. open circuit breaker.

Additional context Currently on an open curcuit breaker, the error message from the CircuitBreakingBuffer is contained in the DataPrepper response. This is managed by the GrpcRequestExceptionHandler, that generates a RESOURCE_EXHAUSTEDresponse status. This is only retryable, if a RetryInfo is present according to https://opentelemetry.io/docs/specs/otlp/#failures. This is currently not the case, hence the OpenTelemetry collector drops the events and does no retries.

KarstenSchnitter commented 9 months ago

@dlvenable please let me know, if you consider the requests as retryable in the described scenarios. If so, I can try to provide a fix, that adds the missing RetryInfo.

dlvenable commented 9 months ago

@KarstenSchnitter , For both a full buffer and the circuit breaker, we can tell the client that the request can be retried.

What I'm unsure of is what time we can give back to the client. One option would be to use the request timeout. Perhaps there is something we can get from the buffer to hint at the time. But, I'm not sure.

KarstenSchnitter commented 8 months ago

To my understanding, to provide a RetryInfo, the GrpcRequestExceptionHandler needs to implement GoogleGrpcExceptionHandlerFunction. When providing com.google.rpc.Status the RetryInfo can be added as Details. I am unsure, if there is an alternative approach using armeria response headers.

For the retry delay I would suggest not to try determining buffer or circuit breaker stats. There is the option to keep some state in the exception handler to remember the last time, a RetryInfo was generated. When the next request fails during the returned delay period + some grace time, I suggest to double the delay reported in the next response. This will generate an exponential back-off. If the delay reaches some threshold, the RetryInfo could be dropped entirely. In that case a metric for "rejected" events might be helpful.

dlvenable commented 8 months ago

@KarstenSchnitter , That approach sounds reasonable. And we could allow for configuration of the initial retry and the maximum.