open-telemetry / opentelemetry-collector

OpenTelemetry Collector
https://opentelemetry.io
Apache License 2.0

exporterhelper support for back-pressure #8245

Open jmacd opened 1 year ago

jmacd commented 1 year ago

Is your feature request related to a problem? Please describe.

For an exporter that uses exporterhelper, I have the following requirements: (1) when the queue is full, the caller should be able to block instead of receiving an immediate failure, and (2) the caller should be able to wait for the response from the downstream receiver, so that it observes the real error.

These two requirements need to be met whether or not retries are enabled, and whether or not there is a queue, meaning whether or not QueueSettings.Enabled is true. The problem is that when QueueSettings.Enabled is true, which enables parallelism (i.e., NumConsumers > 1), the caller does not get the error from the downstream receiver and has no way to block until the receiver returns.

Here is where today's exporterhelper with QueueSettings.Enabled==false blocks until the downstream call returns and passes the resulting error back to the caller: https://github.com/open-telemetry/opentelemetry-collector/blob/ebed8dd0943a98447d7f0e265cf58ec197bff93a/exporter/exporterhelper/queued_retry.go#L291

Here is where, when QueueSettings.Enabled==true and the queue is full, the code does not block and instead returns a generic error immediately: https://github.com/open-telemetry/opentelemetry-collector/blob/ebed8dd0943a98447d7f0e265cf58ec197bff93a/exporter/exporterhelper/queued_retry.go#L312

Here is where, when QueueSettings.Enabled==true and the queue is not full, the code returns nil without waiting to hear from the downstream receiver: https://github.com/open-telemetry/opentelemetry-collector/blob/ebed8dd0943a98447d7f0e265cf58ec197bff93a/exporter/exporterhelper/queued_retry.go#L316
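To make the three paths concrete, here is a minimal Go sketch of the behavior described at the linked lines. It is a paraphrase, not the actual queued_retry.go source, and the identifiers (queuedSender, errQueueIsFull, export) are illustrative:

```go
// Paraphrase of the three code paths linked above; not the actual
// queued_retry.go source. Names are illustrative.
package sketch

import (
	"context"
	"errors"
)

var errQueueIsFull = errors.New("sending queue is full")

type request any

type queuedSender struct {
	queueEnabled bool
	queue        chan request
	// export performs the synchronous downstream send.
	export func(ctx context.Context, req request) error
}

func (qs *queuedSender) send(ctx context.Context, req request) error {
	if !qs.queueEnabled {
		// Path 1: queue disabled. Block until the downstream call
		// returns and hand its error back to the caller.
		return qs.export(ctx, req)
	}
	select {
	case qs.queue <- req:
		// Path 3: queue accepted the item. Return nil immediately,
		// without waiting to hear from the downstream receiver.
		return nil
	default:
		// Path 2: queue full. Fail right away with a generic error
		// instead of blocking.
		return errQueueIsFull
	}
}
```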

As it stands, the only way for an exporter to block on the downstream service and receive its responses synchronously is to disable queueing. When queueing is enabled, there are two problems: (a) no option to block on queue-full, and (b) no option to wait on the response.

Describe the solution you'd like

Optional behavior that enables blocking when the queue is full and waiting for a response.
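To illustrate the ask, the options might look like two additional fields on the queue settings. These field names are hypothetical, not an accepted API:

```go
// Hypothetical extension of exporterhelper's queue settings; the Blocking
// and WaitForResult fields do not exist and only illustrate the request.
package sketch

type QueueSettings struct {
	Enabled      bool // existing: turn the queue on or off
	NumConsumers int  // existing: number of queue consumers
	QueueSize    int  // existing: maximum queue length

	// Blocking, if true, would make a caller wait (bounded by its
	// context deadline) when the queue is full, instead of failing
	// immediately with a generic error.
	Blocking bool

	// WaitForResult, if true, would make a caller block until the
	// downstream receiver responds, so the real error is propagated.
	WaitForResult bool
}
```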

Describe alternatives you've considered

No alternatives come to mind.

Additional context

The OTel-Arrow project has developed a bridging mode for OpenTelemetry Collectors that enables high compression on the network link between two collectors. For this setup to work reliably and under high load, we need the otelarrow exporter to be configured with a blocking pipeline. We would recommend both of the optional behaviors described in this issue to otelarrow users.

dmitryax commented 1 year ago

I'm not sure I fully understand the ask here. Enabling the queue introduces asynchronous behavior by design. Why do you need the queue if there is a requirement for the blocking/synchronous pipeline? What's the purpose of the queue in that case?

jmacd commented 1 year ago

I am more-or-less requesting a feature that I would expect from the batchprocessor.

As we expect to move batchprocessor functionality into the exporterhelper, I expect the queue to become the place where requests wait to be batched and sent. When I disable the queue, there is no limit on the number of concurrent senders and no possibility of batching.

When I enable the queue, I expect to get batching. More importantly, I get a limit on the number of outgoing RPCs (i.e., NumConsumers), which can be important for load balancing.
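For illustration, that concurrency bound behaves like a fixed worker pool: NumConsumers goroutines drain the shared queue, so at most that many export RPCs are in flight at once. A generic sketch, not the collector's implementation:

```go
// Generic worker-pool sketch of how NumConsumers caps concurrent sends;
// not the collector's implementation.
package sketch

import "context"

// startConsumers launches numConsumers workers draining queue, so the
// number of concurrent export RPCs is bounded by numConsumers.
func startConsumers(ctx context.Context, numConsumers int,
	queue <-chan any, export func(context.Context, any) error) {
	for i := 0; i < numConsumers; i++ {
		go func() {
			for req := range queue {
				_ = export(ctx, req) // one in-flight RPC per worker
			}
		}()
	}
}
```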

jmacd commented 1 year ago

I will put this another way: I believe the batching facility should be independent of the facility for back-pressure.

When the exporterhelper has the back-pressure option I'm looking for and queueing is enabled, the caller has to stick around until each data item is processed and then return the resulting error (while also honoring the caller's context timeout).

I've been working on the OTel Arrow project, which has a gRPC-streaming exporter that does the same sort of thing. For each incoming request, it allocates a make(chan error, 1) to receive the response, places that channel into the queue along with the data, and then falls into a select awaiting either the response or the timeout.

See the code: https://github.com/open-telemetry/otel-arrow/blob/95211c5a139c84f7117905531dd45e2122778f06/collector/exporter/otelarrowexporter/internal/arrow/stream.go#L423
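In outline, the pattern looks like this self-contained sketch (identifiers are illustrative, not the otelarrowexporter source):

```go
// Illustrative sketch of the channel-per-request pattern described above;
// not the otelarrowexporter source.
package main

import (
	"context"
	"fmt"
	"time"
)

type writeItem struct {
	data  string     // payload to send downstream
	errCh chan error // capacity 1: the consumer posts the result here
}

type sender struct {
	queue chan writeItem
}

// send blocks the caller until the downstream consumer reports a result or
// the caller's context expires, so errors propagate synchronously even
// though the queue itself is asynchronous.
func (s *sender) send(ctx context.Context, data string) error {
	item := writeItem{data: data, errCh: make(chan error, 1)}

	// Back-pressure on queue-full: block on the enqueue, bounded by ctx.
	select {
	case s.queue <- item:
	case <-ctx.Done():
		return ctx.Err()
	}

	// Await either the downstream response or the timeout.
	select {
	case err := <-item.errCh:
		return err
	case <-ctx.Done():
		return ctx.Err()
	}
}

// consume drains the queue; a real exporter would run NumConsumers of these.
func (s *sender) consume() {
	for item := range s.queue {
		time.Sleep(10 * time.Millisecond) // stand-in for a downstream RPC
		item.errCh <- nil                 // report the downstream result
	}
}

func main() {
	s := &sender{queue: make(chan writeItem, 4)}
	go s.consume()

	ctx, cancel := context.WithTimeout(context.Background(), time.Second)
	defer cancel()
	fmt.Println("send result:", s.send(ctx, "payload"))
}
```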

jmacd commented 3 months ago

I am happy with the latest developments in the batching support. However, we still require the queue sender to get concurrency in the batching path, so without it we have back-pressure but an unacceptable loss of concurrency. Therefore, I will leave this open. @moh-osman3 is looking into a concurrency-sender option that would let us replace the concurrentbatchprocessor.