mx-psi opened 6 months ago
Hey @dmitryax and @mx-psi, we are helping out with the HTTP use cases and want to confirm a few things:
The suggestion was to build a receiver helper method that wraps log consumption with the memory limiter. But this means the memory limiting is handled after the HTTP connection is created and the data is received on the server, which still poses the problem of too many resources being used.
Instead, there should be some kind of HTTP interceptor (similar to authInterceptor) with the memory limiter baked in, so users can just plug it in when they spawn the HTTP server. Is my understanding correct?
An interceptor sounds like a good approach. We should keep the requirements in mind (I wrote a list in the original post; we can discuss them if they are too stringent, but they seem like a reasonable list for starting to think about this).
@mx-psi I am not sure this is the right approach; here are some use cases that the new design (if any) should cover:
We should actually probably send RESOURCE_EXHAUSTED back to the client (RESOURCE_EXHAUSTED, Retry-After for back-pressure, PERMISSION_DENIED, etc.). I personally think we should adopt the gRPC status codes (which inside Google are actually called Task::Status, because they are generic enough for almost all use cases, including disk operations, RPCs, etc.) and return a Status-like error from our Consumer funcs. Then the receiver can read these "canonical" codes and do what is appropriate for them.
I know that in some scenarios we may be able to reduce memory pressure earlier (e.g. an http.Handler may reject the request before parsing/uncompressing, etc.), but I see this as an optimization of the more generic problem we have right now, which is that we don't have a set of canonical codes/errors returned from the pipeline components that would let us understand and react upstream.
To clarify the reported issue: right now, the memory_limiter processor doesn't drop any data; it just rejects it, and receivers decide what to do with that. Push-based gRPC/HTTP receivers should propagate the error to the caller as a retriable error.
I like the idea of standardizing on the gRPC codes in the collector, as @bogdandrutu suggested. However, I think this can probably be scoped as a separate effort. If we define that using memory_limiter on the receiver side should return 429 and RESOURCE_EXHAUSTED when the memory limit is reached, it won't affect the adoption of gRPC status codes from the caller perspective, right?
I think the problem this issue should solve is that the memory_limiter as a processor is not efficient at preventing the collector from running out of memory for non-OTLP/gRPC receivers. Even if the memory_limiter processor reaches the threshold, receivers still accept requests, which can take a significant amount of memory before the memory_limiter can reject the pdata. Moving it to the receiver side introduces better protection from OOMs.
The memory_limiter processor allows end-users to drop telemetry data when a memory limit is reached. It is up to receivers to apply backpressure and handle retrying of errors. A large class of receivers spin up one or more servers that listen for telemetry in a particular format, map it to pdata, and send it along to the next consumer. For protocols like HTTP and gRPC there are standard ways (429, RESOURCE_EXHAUSTED) to report that memory has been exhausted and apply backpressure to clients.
We should build a server receiver helper that handles this. In particular it should:
Having this helper available, and tackling this for scraper helpers as well, may remove the need for a memory_limiter processor.