mx-psi opened 6 months ago
Hey @dmitryax and @mx-psi, we are helping out with the HTTP use cases and want to confirm a few things:
The suggestion was to build a receiver helper method that wraps log consumption with the memory limiter. But this means the memory limiting is handled after the HTTP connection is created and the data is received on the server, which still poses the problem of too many resources being used.
Instead, there should be some kind of HTTP interceptor (similar to authInterceptor) with the memory limiter baked in, so users can just plug it in when they spawn the HTTP server. Is my understanding correct?
An interceptor sounds like a good approach. We should keep the requirements in mind (I wrote a list in the original post; we can discuss them if they are too stringent, but they seem like a reasonable list for starting to think about this).
@mx-psi I am not sure this is the right approach; here are some use cases that the new design (if any) should cover:
We should actually probably send RESOURCE_EXHAUSTED back to the client (RESOURCE_EXHAUSTED, Retry-After for back-pressure, PERMISSION_DENIED, etc.). I personally think we should adopt the gRPC status codes (which inside Google are actually called Task::Status, because they are generic enough for almost all use cases, including disk operations, RPCs, etc.) and return a Status-like error from our Consumer funcs. Then the receiver can read these "canonical" codes and do what is appropriate for them.
I know that in some scenarios we may be able to reduce memory pressure earlier (e.g. an http.Handler may reject the request before parsing/uncompressing, etc.), but I see this as an optimization of the more generic problem we have right now, which is that we don't have a set of canonical codes/errors returned from the pipeline components that would let us understand and react upstream.
To clarify the reported issue: right now, the memory_limiter processor doesn't drop any data; it just rejects it, and receivers decide what to do with that. Push-based gRPC/HTTP receivers should propagate the error to the caller as a retriable error.
I like the idea of standardizing on the gRPC codes in the collector, as @bogdandrutu suggested. However, I think this can probably be scoped as a separate effort. If we define that using memory_limiter on the receiver side should return 429 and RESOURCE_EXHAUSTED when the memory limit is reached, it won't affect the adoption of gRPC status codes from the caller perspective, right?
I think the problem this issue should solve is that the memory_limiter as a processor is not efficient at preventing the collector from running out of memory for non-OTLP/gRPC receivers. Even if the memory_limiter processor reaches the threshold, receivers still accept requests, which can take a significant amount of memory before the memory_limiter can reject the pdata. Moving it to the receiver side introduces better protection from OOMs.
The memory_limiter processor allows end-users to drop telemetry data when a memory limit is reached. It is up to receivers to apply backpressure and handle retrying of errors. A large class of receivers spin up one or more servers that listen for telemetry in a particular format, map it to pdata, and send it along to the next consumer. For protocols like HTTP and gRPC there are standard ways (429, RESOURCE_EXHAUSTED) to report that memory has been exhausted and apply backpressure to clients.
We should build a server receiver helper that handles this. In particular it should:
Having this helper available, and tackling this for scraper helpers as well, may remove the need for a memory_limiter processor.