Closed jesusvazquez closed 5 months ago
Since our recommended OTel ingestion architecture batches metrics, spans, and logs with the OpenTelemetry Collector Batch Processor (default send_batch_size=8192) and then exports them with the OTel Collector OTLP HTTP Exporter, shall we verify that the error message is human-readable in the OpenTelemetry Collector logs?
@ying-jeanne and I are picking this up :) @jesusvazquez and I are leaning towards writing a new distributor.handler function specifically for the OTLP endpoint, since it will probably be easier to maintain than sharing the handler with the normal remote write endpoint.
This OTLP-specific handler should respond with Status protobuf messages for error cases.
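A minimal sketch of what such an OTLP-specific error path could look like. This is not Mimir's actual handler: `writeOTLPError` and `encodeStatus` are hypothetical names, the request is assumed to have been sent as protobuf (so the response uses `application/x-protobuf`), and the Status message is hand-encoded only to keep the example dependency-free. Real code would more likely build a `google.rpc.Status` value and marshal it with the protobuf library.

```go
package main

import (
	"encoding/binary"
	"fmt"
	"net/http"
	"net/http/httptest"
)

// encodeStatus hand-encodes the first two fields of a google.rpc.Status
// message (field 1: int32 code, field 2: string message) in protobuf wire
// format. Hypothetical helper, hand-rolled to avoid external dependencies.
func encodeStatus(code int32, msg string) []byte {
	var b []byte
	if code != 0 {
		b = append(b, 0x08) // field 1, wire type 0 (varint)
		b = binary.AppendUvarint(b, uint64(code))
	}
	if msg != "" {
		b = append(b, 0x12) // field 2, wire type 2 (length-delimited)
		b = binary.AppendUvarint(b, uint64(len(msg)))
		b = append(b, msg...)
	}
	return b
}

// writeOTLPError is a hypothetical helper that replies with a marshaled
// Status message instead of a plain string body.
func writeOTLPError(w http.ResponseWriter, httpCode int, grpcCode int32, msg string) {
	// Assumption: the request was protobuf-encoded, so the response is too.
	w.Header().Set("Content-Type", "application/x-protobuf")
	w.WriteHeader(httpCode)
	_, _ = w.Write(encodeStatus(grpcCode, msg))
}

func main() {
	rec := httptest.NewRecorder()
	// 3 is the gRPC code for INVALID_ARGUMENT; the message is a placeholder.
	writeOTLPError(rec, http.StatusBadRequest, 3, "example validation error")
	fmt.Println(rec.Code, rec.Header().Get("Content-Type"), rec.Body.Len())
}
```

Keeping this in a dedicated handler means the remote write endpoint can continue to return plain-string bodies while only the OTLP endpoint changes its error format.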
Describe the bug
The handler that wraps the OTLP endpoint on distributors replies with a plain string in the body. That is the normal behavior for all Mimir endpoints, but it's not what the OTel spec defines.
In https://opentelemetry.io/docs/specs/otlp/#failures-1 we can see that:
Since Mimir only supports OTLP over HTTP, we can go to the otlphttpexporter in the collector and confirm that the client expects the response body to be a marshaled proto of the Status struct.
Because Mimir replies with a plain string instead of what the client and the spec expect, when collectors write to Mimir and get a 400 error, Mimir's full error message does not appear in the collector logs, which makes it very hard to troubleshoot the original issue.
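To illustrate why the message gets lost, here is a rough sketch of the client side. It is not the otlphttpexporter's code: `statusMessage` is a hypothetical, simplified decoder that only understands the wire types used by the `code` and `message` fields of Status, and both sample bodies are made up. The point is that a plain-text body does not parse as a Status message, so a client that relies on that parse has no message to log.

```go
package main

import (
	"encoding/binary"
	"errors"
	"fmt"
)

var errMalformed = errors.New("body is not a valid protobuf Status")

// statusMessage extracts field 2 (message) from a protobuf-encoded Status.
// Simplified sketch: it accepts only varint and length-delimited fields and
// rejects every other wire type.
func statusMessage(body []byte) (string, error) {
	var msg string
	for len(body) > 0 {
		tag, n := binary.Uvarint(body)
		if n <= 0 || tag>>3 == 0 {
			return "", errMalformed
		}
		body = body[n:]
		field, wireType := tag>>3, tag&7
		switch wireType {
		case 0: // varint, e.g. field 1 (code); value is skipped
			_, n := binary.Uvarint(body)
			if n <= 0 {
				return "", errMalformed
			}
			body = body[n:]
		case 2: // length-delimited, e.g. field 2 (message)
			l, n := binary.Uvarint(body)
			if n <= 0 || l > uint64(len(body)-n) {
				return "", errMalformed
			}
			if field == 2 {
				msg = string(body[n : n+int(l)])
			}
			body = body[n+int(l):]
		default:
			return "", errMalformed
		}
	}
	return msg, nil
}

func main() {
	bodies := [][]byte{
		[]byte("\x08\x03\x12\x03bad"),         // Status{code: 3, message: "bad"}
		[]byte("the request body is invalid"), // plain string body (made up)
	}
	for _, body := range bodies {
		msg, err := statusMessage(body)
		fmt.Printf("message=%q err=%v\n", msg, err)
	}
}
```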
To Reproduce
Steps to reproduce the behavior:
Example script to reproduce this behavior
And here is an example run
Expected behavior
The endpoint should reply with marshaled Status proto bytes, so that the concise Mimir errors show up in the collector logs.