Open · edtshuma opened this issue 3 months ago
The status code returned is the clue here: 413 means the request entity was too large, so the spans were explicitly refused. This is the correct behavior.
Component(s)
No response
Describe the issue you're reporting
I have an OTEL Collector instance deployed in Gateway mode. When I query the metric for dropped spans (via the Grafana Explore menu) I get no data, even though I experienced dropped spans at that exact timestamp. I would like to alert on "dropped span" events, and for that I am starting with the following query:
otelcol_processor_dropped_spans_total{cluster_name="orion", service_name="otelcol-contrib"} @1721123498
but the query returns a count of 0:
otelcol_processor_dropped_spans_total{cluster_name="orion",instance=":8888",job="otel-agent",processor="memory_limiter",service_instance_id="6b4xxxxx-fxxx-4xxx-axxx-e1fxxxxxxxxx",service_name="otelcol-contrib",service_version="0.104.0"} 0
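For the alert itself I would not pin the query to a single timestamp with the `@` modifier, but rather look at the increase of the counter over a rolling window, something along these lines (the 5m window is an arbitrary choice on my part):

```promql
increase(
  otelcol_processor_dropped_spans_total{cluster_name="orion", service_name="otelcol-contrib"}[5m]
) > 0
```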
The OTEL Gateway receives spans, logs and metrics exported by agents running on multiple K8s clusters. On one of the K8s clusters I experienced data loss on a traces pipeline. Using LogQL I can confirm the dropped spans, as shown below:
{namespace="monitoring", app="opentelemetry-collector", cluster_name="orion"} | json | level=~"error|warn" | ts=~"^1721123498.*"
and the output:
{"level":"error","ts":1721123498.01548,"caller":"exporterhelper/queue_sender.go:90","msg":"Exporting failed. Dropping data.","kind":"exporter","data_type":"traces","name":"zipkin/tempo","error":"no more retries left: failed the request with status code 413","dropped_items":2393,"stacktrace":"go.opentelemetry.io/collector/exporter/exporterhelper.newQueueSender.func1\n\tgo.opentelemetry.io/collector/exporter@v0.104.0/exporterhelper/queue_sender.go:90\ngo.opentelemetry.io/collector/exporter/internal/queue.(*boundedMemoryQueue[...]).Consume\n\tgo.opentelemetry.io/collector/exporter@v0.104.0/internal/queue/bounded_memory_queue.go:52\ngo.opentelemetry.io/collector/exporter/internal/queue.(*Consumers[...]).Start.func1\n\tgo.opentelemetry.io/collector/exporter@v0.104.0/internal/queue/consumers.go:43"}
What's strange is that when I use the otelcol_processor_refused_spans_total metric:
otelcol_processor_refused_spans_total{cluster_name="orion", service_name="otelcol-contrib"} @1721123498
I get some results:
otelcol_processor_refused_spans_total{cluster_name="orion",instance=":8888",job="otel-agent",processor="memory_limiter",service_instance_id="6bXXXXXX-fXXX-4XXX-aXXX-e1fXXXXXXXXX",service_name="otelcol-contrib",service_version="0.104.0"} 38111
Although this metric may work for alerting, I would ideally expect to get results from the more specific otelcol_processor_dropped_spans_total metric.
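One thing I notice in the log entry above is that it is emitted by an exporter (kind "exporter", name "zipkin/tempo") rather than by a processor, so perhaps the counter to watch is an exporter-scoped one rather than otelcol_processor_dropped_spans_total. Assuming the collector's internal telemetry exposes a send-failed counter under a name like the one below (I have not verified the exact metric name on my /metrics endpoint), the alert expression would look similar:

```promql
increase(
  otelcol_exporter_send_failed_spans_total{cluster_name="orion", service_name="otelcol-contrib"}[5m]
) > 0
```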
What am I missing?