open-telemetry / opentelemetry-collector-contrib

Contrib repository for the OpenTelemetry Collector
https://opentelemetry.io
Apache License 2.0

OTel Agent Collector is not showing the correct value for the dropped spans metric #34279

Open edtshuma opened 3 months ago

edtshuma commented 3 months ago

Component(s)

No response

Describe the issue you're reporting

I have an OTEL Collector instance deployed in Gateway mode. When I query the metric for dropped spans (via the Grafana Explore menu), I get no data even though I experienced dropped spans at that exact timestamp. I would like to alert on "dropped span" events, and for that I am starting with the following query:

otelcol_processor_dropped_spans_total{cluster_name="orion", service_name="otelcol-contrib"} @1721123498

but the query returns a count of 0:

otelcol_processor_dropped_spans_total{cluster_name="orion",instance=":8888",job="otel-agent",processor="memory_limiter",service_instance_id="6b4xxxxx-fxxx-4xxx-axxx-e1fxxxxxxxxx",service_name="otelcol-contrib",service_version="0.104.0"} 0
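The alert I have in mind would be based on the counter's increase over a window rather than on the instant value, along these lines (the 5m window is just an example):

increase(otelcol_processor_dropped_spans_total{cluster_name="orion", service_name="otelcol-contrib"}[5m]) > 0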

The OTEL Gateway receives spans, logs, and metrics exported by agents running on multiple K8s clusters. On one of those clusters I experienced data loss on a traces pipeline. Using LogQL, I can confirm the dropped spans as shown below:

{namespace="monitoring", app="opentelemetry-collector", cluster_name="orion"} | json | level=~"error|warn" | ts=~"^1721123498.*"

and the output:

{"level":"error","ts":1721123498.01548,"caller":"exporterhelper/queue_sender.go:90","msg":"Exporting failed. Dropping data.","kind":"exporter","data_type":"traces","name":"zipkin/tempo","error":"no more retries left: failed the request with status code 413","dropped_items":2393,"stacktrace":"go.opentelemetry.io/collector/exporter/exporterhelper.newQueueSender.func1\n\tgo.opentelemetry.io/collector/exporter@v0.104.0/exporterhelper/queue_sender.go:90\ngo.opentelemetry.io/collector/exporter/internal/queue.(*boundedMemoryQueue[...]).Consume\n\tgo.opentelemetry.io/collector/exporter@v0.104.0/internal/queue/bounded_memory_queue.go:52\ngo.opentelemetry.io/collector/exporter/internal/queue.(*Consumers[...]).Start.func1\n\tgo.opentelemetry.io/collector/exporter@v0.104.0/internal/queue/consumers.go:43"}

What's strange is that when I use the otelcol_processor_refused_spans_total metric:

otelcol_processor_refused_spans_total{cluster_name="orion", service_name="otelcol-contrib"} @1721123498

I get some results:

otelcol_processor_refused_spans_total{cluster_name="orion",instance=":8888",job="otel-agent",processor="memory_limiter",service_instance_id="6bXXXXXX-fXXX-4XXX-aXXX-e1fXXXXXXXXX",service_name="otelcol-contrib",service_version="0.104.0"} 38111

Although this metric may work for alerting, I would ideally expect to get results from the more specific otelcol_processor_dropped_spans_total metric.
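For reference, the alerting rule I would build on the refused-spans counter looks roughly like this (group name, window, and threshold are placeholders):

groups:
  - name: otel-collector-spans
    rules:
      - alert: OtelCollectorRefusedSpans
        # fires if any spans were refused by a processor over the last 5 minutes
        expr: increase(otelcol_processor_refused_spans_total{cluster_name="orion", service_name="otelcol-contrib"}[5m]) > 0
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Spans are being refused by the memory_limiter on {{ $labels.cluster_name }}"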

What am I missing?

github-actions[bot] commented 1 month ago

This issue has been inactive for 60 days. It will be closed in 60 days if there is no activity. To ping code owners by adding a component label, see Adding Labels via Comments, or if you are unsure of which component this issue relates to, please ping @open-telemetry/collector-contrib-triagers. If this issue is still relevant, please ping the code owners or leave a comment explaining why it is still relevant. Otherwise, please close it.

atoulme commented 1 month ago

The status code returned is the clue here: 413 means the request entity was too large, so the spans were explicitly refused by the backend. This is the correct behavior.
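One common way to avoid the 413, assuming the backend's request-size limit cannot be raised, is to cap the number of spans per export request with the batch processor in front of the exporter, which indirectly bounds the payload size. A minimal sketch, where the values, endpoint, and receiver are placeholders:

receivers:
  otlp:
    protocols:
      grpc:

processors:
  memory_limiter:
    check_interval: 1s
    limit_percentage: 80
  batch:
    send_batch_size: 2048
    # hard cap so a single export request stays under the backend's size limit
    send_batch_max_size: 2048

exporters:
  zipkin/tempo:
    endpoint: https://tempo.example.com/api/v2/spans

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [memory_limiter, batch]
      exporters: [zipkin/tempo]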