Pinging code owners for exporter/kafka: @pavolloffay @MovieStoreGuy. See Adding Labels via Comments if you do not have permissions to add labels yourself.
We are still encountering this issue once or twice daily. It happens only in our environment with the biggest traffic. It is not a matter of resource consumption: resource usage is around 50% of the Kubernetes limits. We have three collector pods running, and the issue affects any number of them (1, 2, or 3) simultaneously (e.g., two pods stop producing at exactly the same moment while the third one keeps working as usual). Restarting the affected pods fixes the situation.
There is nothing in the logs (debug level).
We analyzed the pprof profiles and goroutine graphs for a working and a broken collector: the broken collector does not run the sarama producer goroutine.
I'm attaching the profile graphs here (for the ok and broken collectors, for comparison):
[attached images: Ok profile / Broken profile, Ok goroutine / Broken goroutine]
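For anyone reproducing this kind of analysis: profiles like the ones above can be collected by enabling the collector's pprof extension. A minimal sketch, assuming the extension's default listen address (not necessarily how these collectors are deployed):

```yaml
extensions:
  pprof:
    # default listen address of the pprof extension
    endpoint: localhost:1777

service:
  extensions: [pprof]
```

With this enabled, `go tool pprof http://localhost:1777/debug/pprof/goroutine` fetches the goroutine profile used for the kind of comparison attached above.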
We found an error with the debug/tracez endpoint: it looks like the publishing process silently terminates because of a context deadline exceeded in opentelemetry.proto.collector.trace.v1.traceservice/export.

EDIT: this doesn't happen for every broken pod; most of them don't record any error.
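The debug/tracez page is served by the zpages extension; a minimal sketch of enabling it, assuming the extension's default port:

```yaml
extensions:
  zpages:
    # default listen address; tracez is served at http://localhost:55679/debug/tracez
    endpoint: localhost:55679

service:
  extensions: [zpages]
```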
This issue has been inactive for 60 days. It will be closed in 60 days if there is no activity. To ping code owners by adding a component label, see Adding Labels via Comments, or if you are unsure of which component this issue relates to, please ping @open-telemetry/collector-contrib-triagers. If this issue is still relevant, please ping the code owners or leave a comment explaining why it is still relevant. Otherwise, please close it.
Pinging code owners:
See Adding Labels via Comments if you do not have permissions to add labels yourself.
We're encountering this same issue using:
otel/opentelemetry-collector-k8s:0.102.1
No errors are logged other than the send queue being full.
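Since the only external symptom is the queue filling up, the collector's own metrics are the earliest signal. A minimal sketch of exposing them, assuming the default self-telemetry address (an assumption, not the configuration used in either report):

```yaml
service:
  telemetry:
    metrics:
      # assumption: default self-metrics endpoint; otelcol_exporter_sent_spans and
      # the exporter queue metrics are exposed here in Prometheus format
      address: 0.0.0.0:8888
```

Alerting on otelcol_exporter_sent_spans dropping to zero, as described in the original report, catches the broken pod earlier than waiting for queue-full errors.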
This issue has been inactive for 60 days. It will be closed in 60 days if there is no activity. To ping code owners by adding a component label, see Adding Labels via Comments, or if you are unsure of which component this issue relates to, please ping @open-telemetry/collector-contrib-triagers. If this issue is still relevant, please ping the code owners or leave a comment explaining why it is still relevant. Otherwise, please close it.
Pinging code owners:
See Adding Labels via Comments if you do not have permissions to add labels yourself.
This issue has been closed as inactive because it has been stale for 120 days with no activity.
Component(s)
No response
What happened?
Description
The otel-collector randomly stops sending spans. We encountered this situation twice this week. It happens to just one of the collector pods; the rest work correctly. We are notified by an alert about the sending queue being full; after inspecting the pod metrics, it turns out that this is caused by otelcol_exporter_sent_spans dropping to 0.
There is nothing in the logs before the error about the sending queue being full.
Are there some additional ways to diagnose the issue before resorting to pprof?
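For reference, the settings that interact with the symptoms described above (queue filling up, exports timing out with context deadline exceeded) are the kafka exporter's timeout, retry, and queue options. A minimal sketch with placeholder values, not the configuration actually in use here:

```yaml
exporters:
  kafka:
    brokers: [kafka:9092]      # placeholder broker address
    protocol_version: 2.0.0    # placeholder; match the real broker version
    timeout: 5s                # per-request export timeout (the default)
    retry_on_failure:
      enabled: true
    sending_queue:
      enabled: true
      queue_size: 1000         # default; the queue-full alert fires once this is exhausted
```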
Steps to Reproduce
Expected Result
Actual Result
Collector version
0.95.0
Environment information
Environment
https://github.com/utilitywarehouse/opentelemetry-manifests
OpenTelemetry Collector configuration
Log output
No response
Additional context
No response