Open · narasimharaojm opened 1 year ago
Observed a new panic (runtime error):
panic: runtime error: invalid memory address or nil pointer dereference [signal SIGSEGV: segmentation violation code=0x1 addr=0x0 pc=0x35a0519]
goroutine 241 [running]:
go.opentelemetry.io/collector/pdata/ptrace.ResourceSpans.Resource(...)
	go.opentelemetry.io/collector/pdata@v1.0.0-rcv0011/ptrace/generated_resourcespans.go:58
github.com/open-telemetry/opentelemetry-collector-contrib/processor/tailsamplingprocessor/internal/sampling.hasResourceOrSpanWithCondition({0x65f9b01?}, 0xc000b50a60, 0xc000b50a78?)
	github.com/open-telemetry/opentelemetry-collector-contrib/processor/tailsamplingprocessor@v0.76.3/internal/sampling/util.go:32 +0x59
github.com/open-telemetry/opentelemetry-collector-contrib/processor/tailsamplingprocessor/internal/sampling.(*stringAttributeFilter).Evaluate(0xc000b58750, {0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, ...}, ...)
	github.com/open-telemetry/opentelemetry-collector-contrib/processor/tailsamplingprocessor@v0.76.3/internal/sampling/string_tag_filter.go:135 +0x130
github.com/open-telemetry/opentelemetry-collector-contrib/processor/tailsamplingprocessor/internal/sampling.(*And).Evaluate(0xc000000002?, {0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, ...}, ...)
	github.com/open-telemetry/opentelemetry-collector-contrib/processor/tailsamplingprocessor@v0.76.3/internal/sampling/and.go:44 +0x6d
github.com/open-telemetry/opentelemetry-collector-contrib/processor/tailsamplingprocessor.(*tailSamplingSpanProcessor).makeDecision(0xc000e15860, {0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, ...}, ...)
	github.com/open-telemetry/opentelemetry-collector-contrib/processor/tailsamplingprocessor@v0.76.3/processor.go:230 +0x1c4
github.com/open-telemetry/opentelemetry-collector-contrib/processor/tailsamplingprocessor.(*tailSamplingSpanProcessor).samplingPolicyOnTick(0xc000e15860)
	github.com/open-telemetry/opentelemetry-collector-contrib/processor/tailsamplingprocessor@v0.76.3/processor.go:187 +0x1a9
github.com/open-telemetry/opentelemetry-collector-contrib/internal/coreinternal/timeutils.(*PolicyTicker).OnTick(...)
	github.com/open-telemetry/opentelemetry-collector-contrib/internal/coreinternal@v0.76.3/timeutils/ticker_helper.go:56
github.com/open-telemetry/opentelemetry-collector-contrib/internal/coreinternal/timeutils.(*PolicyTicker).Start.func1()
	github.com/open-telemetry/opentelemetry-collector-contrib/internal/coreinternal@v0.76.3/timeutils/ticker_helper.go:47 +0x2e
created by github.com/open-telemetry/opentelemetry-collector-contrib/internal/coreinternal/timeutils.(*PolicyTicker).Start
	github.com/open-telemetry/opentelemetry-collector-contrib/internal/coreinternal@v0.76.3/timeutils/ticker_helper.go:43 +0xb0
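For reference, the panic surfaces in the helper that walks every ResourceSpans entry while the string_attribute policy is evaluated on the decision tick. Below is a simplified sketch of that call path, reconstructed from the stack trace above (not the exact upstream source): it suggests the ptrace.Traces value handed to Evaluate carries a nil internal pointer, so the generated Resource() accessor dereferences nil.

package sampling

import (
	"go.opentelemetry.io/collector/pdata/pcommon"
	"go.opentelemetry.io/collector/pdata/ptrace"
)

// Decision mirrors the sampling decision type used by the policies in the stack trace.
type Decision int

const (
	NotSampled Decision = iota
	Sampled
)

// hasResourceOrSpanWithCondition is a simplified reconstruction of the helper at
// internal/sampling/util.go:32 from the trace above (illustrative, not the exact source).
// The SIGSEGV is reported on rs.Resource(): the ResourceSpans entry (or the Traces value
// it was taken from) is effectively a zero value whose internal pointer is nil.
func hasResourceOrSpanWithCondition(
	td ptrace.Traces,
	shouldSampleResource func(resource pcommon.Resource) bool,
	shouldSampleSpan func(span ptrace.Span) bool,
) Decision {
	for i := 0; i < td.ResourceSpans().Len(); i++ {
		rs := td.ResourceSpans().At(i)
		if shouldSampleResource(rs.Resource()) { // <- panics here per the stack trace
			return Sampled
		}
		for j := 0; j < rs.ScopeSpans().Len(); j++ {
			spans := rs.ScopeSpans().At(j).Spans()
			for k := 0; k < spans.Len(); k++ {
				if shouldSampleSpan(spans.At(k)) {
					return Sampled
				}
			}
		}
	}
	return NotSampled
}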
Pinging code owners for processor/tailsampling: @jpkrohling. See Adding Labels via Comments if you do not have permissions to add labels yourself.
This issue has been inactive for 60 days. It will be closed in 60 days if there is no activity. To ping code owners by adding a component label, see Adding Labels via Comments, or if you are unsure of which component this issue relates to, please ping @open-telemetry/collector-contrib-triagers. If this issue is still relevant, please ping the code owners or leave a comment explaining why it is still relevant. Otherwise, please close it.
Pinging code owners:
See Adding Labels via Comments if you do not have permissions to add labels yourself.
Component(s)
tailsamplingprocessor
What happened?
Description
Observing back pressure in loadbalancing exporter due to instability with tail sampling processor.
As per the config option num_traces in tail_sampling, the tail sampling processor allocates memory for the specified number of traces. As long as the tail sampling processor stays below the num_traces limit, trace data is ingested from the loadbalancing exporter, sampled in the tail sampling processor, and exported to the backend. However, once the tail sampling processor hits the num_traces limit, the loadbalancing exporter experiences connection issues with the tail sampling cluster.
Sample error observed in the loadbalancing exporter layer when the tail sampling layer hits the num_traces limit:
2023-05-12T14:20:47.050-0700	error	exporterhelper/queued_retry.go:367	Exporting failed. Try enabling retry_on_failure config option to retry on retryable errors	{"kind": "exporter", "data_type": "traces", "name": "loadbalancing", "error": "rpc error: code = Unavailable desc = connection error: desc = \"transport: Error while dialing: dial tcp 10.176.27.219:4317: connect: connection refused\""}
go.opentelemetry.io/collector/exporter/exporterhelper.(*retrySender).send
	go.opentelemetry.io/collector/exporter@v0.76.1/exporterhelper/queued_retry.go:367
go.opentelemetry.io/collector/exporter/exporterhelper.(*tracesExporterWithObservability).send
	go.opentelemetry.io/collector/exporter@v0.76.1/exporterhelper/traces.go:137
go.opentelemetry.io/collector/exporter/exporterhelper.(*queuedRetrySender).start.func1
	go.opentelemetry.io/collector/exporter@v0.76.1/exporterhelper/queued_retry.go:205
go.opentelemetry.io/collector/exporter/exporterhelper/internal.(*boundedMemoryQueue).StartConsumers.func1
	go.opentelemetry.io/collector/exporter@v0.76.1/exporterhelper/internal/bounded_memory_queue.go:58
I have tried raising the num_traces limit, but the tail sampling processor eventually catches up with the new limit as well and again creates back pressure in the loadbalancing exporter cluster, i.e., the LB exporter sees connection-refused errors from the tail sampling cluster. I have also verified the nodes in the tail sampling cluster, and they are healthy.
I have attached a couple of screenshots showing that when the number of traces in memory hits the num_traces limit, there is a correlated increase in the traces-send-failed rate in the loadbalancing exporter.
Currently we are ingesting ~8M spans/minute and ~2.5M traces/minute; at that rate, num_traces: 20000000 corresponds to only about 8 minutes of traces held in memory.
Initial tail sampling config in the tail sampling processing cluster:

tail_sampling:
  decision_wait: 60s
  num_traces: 20000000
  expected_new_traces_per_sec: 20000
Memory limiter config in the tail sampling processing cluster:

memory_limiter:
  check_interval: 2s
  limit_mib: 50000
  spike_limit_mib: 10000
Loadbalancing exporter config in the LB cluster:

loadbalancing:
  protocol:
    otlp:
      timeout: 1s
      tls:
        insecure: true
      sending_queue:
        enabled: true
        num_consumers: 100
        queue_size: 2000000
      retry_on_failure:
        enabled: false
  resolver:
    dns:
      hostname: tail-sampling-dns-name
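As the queued_retry error above suggests, one option on the LB side would be enabling retry_on_failure so transient connection-refused errors are retried instead of dropped. A sketch of the same exporter block with retries turned on (the backoff values are illustrative, not something we have validated):

loadbalancing:
  protocol:
    otlp:
      timeout: 1s
      tls:
        insecure: true
      sending_queue:
        enabled: true
        num_consumers: 100
        queue_size: 2000000
      retry_on_failure:
        enabled: true          # retry retryable errors such as "connection refused"
        initial_interval: 5s   # illustrative backoff values
        max_interval: 30s
        max_elapsed_time: 300s
  resolver:
    dns:
      hostname: tail-sampling-dns-name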
Collector version
v0.76.1
Environment information
Environment
OS: Ubuntu 20.04
OpenTelemetry Collector configuration
Log output
Additional context
No response