open-telemetry / opentelemetry-collector-contrib

Contrib repository for the OpenTelemetry Collector
https://opentelemetry.io
Apache License 2.0

Instability with tailsampling processor is creating back pressure into loadbalancing exporter #21903

Open narasimharaojm opened 1 year ago

narasimharaojm commented 1 year ago

Component(s)

tailsamplingprocessor

What happened?

Observing back pressure in loadbalancing exporter due to instability with tail sampling processor.

Description

Observing back pressure in loadbalancing exporter due to instability with tail sampling processor.

As per the num_traces config option in tail_sampling, the tail sampling processor allocates memory for the specified number of traces. As long as the tail sampling processor stays below the num_traces limit, trace data is ingested from the loadbalancing exporter, sampled in the tail sampling processor, and exported to the backend. However, once the tail sampling processor hits the num_traces limit, the loadbalancing exporter starts seeing connection issues with the tail sampling cluster.
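
For illustration, here is a minimal sketch of the kind of fixed-capacity trace buffer I understand num_traces to control: once the buffer holds the configured number of trace IDs, accepting a span for a new trace ID evicts the oldest buffered trace. This is only a mental model in plain Go with made-up names (traceBuffer, newTraceBuffer, add), not the actual tailsamplingprocessor implementation:

package main

import "fmt"

// traceBuffer sketches a fixed-capacity store of traces awaiting a sampling
// decision. Assumption: a plain FIFO ring of trace IDs; the real processor
// differs in detail.
type traceBuffer struct {
    numTraces int
    order     []string            // trace IDs in arrival order (ring)
    next      int                 // slot to overwrite once the ring is full
    traces    map[string][]string // trace ID -> span names held for the decision
}

func newTraceBuffer(numTraces int) *traceBuffer {
    return &traceBuffer{
        numTraces: numTraces,
        order:     make([]string, 0, numTraces),
        traces:    make(map[string][]string, numTraces),
    }
}

// add stores a span under its trace ID; when the buffer is already at
// capacity and a new trace ID arrives, the oldest trace is evicted before
// its decision is made.
func (b *traceBuffer) add(traceID, spanName string) {
    if _, ok := b.traces[traceID]; !ok {
        if len(b.order) < b.numTraces {
            b.order = append(b.order, traceID)
        } else {
            evicted := b.order[b.next]
            delete(b.traces, evicted)
            b.order[b.next] = traceID
            b.next = (b.next + 1) % b.numTraces
        }
    }
    b.traces[traceID] = append(b.traces[traceID], spanName)
}

func main() {
    b := newTraceBuffer(2)
    b.add("trace-1", "GET /a")
    b.add("trace-2", "GET /b")
    b.add("trace-3", "GET /c") // capacity hit: trace-1 is evicted undecided
    fmt.Println(len(b.traces)) // 2
}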

Sample error observed at the loadbalancing exporter layer when the tail sampling layer hits the num_traces limit:

2023-05-12T14:20:47.050-0700    error   exporterhelper/queued_retry.go:367  Exporting failed. Try enabling retry_on_failure config option to retry on retryable errors  {"kind": "exporter", "data_type": "traces", "name": "loadbalancing", "error": "rpc error: code = Unavailable desc = connection error: desc = \"transport: Error while dialing: dial tcp 10.176.27.219:4317: connect: connection refused\""}
go.opentelemetry.io/collector/exporter/exporterhelper.(*retrySender).send
    go.opentelemetry.io/collector/exporter@v0.76.1/exporterhelper/queued_retry.go:367
go.opentelemetry.io/collector/exporter/exporterhelper.(*tracesExporterWithObservability).send
    go.opentelemetry.io/collector/exporter@v0.76.1/exporterhelper/traces.go:137
go.opentelemetry.io/collector/exporter/exporterhelper.(*queuedRetrySender).start.func1
    go.opentelemetry.io/collector/exporter@v0.76.1/exporterhelper/queued_retry.go:205
go.opentelemetry.io/collector/exporter/exporterhelper/internal.(*boundedMemoryQueue).StartConsumers.func1
    go.opentelemetry.io/collector/exporter@v0.76.1/exporterhelper/internal/bounded_memory_queue.go:58

I have tried increasing the num_traces limit further, but the tail sampling processor eventually catches up with the new limit and again creates back pressure on the loadbalancing exporter cluster, i.e., the LB exporter sees connection-refused errors from the tail sampling cluster. I have also verified the nodes in the tail sampling cluster; they are healthy.

I have attached a couple of screenshots showing that when the number of traces in memory hits the num_traces limit, there is a correlated increase in the traces-send-failed rate in the loadbalancing exporter.

We are currently ingesting ~8M spans/minute and ~2.5M traces/minute.

Initial tail sampling config in the tail sampling processing cluster:

tail_sampling:
  decision_wait: 60s
  num_traces: 20000000
  expected_new_traces_per_sec: 20000
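
For a rough sense of scale (assuming traces stay buffered until they are evicted or a decision is made): with decision_wait of 60s and ~2.5M new traces per minute, roughly 2.5M traces are awaiting a decision at any given moment, and a num_traces buffer of 20M corresponds to about 20M ÷ 2.5M/min ≈ 8 minutes of sustained ingest before the limit is reached.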

Memory limiter config in the tail sampling processing cluster:

memory_limiter:
  check_interval: 2s
  limit_mib: 50000
  spike_limit_mib: 10000
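
Dividing the same numbers another way (ignoring everything else the collector keeps in memory): a 50000 MiB limit spread across 20M buffered traces leaves only roughly 2.5 KiB per trace before the memory limiter starts refusing data.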

Load balancing exporter config in the LB cluster:

loadbalancing:
  protocol:
    otlp:
      timeout: 1s
      tls:
        insecure: true
      sending_queue:
        enabled: true
        num_consumers: 100
        queue_size: 2000000
      retry_on_failure:
        enabled: false
  resolver:
    dns:
      hostname: tail-sampling-dns-name
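
Note that retry_on_failure is disabled on the loadbalancing exporter, so, as the log line itself suggests, batches that fail with the connection-refused error above are not retried.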

Collector version

v0.76.1

Environment information

Environment

OS: Ubuntu 20.04

OpenTelemetry Collector configuration

extensions:
  health_check:
  zpages:
    endpoint: 0.0.0.0:55679
  # memory_ballast:
  #   size_mib: 200000

receivers:
  prometheus:
    config:
      scrape_configs:
        - job_name: 'otel-collector'
          scrape_interval: 10s
          static_configs:
            - targets: ['localhost:8888']
              labels:
                environment: qa
                host: ${host_name}
          metric_relabel_configs:
            - source_labels: [ __name__ ]
              regex: '.*grpc_io.*'
              action: drop
  otlp/tail_sampling:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317

processors:
  metricstransform/otelcol:
    transforms:
      - include: ^otelcol.*
        match_type: regexp
        action: update
        operations:
         - action: add_label
           new_label: az
           new_value: gcp_qa
         - action: add_label
           new_label: environment
           new_value: qa
  batch:
    # timeout: 1s
    # send_batch_size: 10000
    # send_batch_max_size: 11000
  attributes/dropunused_tags:
    actions:
      - key: host.arch
        action: delete
      - key: http.flavor
        action: delete
      - key: http.scheme
        action: delete
      - key: http.user_agent
        action: delete
      - key: net.peer.ip
        action: delete
      - key: net.peer.port
        action: delete
      - key: net.transport
        action: delete
      - key: os.description
        action: delete
      - key: os.type
        action: delete
      - key: process.command_line
        action: delete
      - key: process.executable.path
        action: delete
      - key: process.pid
        action: delete
      - key: process.runtime.description
        action: delete
      - key: process.runtime.name
        action: delete
      - key: process.runtime.version
        action: delete
      - key: signalfx.smartagent.version
        action: delete
      - key: splunk.distro.version
        action: delete
      - key: telemetry.sdk.language
        action: delete
      - key: telemetry.sdk.name
        action: delete
  attributes/newenvironment:
    actions:
      - key: environment
        value: "qa"
        action: insert
  memory_limiter:
    check_interval: 2s
    limit_mib: 50000
    spike_limit_mib: 10000
  tail_sampling:
    decision_wait: 60s
    num_traces: 20000000
    expected_new_traces_per_sec: 20000
    policies:
      [
        {
          name: sampling_policy_on_traces_with_span_http_errors,
          type: numeric_attribute,
          numeric_attribute: {key: http.status_code, min_value: 400, max_value: 599}
        },
        {
          name: sampling_policy_on_traces_with_span_string_http_errors,
          type: string_attribute,
          string_attribute: {key: http.status_code, values: [4*, 5*], enabled_regex_matching: true}
        },
        {
          name: sampling_policy_on_traces_with_span_string_otel_status_errors,
          type: and,
          and: {
            and_sub_policy:
            [
              {
                  name: sampling_policy_on_traces_with_span_string_otel_status_errors_1,
                  type: string_attribute,
                  string_attribute: {key: otel.status_code, values: [ERROR, error], enabled_regex_matching: true}
              },
              {
                name: sampling_policy_on_traces_with_span_string_server_otel_status_errors,
                type: string_attribute,
                string_attribute: {key: span.kind, values: [SERVER, server, SPAN_KIND_SERVER, consumer, CONSUMER, SPAN_KIND_CONSUMER], enabled_regex_matching: true}
              },
            ]
          }
        },
        {
          name: sampling_policy_on_traces_with_span_cal_status_errors,
          type: numeric_attribute,
          numeric_attribute: {key: status, min_value: 1, max_value: 200}
        },
        {
          name: sampling_policy_on_traces_with_span_string_cal_status_errors,
          type: string_attribute,
          string_attribute: {key: status, values: [1, 2, 3, 4, 5, 6], enabled_regex_matching: true}
        },
        {
          name: sampling_policy_on_traces_with_span_string_errors,
          type: and,
          and: {
            and_sub_policy:
            [
              {
                  name: sampling_policy_on_traces_with_span_string_errors_1,
                  type: string_attribute,
                  string_attribute: {key: error, values: ["true"], enabled_regex_matching: true}
              },
              {
                name: sampling_policy_on_traces_with_span_server_string_errors,
                type: string_attribute,
                string_attribute: {key: span.kind, values: [SERVER, server, SPAN_KIND_SERVER, consumer, CONSUMER, SPAN_KIND_CONSUMER], enabled_regex_matching: true}
              },
            ]
          }
        },
        {
          name: sampling_policy_on_traces_with_span_boolean_errors,
          type: and,
          and: {
            and_sub_policy:
            [
              {
                  name: sampling_policy_on_traces_with_span_boolean_errors_1,
                  type: boolean_attribute,
                  boolean_attribute: {key: error, value: true}
              },
              {
                name: sampling_policy_on_traces_with_span_server_boolean_errors,
                type: string_attribute,
                string_attribute: {key: span.kind, values: [SERVER, server, SPAN_KIND_SERVER, consumer, CONSUMER, SPAN_KIND_CONSUMER], enabled_regex_matching: true}
              },
            ]
          }
        }
      ]

exporters:
  jaeger:
    timeout: 12s
    endpoint: "endpoint"
    balancer_name: "round_robin"
    tls:
      insecure: true
    sending_queue:
      enabled: true
      num_consumers: 100
      queue_size: 10000000
    retry_on_failure:
      enabled: false
      initial_interval: 10s
      max_interval: 60s
      max_elapsed_time: 10m

  signalfx/tail_sampling:
    access_token: "token"
    ingest_url: "endpoint"
    api_url: "endpoint"
    translation_rules:
    - action: drop_dimensions
      metric_name: otelcol*
      dimension_pairs:
        service_instance_id:
        cloud.availability_zone:
        gcp_id:
        host.id:
        host.name:
        instance:
        cloud.account.id:
        cloud.platform:
        cloud.provider:
        host.type:
        http.scheme:
        job:
        service.instance.id:
        net.host.port:
        os.type:
        port:
        service_name:
        scheme:
        transport:

service:
  pipelines:
    traces:
      receivers: [otlp/tail_sampling]
      processors: [memory_limiter, batch, tail_sampling]
      exporters: [jaeger]
    metrics:
      receivers: [prometheus]
      processors: [memory_limiter, batch, metricstransform/otelcol]
      exporters: [signalfx/tail_sampling]
  extensions: [health_check, zpages]

Log output

2023-05-12T14:20:47.050-0700    error   exporterhelper/queued_retry.go:367  Exporting failed. Try enabling retry_on_failure config option to retry on retryable errors  {"kind": "exporter", "data_type": "traces", "name": "loadbalancing", "error": "rpc error: code = Unavailable desc = connection error: desc = \"transport: Error while dialing: dial tcp 10.176.27.219:4317: connect: connection refused\""}
go.opentelemetry.io/collector/exporter/exporterhelper.(*retrySender).send
    go.opentelemetry.io/collector/exporter@v0.76.1/exporterhelper/queued_retry.go:367
go.opentelemetry.io/collector/exporter/exporterhelper.(*tracesExporterWithObservability).send
    go.opentelemetry.io/collector/exporter@v0.76.1/exporterhelper/traces.go:137
go.opentelemetry.io/collector/exporter/exporterhelper.(*queuedRetrySender).start.func1
    go.opentelemetry.io/collector/exporter@v0.76.1/exporterhelper/queued_retry.go:205
go.opentelemetry.io/collector/exporter/exporterhelper/internal.(*boundedMemoryQueue).StartConsumers.func1
    go.opentelemetry.io/collector/exporter@v0.76.1/exporterhelper/internal/bounded_memory_queue.go:58

Additional context

No response

narasimharaojm commented 1 year ago
(Two screenshots attached, taken 2023-05-12 at 2:28 PM: traces held in memory reaching the num_traces limit, and the correlated increase in the loadbalancing exporter's send-failed rate.)
narasimharaojm commented 1 year ago

Observed a new runtime panic:

panic: runtime error: invalid memory address or nil pointer dereference [signal SIGSEGV: segmentation violation code=0x1 addr=0x0 pc=0x35a0519]

goroutine 241 [running]:
go.opentelemetry.io/collector/pdata/ptrace.ResourceSpans.Resource(...)
    go.opentelemetry.io/collector/pdata@v1.0.0-rcv0011/ptrace/generated_resourcespans.go:58
github.com/open-telemetry/opentelemetry-collector-contrib/processor/tailsamplingprocessor/internal/sampling.hasResourceOrSpanWithCondition({0x65f9b01?}, 0xc000b50a60, 0xc000b50a78?)
    github.com/open-telemetry/opentelemetry-collector-contrib/processor/tailsamplingprocessor@v0.76.3/internal/sampling/util.go:32 +0x59
github.com/open-telemetry/opentelemetry-collector-contrib/processor/tailsamplingprocessor/internal/sampling.(*stringAttributeFilter).Evaluate(0xc000b58750, {0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, ...}, ...)
    github.com/open-telemetry/opentelemetry-collector-contrib/processor/tailsamplingprocessor@v0.76.3/internal/sampling/string_tag_filter.go:135 +0x130
github.com/open-telemetry/opentelemetry-collector-contrib/processor/tailsamplingprocessor/internal/sampling.(*And).Evaluate(0xc000000002?, {0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, ...}, ...)
    github.com/open-telemetry/opentelemetry-collector-contrib/processor/tailsamplingprocessor@v0.76.3/internal/sampling/and.go:44 +0x6d
github.com/open-telemetry/opentelemetry-collector-contrib/processor/tailsamplingprocessor.(*tailSamplingSpanProcessor).makeDecision(0xc000e15860, {0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, ...}, ...)
    github.com/open-telemetry/opentelemetry-collector-contrib/processor/tailsamplingprocessor@v0.76.3/processor.go:230 +0x1c4
github.com/open-telemetry/opentelemetry-collector-contrib/processor/tailsamplingprocessor.(*tailSamplingSpanProcessor).samplingPolicyOnTick(0xc000e15860)
    github.com/open-telemetry/opentelemetry-collector-contrib/processor/tailsamplingprocessor@v0.76.3/processor.go:187 +0x1a9
github.com/open-telemetry/opentelemetry-collector-contrib/internal/coreinternal/timeutils.(*PolicyTicker).OnTick(...)
    github.com/open-telemetry/opentelemetry-collector-contrib/internal/coreinternal@v0.76.3/timeutils/ticker_helper.go:56
github.com/open-telemetry/opentelemetry-collector-contrib/internal/coreinternal/timeutils.(*PolicyTicker).Start.func1()
    github.com/open-telemetry/opentelemetry-collector-contrib/internal/coreinternal@v0.76.3/timeutils/ticker_helper.go:47 +0x2e
created by github.com/open-telemetry/opentelemetry-collector-contrib/internal/coreinternal/timeutils.(*PolicyTicker).Start
    github.com/open-telemetry/opentelemetry-collector-contrib/internal/coreinternal@v0.76.3/timeutils/ticker_helper.go:43 +0xb0

github-actions[bot] commented 1 year ago

Pinging code owners for processor/tailsampling: @jpkrohling. See Adding Labels via Comments if you do not have permissions to add labels yourself.

github-actions[bot] commented 1 year ago

This issue has been inactive for 60 days. It will be closed in 60 days if there is no activity. To ping code owners by adding a component label, see Adding Labels via Comments, or if you are unsure of which component this issue relates to, please ping @open-telemetry/collector-contrib-triagers. If this issue is still relevant, please ping the code owners or leave a comment explaining why it is still relevant. Otherwise, please close it.

Pinging code owners:

See Adding Labels via Comments if you do not have permissions to add labels yourself.

(github-actions[bot] posted the same inactivity notice again 10 months, 8 months, 5 months, 2 months, and 1 week ago.)