open-telemetry / opentelemetry-collector-contrib

Contrib repository for the OpenTelemetry Collector
https://opentelemetry.io
Apache License 2.0

[spanmetrics] spanmetrics processor throughput is around 300 #16231

Closed wisre closed 1 year ago

wisre commented 2 years ago

Component(s)

processor/spanmetrics

Describe the issue you're reporting

When I use the spanmetrics processor, I found that processor throughput dropped from 2000 to 800, which led to packets being dropped at the agent.

otel-col config

extensions:
  health_check:
    endpoint: "0.0.0.0:8777"
  pprof:
    endpoint: "0.0.0.0:8778"
  zpages:
    endpoint: "0.0.0.0:8779"

receivers:
  otlp:
    protocols:
      grpc:
        endpoint: ":5317"
      http:
        endpoint: ":5318"

  jaeger:
    protocols:
      grpc:
        endpoint: "0.0.0.0:16250"
        read_buffer_size: 524288
        max_concurrent_streams: 20
      thrift_binary:
        endpoint: "0.0.0.0:7832"
      thrift_compact:
        endpoint: "0.0.0.0:7831"
      thrift_http:
        endpoint: "0.0.0.0:16268"
  otlp/spanmetrics:
    protocols:
      grpc:
        endpoint: "localhost:12345"

processors:
  batch:
    send_batch_size: 1000
    timeout: 3s
  memory_limiter:
    check_interval: 5s
    limit_mib: 10240
    spike_limit_mib: 2048
  spanmetrics:
    metrics_exporter: prometheus
    latency_histogram_buckets: [100us, 1ms, 2ms, 6ms, 10ms, 100ms, 250ms]
    dimensions_cache_size: 200000
    aggregation_temporality: "AGGREGATION_TEMPORALITY_CUMULATIVE"

  tail_sampling:
    decision_wait: 10s
    policies:
      [
        {
           name: test-policy-1,
           type: latency,
           latency: {threshold_ms: 10}
        },
       {
           name: test-policy-2,
           type: probabilistic,
           probabilistic: {sampling_percentage: 100}
        },
      ]

exporters:
  prometheus:
    endpoint: "0.0.0.0:8889"

  jaeger:
    tls:
      insecure: true
    endpoint: "xxx:14250"
    balancer_name: "round_robin"
    timeout: 3s
    sending_queue:
      enabled: true
      num_consumers: 40
      queue_size: 10000
    retry_on_failure:
      enabled: false
      initial_interval: 10s
      max_interval: 60s
      max_elapsed_time: 10m

service:
  pipelines:
    traces:
      receivers: [otlp, jaeger]
      processors: [spanmetrics, batch]
      exporters: [jaeger]

    metrics:
      receivers: [otlp/spanmetrics]
      exporters: [prometheus]

  telemetry:
    metrics:
      address: "0.0.0.0:9889"

throughput trend

[image: throughput trend graph]

Reading the code, it looks like the spanmetrics consumer function may wait until the metrics computation for the trace is finished before returning?

[image: screenshot of the spanmetrics consumer code]
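
For reference, here is a minimal sketch of the in-band pattern described above. It is illustrative only, not the actual spanmetrics source; aggregate and buildMetrics are placeholder names.

// Illustrative sketch of the in-band (blocking) spanmetrics pattern.
// Not the actual processor code; aggregate/buildMetrics are placeholders.
package spanmetricssketch

import (
    "context"

    "go.opentelemetry.io/collector/consumer"
    "go.opentelemetry.io/collector/pdata/pmetric"
    "go.opentelemetry.io/collector/pdata/ptrace"
)

type inBandProcessor struct {
    next            consumer.Traces  // next traces consumer in the pipeline
    metricsExporter consumer.Metrics // exporter named by metrics_exporter
}

// ConsumeTraces aggregates metrics, exports them, and only then forwards the
// traces, so the traces pipeline is blocked for the whole duration.
func (p *inBandProcessor) ConsumeTraces(ctx context.Context, td ptrace.Traces) error {
    p.aggregate(td) // placeholder: update counters/histograms per dimension

    md := p.buildMetrics() // placeholder: snapshot aggregations into pmetric.Metrics
    if err := p.metricsExporter.ConsumeMetrics(ctx, md); err != nil {
        return err // a slow or failing metrics export also stalls the traces path
    }

    // Traces are only forwarded after the metrics work completes.
    return p.next.ConsumeTraces(ctx, td)
}

func (p *inBandProcessor) aggregate(td ptrace.Traces)    {}
func (p *inBandProcessor) buildMetrics() pmetric.Metrics { return pmetric.NewMetrics() }
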
github-actions[bot] commented 2 years ago

Pinging code owners:

See Adding Labels via Comments if you do not have permissions to add labels yourself.

github-actions[bot] commented 1 year ago

This issue has been inactive for 60 days. It will be closed in 60 days if there is no activity. To ping code owners by adding a component label, see Adding Labels via Comments, or if you are unsure of which component this issue relates to, please ping @open-telemetry/collector-contrib-triagers. If this issue is still relevant, please ping the code owners or leave a comment explaining why it is still relevant. Otherwise, please close it.

Pinging code owners:

See Adding Labels via Comments if you do not have permissions to add labels yourself.

fatsheep9146 commented 1 year ago

Does the problem still exist? @wisre

albertteoh commented 1 year ago

Reading the code, it looks like the spanmetrics consumer function may wait until the metrics computation for the trace is finished before returning?

That's correct, it will wait until both trace and metrics processing and exports are done before returning.

https://github.com/open-telemetry/opentelemetry-collector-contrib/pull/17307 will change how this works and should make it "faster" by emitting metrics out-of-band, while trace handling is still in-band.
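
For illustration, a rough sketch of the out-of-band approach, under the assumption that ConsumeTraces only aggregates and forwards traces while a background ticker flushes metrics; names and structure are placeholders, not the actual code from that PR.

// Illustrative sketch of an out-of-band (non-blocking) approach.
// Assumed shape only; not the actual change in the linked PR.
package spanmetricssketch

import (
    "context"
    "sync"
    "time"

    "go.opentelemetry.io/collector/consumer"
    "go.opentelemetry.io/collector/pdata/pmetric"
    "go.opentelemetry.io/collector/pdata/ptrace"
)

type outOfBandProcessor struct {
    next            consumer.Traces
    metricsExporter consumer.Metrics
    mu              sync.Mutex
    done            chan struct{}
}

// ConsumeTraces only updates in-memory aggregations and forwards the traces;
// it no longer waits for the metrics export.
func (p *outOfBandProcessor) ConsumeTraces(ctx context.Context, td ptrace.Traces) error {
    p.mu.Lock()
    p.aggregate(td) // placeholder: update counters/histograms
    p.mu.Unlock()
    return p.next.ConsumeTraces(ctx, td)
}

// Start launches a background goroutine that flushes metrics on a timer,
// decoupled from the traces path.
func (p *outOfBandProcessor) Start(flushInterval time.Duration) {
    p.done = make(chan struct{})
    go func() {
        ticker := time.NewTicker(flushInterval)
        defer ticker.Stop()
        for {
            select {
            case <-ticker.C:
                p.mu.Lock()
                md := p.snapshot() // placeholder: build pmetric.Metrics from aggregations
                p.mu.Unlock()
                _ = p.metricsExporter.ConsumeMetrics(context.Background(), md)
            case <-p.done:
                return
            }
        }
    }()
}

func (p *outOfBandProcessor) Shutdown() { close(p.done) }

func (p *outOfBandProcessor) aggregate(td ptrace.Traces) {}
func (p *outOfBandProcessor) snapshot() pmetric.Metrics  { return pmetric.NewMetrics() }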

kovrus commented 1 year ago

@albertteoh shall we close this issue now? @wisre do you experience the same behavior with the latest version of spanmetrics processor?

albertteoh commented 1 year ago

Yes, I expect this problem should be resolved by https://github.com/open-telemetry/opentelemetry-collector-contrib/pull/17307, but would be great for @wisre to confirm.

Either way, I don't have permission to close tickets.

github-actions[bot] commented 1 year ago

This issue has been inactive for 60 days. It will be closed in 60 days if there is no activity. To ping code owners by adding a component label, see Adding Labels via Comments, or if you are unsure of which component this issue relates to, please ping @open-telemetry/collector-contrib-triagers. If this issue is still relevant, please ping the code owners or leave a comment explaining why it is still relevant. Otherwise, please close it.

Pinging code owners:

See Adding Labels via Comments if you do not have permissions to add labels yourself.

fatsheep9146 commented 1 year ago

ping @wisre

github-actions[bot] commented 1 year ago

This issue has been inactive for 60 days. It will be closed in 60 days if there is no activity. To ping code owners by adding a component label, see Adding Labels via Comments, or if you are unsure of which component this issue relates to, please ping @open-telemetry/collector-contrib-triagers. If this issue is still relevant, please ping the code owners or leave a comment explaining why it is still relevant. Otherwise, please close it.

Pinging code owners:

See Adding Labels via Comments if you do not have permissions to add labels yourself.

github-actions[bot] commented 1 year ago

This issue has been closed as inactive because it has been stale for 120 days with no activity.