open-telemetry / opentelemetry-collector-contrib

Contrib repository for the OpenTelemetry Collector
https://opentelemetry.io
Apache License 2.0

Discrepancy between sending and receiving traffic in otel-agent and otel-collector #19760

Closed: mizhexiaoxiao closed this issue 6 months ago

mizhexiaoxiao commented 1 year ago

Component(s)

exporter/loadbalancing

Describe the issue you're reporting

I am currently experiencing a discrepancy between the amount of traffic being sent by otel-agent and the amount being received by otel-collector. For some reason, the receiving traffic in otel-collector is significantly lower than the sending traffic in otel-agent.

I have considered a few possible reasons for this, including sampling rates, configuration issues, and compression algorithms. However, I am having trouble pinpointing the exact cause of the problem.

Could someone please provide some guidance on how to troubleshoot this issue? Additionally, could you please let me know if there are any other factors that could be contributing to the discrepancy between the sending and receiving traffic?

otel-agent uses the loadbalancing exporter and otel-collector uses the OTLP receiver.

Any help would be greatly appreciated. Thanks!

github-actions[bot] commented 1 year ago

Pinging code owners:

See Adding Labels via Comments if you do not have permissions to add labels yourself.

srikanthccv commented 1 year ago

For some reason, the receiving traffic in otel-collector is significantly lower than the sending traffic in otel-agent.

How significant is this? Are you looking at the collector's internal metrics to arrive at this conclusion? Does the combined send rate of the exporting agents not match the combined receive rate of the receiving collectors?
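
For reference, the collector exposes its own span counters as internal telemetry, which is usually a more direct comparison than network-level counters. A minimal sketch of enabling the internal Prometheus endpoint on both tiers (the address below is an assumption, not taken from the reported setup):

service:
  telemetry:
    metrics:
      # expose the collector's own metrics for Prometheus to scrape
      address: 0.0.0.0:8888

With that in place, otelcol_exporter_sent_spans and otelcol_exporter_send_failed_spans on the agents can be compared against otelcol_receiver_accepted_spans and otelcol_receiver_refused_spans on the collectors.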

atoulme commented 1 year ago

How do you currently measure traffic?

mizhexiaoxiao commented 1 year ago

After looking into it, I couldn't find any specific metrics related to network traffic within the otel-collector. If I missed something, please feel free to correct me. However, I can confirm that the sending rate of otel-agent matches the receiving rate of otel-collector, with all agents sending about as many spans as the collectors receive.

The following shows the number of spans received by all agents and the number of spans sent by the collectors over the same time period:

[screenshot]

To measure the traffic, we've been using the Prometheus node-exporter's container_network_receive_packets_total and container_network_transmit_packets_total metrics. Based on these metrics, we've observed that the receiving traffic in otel-collector is significantly lower than the sending traffic in otel-agent.

[screenshots]

This is just a comparison between a single agent and a single collector; in fact, the total sending traffic of the agents is 40 times the receiving traffic of the collectors. @atoulme @srikanthccv
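
Compression is worth ruling in or out here: the OTLP exporter used by loadbalancing compresses payloads with gzip by default, while the Jaeger thrift traffic arriving at the agents is uncompressed, so raw packet counts at different points in the pipeline are not directly comparable. Disabling it on the loadbalancing exporter's OTLP protocol makes a like-for-like byte comparison possible (a sketch, not the configuration from this report):

exporters:
  loadbalancing:
    protocol:
      otlp:
        timeout: 1s
        # gzip is the default; "none" makes bytes and packets on the wire easier
        # to compare against the uncompressed inbound traffic
        compression: none
        tls:
          insecure: true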

atoulme commented 1 year ago

OK. Do you have a setup that would allow us to reproduce?

mizhexiaoxiao commented 1 year ago

@atoulme Yes, the configuration we are using is as follows

image: otel/opentelemetry-collector-contrib:0.73.0

otel agent config

receivers:
  jaeger:
    protocols:
      thrift_compact:
        endpoint: 0.0.0.0:6831
        queue_size: 5_000
        max_packet_size: 131_072
        workers: 50
        socket_buffer_size: 8_388_608
      thrift_binary:
        endpoint: 0.0.0.0:6832
        queue_size: 5_000
        max_packet_size: 131_072
        workers: 50
        socket_buffer_size: 8_388_608
  zipkin:
exporters:
  loadbalancing:
    protocol:
      otlp:
        timeout: 1s
        tls:
          insecure: true
    resolver:
      static:
        hostnames:
        - otel-collector-0.otel-collector.trace.svc.cluster.local:4317
        - otel-collector-1.otel-collector.trace.svc.cluster.local:4317
        - otel-collector-2.otel-collector.trace.svc.cluster.local:4317
        - otel-collector-3.otel-collector.trace.svc.cluster.local:4317
        - otel-collector-4.otel-collector.trace.svc.cluster.local:4317
processors:
  memory_limiter:
    check_interval: 1s
    limit_percentage: 90
    spike_limit_percentage: 80
extensions:
  zpages:
service:
  extensions: [zpages]
  pipelines:
    traces:
      receivers: [jaeger, zipkin]
      processors: [memory_limiter]
      exporters: [loadbalancing]

otel collector config

receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
processors:
  tail_sampling:
    decision_wait: 30s
    num_traces: 200000
    policies:
      [
        # some policies
      ]
  batch:
exporters:
  alibabacloud_logservice/sls-traces:
service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [tail_sampling, batch]
      exporters: [alibabacloud_logservice/sls-traces]
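
For comparison, a batch processor in front of the loadbalancing exporter (not part of the setup above, only a sketch) would group spans into fewer, larger export requests; the loadbalancing exporter still splits each batch by trace ID, so per-trace routing to a single collector is preserved:

processors:
  memory_limiter:
    check_interval: 1s
    limit_percentage: 90
    spike_limit_percentage: 80
  batch:
    # illustrative values; tune for the actual span volume
    timeout: 1s
    send_batch_size: 8192
service:
  pipelines:
    traces:
      receivers: [jaeger, zipkin]
      processors: [memory_limiter, batch]
      exporters: [loadbalancing]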

atoulme commented 1 year ago

OK, great. It looks like we have everything needed to reproduce. @jpkrohling, would you please take a look?

jpkrohling commented 1 year ago

Are you experiencing this only on the latest version of the collector? If this isn't a regression, I wouldn't block the release because of this.

mizhexiaoxiao commented 1 year ago

I have tried other versions as well, such as 0.63.0, and the issue persists. Is this normal behavior?

jpkrohling commented 1 year ago

I need to dig into this issue, but I would expect the number of received spans to equal the number of spans exported if sampling isn't being performed and if data is being sent to only one exporter at a time.
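
One place where spans can silently disappear between the two tiers is the exporter queue on the agent: if the sending queue overflows or retries are exhausted, spans are dropped before they ever reach a collector, and otelcol_exporter_enqueue_failed_spans / otelcol_exporter_send_failed_spans on the agent side will show it. The relevant knobs live under the loadbalancing exporter's OTLP protocol; the values below are illustrative assumptions, not a recommendation:

exporters:
  loadbalancing:
    protocol:
      otlp:
        timeout: 1s
        tls:
          insecure: true
        retry_on_failure:
          enabled: true
        sending_queue:
          enabled: true
          # illustrative: size the queue for the expected burst volume
          queue_size: 5000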

github-actions[bot] commented 1 year ago

This issue has been inactive for 60 days. It will be closed in 60 days if there is no activity. To ping code owners by adding a component label, see Adding Labels via Comments, or if you are unsure of which component this issue relates to, please ping @open-telemetry/collector-contrib-triagers. If this issue is still relevant, please ping the code owners or leave a comment explaining why it is still relevant. Otherwise, please close it.

Pinging code owners:

See Adding Labels via Comments if you do not have permissions to add labels yourself.
