open-telemetry / opentelemetry-collector-contrib

Contrib repository for the OpenTelemetry Collector
https://opentelemetry.io
Apache License 2.0

[Loadbalancingexporter] - Seeing a lot less traces received on Tier 2 collectors after moving to v0.94.0 #31274

Open aishyandapalli opened 7 months ago

aishyandapalli commented 7 months ago

Component(s)

No response

Describe the issue you're reporting

We have two collectors tier 1 and tier 2. Tier 1 collectors have a loadbalancingexporter configured for exporting spans to Tier 2 collectors.

We are monitoring the number of spans received on tier 1 and tier 2 collectors using the metric otelcol_receiver_accepted_spans

When we upgraded all our OpenTelemetry libraries to v0.94.0, I started seeing a huge mismatch between the number of spans accepted on tier 1 vs tier 2. It worked fine until v0.93.0. Can someone please help?
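
For context, a simplified sketch of this kind of tier 1 setup; the resolver type, service name, and port below are placeholders rather than our exact configuration:

receivers:
  otlp:
    protocols:
      grpc:

exporters:
  loadbalancing:
    routing_key: traceID
    protocol:
      otlp:
        tls:
          insecure: true
    resolver:
      dns:
        # placeholder headless service fronting the tier 2 collectors
        hostname: otel-collector-tier2-headless.observability.svc.cluster.local
        port: 4317

service:
  pipelines:
    traces:
      receivers: [otlp]
      exporters: [loadbalancing]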

github-actions[bot] commented 6 months ago

Pinging code owners for exporter/loadbalancing: @jpkrohling. See Adding Labels via Comments if you do not have permissions to add labels yourself.

crobert-1 commented 6 months ago

I think this is a result of https://github.com/open-telemetry/opentelemetry-collector-contrib/pull/30141, which aggregates exported data to make it more efficient. This likely reduced the output numbers of the metric you're referencing, meaning there's no data actually being dropped or missed.

Are you missing any data you expected? Is there any reason for concern, in addition to the otelcol_receiver_accepted_spans metric being lower? Just want to make sure data isn't actually being dropped here.

jpkrohling commented 6 months ago

The number of spans should have remained stable; batching should reduce the number of requests, but not the amount of data.

Would you be able to share the relevant parts of your config? Bonus points if you can share a telemetrygen command that exemplifies the problem.
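
For example, something along these lines pointed at a tier 1 collector would help (the endpoint, rate, and duration are illustrative, and flag names may vary slightly between telemetrygen versions):

telemetrygen traces --otlp-endpoint localhost:4317 --otlp-insecure --workers 4 --rate 1000 --duration 60s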

edwintye commented 5 months ago

Just adding some more info here for others who may have found this issue; it may help people investigate while we are also trying to figure things out internally.

We are seeing the same pattern, and only in certain clusters with heavy traffic, in the region of 5 million spans per second. We also suspect it is the change from 0.93.0 to 0.94.0 in the named PR, because there are only two metrics where we can spot an obvious difference: otelcol_loadbalancer_num_resolutions and otelcol_loadbalancer_backend_outcome. Attaching the 7-day graph, where the switch to 0.94.0 was done just before our normal peak traffic, which starts just after 00:00-ish. From then on we see basically a flat line for otelcol_loadbalancer_backend_outcome, while at the same time otelcol_loadbalancer_num_resolutions tracks the number of pods we have at layer-2 almost perfectly. There is also very evident (CPU) throttling happening while it is not shipping, and there appear to be "hotspots" across pods, i.e. some pods were doing fine while others had high CPU/memory usage while not shipping.

(image: load_balancing_outcome)

Note that we have autoscaling in both layer-1 and layer-2, hence the massive fluctuations in the number of backends; we have double-checked that it correctly matches the number of pods in layer-2. (Yes, we have that many pods in layer-2.)

(image: load_balancing_num_backend)

One thing we have not tried, at the time of writing, is switching the resolver from k8s to dns, pointed at the headless service of layer-2. It is not obvious that this is the cause, but the limited profiling we managed to do while trying to restore service did show heavy memory usage in the k8s API machinery. Unfortunately, I am not too familiar with how the different resolvers behave and when they are triggered to fetch a new list of endpoints.
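
For illustration, the switch we have in mind would look roughly like the following; the service names are placeholders, not our actual configuration:

exporters:
  loadbalancing:
    resolver:
      # current setup: the k8s resolver watching the layer-2 service
      # k8s:
      #   service: otel-collector-layer2.observability
      # candidate alternative: the dns resolver against the layer-2 headless service
      dns:
        hostname: otel-collector-layer2-headless.observability.svc.cluster.local
        port: 4317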

(image: memory_profile_93_to_93)

Also worth mentioning is that we are not seeing the same behaviour in other clusters that have a stable throughput of just under 1 million spans per second. Furthermore, we have in fact upgraded all the way to version 0.97.0 in other clusters that do not go through heavy burst cycles, and have found an improvement in (total) CPU usage of between 10% and 20%.

I will see if I can replicate this with telemetrygen locally when I get some time, but that may not be any time soon, or even worthwhile if having a static list of targets does not trigger this behaviour.

Edit: Happy to take any guidance from the maintainers on what we can look out for, or which settings to adjust, the next time we try an upgrade :)

edwintye commented 4 months ago

After quite a bit of experimentation and trying to understand the new behaviour, we have successfully upgraded to 0.97.0 everywhere. It turns out the biggest problem was the batch size in the batch processor in the pipeline just in front of the load balancing exporter. We have our batch processor as the last step of the processors pipeline. The default settings (the commented-out part)

processors:
  batch:
    # the two settings below are the defaults
    # send_batch_max_size: 0
    # send_batch_size: 8192

can lead to a scenario where a batch becomes very large under heavy load, since send_batch_max_size: 0 means the batch size is unbounded. Lowering both send_batch_size and send_batch_max_size has allowed us to return to acceptable CPU/memory usage and brought the number of exports back to normal. We have yet to find a "perfect" max batch size; our current max size is in the region of 4000 and the collector is working fine.
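
For reference, the bounded settings we ended up with look roughly like this; the exact numbers are workload-specific, so treat them as a starting point rather than a recommendation:

processors:
  batch:
    # cap both the trigger size and the hard upper bound so a single
    # batch can no longer grow without limit under heavy load
    send_batch_size: 4000
    send_batch_max_size: 4000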

I am not familiar with the codebase, but from my limited understanding after looking at the code, it feels like the aforementioned PR changed the interaction from:

  1. Original batch -> ship in a FIFO manner.

to

  1. Original batch -> Allocate and split into new batches according to the number of endpoints -> Ship new batches.

This change means that we would now need 2x the previous memory (in the worst case), and it also introduces the possibility that spans are stuck in the exporter for much longer than desired while it tries to re-batch and fails to handle the continuous pressure from the receiver. I am also not sure whether a change in the backends (a layer-2 pod terminating) during this re-batch process will cause additional issues.

During our debugging, we also discovered a potentially missing tag in the metrics. Namely, https://github.com/open-telemetry/opentelemetry-collector-contrib/blob/27840d58551a1887833e26a1af279aa667cf46ca/exporter/loadbalancingexporter/trace_exporter.go#L122-L123

suggests that success should be available, but it is not defined in the histogram https://github.com/open-telemetry/opentelemetry-collector-contrib/blob/27840d58551a1887833e26a1af279aa667cf46ca/exporter/loadbalancingexporter/metrics.go#L53-L61 while it is in the corresponding count https://github.com/open-telemetry/opentelemetry-collector-contrib/blob/27840d58551a1887833e26a1af279aa667cf46ca/exporter/loadbalancingexporter/metrics.go#L62-L71. To be honest, though, these metrics didn't provide much insight on this particular occasion.

MasterExecuson commented 4 months ago

Hey @edwintye, we are observing the same issue as you. I also wanted to perform the collector profiling; how did you manage to get that working? 🤔

jpkrohling commented 4 months ago

I have a feeling that this is fixed in v0.101.0, as part of another issue reporting a memory issue with the exporter. Once that version is out (should be out tomorrow, 21 May 2024), could you give it a try and report back?

gsaxena-au commented 3 months ago

Hey @edwintye, side question: how did you profile the OTel collector? I see you are sending the profile data to Grafana?

edwintye commented 2 months ago

Sorry, I didn't see the pings earlier. We run Pyroscope and do a "pull" via Grafana Agent/Alloy. The general setup is to add the pprof extension to the collector

extensions:
  pprof:
    endpoint: ${POD_IP}:6060
    block_profile_fraction: 0
    mutex_profile_fraction: 0
service:
  extensions: [pprof]

and then the pull config

pyroscope.scrape "otel_collector" {
  targets    = <YOUR_OTEL_COLLECTOR_ADDRESS> # we get this from a k8s discovery
  forward_to = [pyroscope.write.agent.receiver] # needs to be defined
  profiling_config {
    profile.goroutine {
      enabled = true
      path = "/debug/pprof/goroutine"
      delta = false
    }
    profile.process_cpu {
      enabled = true
      path = "/debug/pprof/profile"
      delta = true
    }
    profile.godeltaprof_memory {
      enabled = false
      path = "/debug/pprof/delta_heap"
    }
    profile.memory {
      enabled = true
      path = "/debug/pprof/heap"
      delta = false
    }
    profile.godeltaprof_mutex {
      enabled = false
      path = "/debug/pprof/delta_mutex"
    }
    profile.mutex {
      enabled = false
      path = "/debug/pprof/mutex"
      delta = false
    }
    profile.godeltaprof_block {
      enabled = false
      path = "/debug/pprof/delta_block"
    }
    profile.block {
      enabled = false
      path = "/debug/pprof/block"
      delta = false
    }
  }
}

which ships to the backend. Then we do various analyses of CPU/memory/object usage and adjust our settings to fit the shape of our workload.

We have since upgraded to v0.103.0, which contains the change to merging traces. So far, scoped to layer-1 and under significant load (a million spans per second), we are looking at cutting usage by 60% for both CPU and memory. The share of objects (from our live profiles during an upgrade) under the batch processor changed from about 80% to around 20%. Furthermore, the fraction of time spent in mergeTraces went from about 30% to essentially 0.

Note that our improvements here may not be representative, since we have made a fair number of adjustments to get ourselves past the initial upgrade of the load balancing exporter. From the load balancing exporter's perspective, our observation so far is that there has only been benefit from upgrading to and past v0.101.0, as jpkrohling suggested.

github-actions[bot] commented 2 weeks ago

This issue has been inactive for 60 days. It will be closed in 60 days if there is no activity. To ping code owners by adding a component label, see Adding Labels via Comments, or if you are unsure of which component this issue relates to, please ping @open-telemetry/collector-contrib-triagers. If this issue is still relevant, please ping the code owners or leave a comment explaining why it is still relevant. Otherwise, please close it.

Pinging code owners:

See Adding Labels via Comments if you do not have permissions to add labels yourself.