open-telemetry / opentelemetry-collector

OpenTelemetry Collector
https://opentelemetry.io
Apache License 2.0

Increased request latency discovered in 0.106.1 (> 0.105.0) #10858

Open arielvalentin opened 3 months ago

arielvalentin commented 3 months ago

Describe the bug

This issue was originally reported in CNCF Slack by Srinivas Bevara and confirmed by @codeboten, who asked me to open this issue.

After upgrading to v0.106.1 from v0.104.0, we have run into issues where we see an increase in request latency reported by both the collector's internal telemetry and our ingress proxies.

Memory remains unaffected; however, pprof reports that garbage collection appears to have increased due to internal-telemetry tracing-related allocations:

[pprof screenshots: increased GC activity and internal-telemetry tracing allocations]

This is unexpected since tracing telemetry is disabled in our deployments.

Steps to reproduce

Run a load test comparing v0.104.0 and v0.106.1

What did you expect to see?

What did you see instead?

What version did you use?

v0.106.1

What config did you use?

Our configuration has many transformations and connector pipelines. The bare minimum you would likely need to reproduce this is the configuration you use for benchmarking.

Environment

Ubuntu 20.04.1

Additional context

Screenshots of increased p99 latency and CPU alongside steady memory:

[dashboard screenshots: p99 latency, CPU, and memory]

codeboten commented 3 months ago

After some discussion amongst maintainers, we want to provide a quick fix for this issue via the feature gate service.noopTracerProvider, and further investigate the cause of this problem before deciding whether there should be a permanent configuration option to set a no-op tracer provider.

Returning a noop tracer provider if no tracing is configured in the telemetry settings would break the current behaviour, which the zpages extension relies on to provide telemetry about traces.
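
For anyone following along, here is a rough sketch of what a gate like this implies for the internal telemetry wiring. This is illustrative only, not the actual service code: the helper name newTracerProvider is made up, and the real collector keeps the SDK provider by default precisely so zpages keeps working. The gate is enabled on the command line with --feature-gates service.noopTracerProvider.

```go
package main

import (
	"fmt"

	"go.opentelemetry.io/collector/featuregate"
	sdktrace "go.opentelemetry.io/otel/sdk/trace"
	"go.opentelemetry.io/otel/trace"
	"go.opentelemetry.io/otel/trace/noop"
)

// Illustrative registration only; the real gate is registered inside the
// collector's service package and toggled via --feature-gates.
var noopTracerGate = featuregate.GlobalRegistry().MustRegister(
	"service.noopTracerProvider",
	featuregate.StageAlpha,
	featuregate.WithRegisterDescription("Use a no-op tracer provider for internal telemetry"),
)

// newTracerProvider is a hypothetical helper: with the gate enabled it hands
// back a no-op provider (no span bookkeeping, so nothing for zpages to show),
// otherwise a real SDK provider as before.
func newTracerProvider() trace.TracerProvider {
	if noopTracerGate.IsEnabled() {
		return noop.NewTracerProvider()
	}
	return sdktrace.NewTracerProvider()
}

func main() {
	fmt.Printf("gate enabled: %v, provider type: %T\n",
		noopTracerGate.IsEnabled(), newTracerProvider())
}
```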

arielvalentin commented 3 months ago

Was there a recent change in zpages that results in an increase in the number of spans being allocated?

bvsvas commented 3 months ago

Thanks, @arielvalentin, for taking this up and raising a GitHub issue.

arielvalentin commented 3 months ago

@codeboten ~did this feature flag make it into v0.107.0?~

Yes, it did: https://github.com/open-telemetry/opentelemetry-collector/releases/tag/v0.107.0

arielvalentin commented 2 months ago

I recently upgraded to v0.107.0 from v0.104.0. During off-peak times performance is about on par, but during our peak times our p99 skyrockets and memory spikes, forcing the memory limiter to trigger hard limits and respond with 503s.

I have enabled the feature gate to ensure we are using the noop-tracer but there still seems to be an increase in memory allocations and latency.

Our heavy usage of OTTL has not changed much between versions, and as far as I can tell it is what appears most prominently in the profiles, along with a high number of context objects being allocated by the noop-tracer.

profiles-20240823.tar.gz

codeboten commented 2 months ago

Re-opening to continue the investigation

TylerHelmuth commented 2 months ago

@arielvalentin Did the OTTL performance degrade, or are you saying it remained the majority?

arielvalentin commented 2 months ago

The profiles from 104 showed that OTTL statement execution was where most of the time and allocations went, but our performance was pretty good: ~15K spans per second per collector, with a p99 of 19ms.

Other than the increased allocations of context objects related to the nooptracer, the profiles between 104 and 107 look about the same to me, so I don't have a reason to believe there is a regression in OTTL performance.

I'll see if I can dig up my 104 profiles and attach them as well.

pavolloffay commented 2 months ago

From @arielvalentin

I have enabled the feature gate to ensure we are using the noop-tracer but there still seems to be an increase in memory allocations and latency.

Other than the increased allocations of context objects related to the nooptracer, the profiles between 104 and 107 look about the same to me, so I don't have a reason to believe there is a regression in OTTL performance.

So can service.noopTracerProvider be considered a workaround?

iblancasa commented 2 months ago

So can service.noopTracerProvider be considered a workaround?

I would say "no" because

there still seems to be an increase in memory allocations and latency.

codeboten commented 2 months ago

Investigating this further, this might be related to https://github.com/open-telemetry/opentelemetry-collector-contrib/pull/33508

codeboten commented 2 months ago

I just ran load tests for v0.107.0 and here's what I have when I enabled the noop tracer provider feature gate:

Before using --feature-gates service.noopTracerProvider

[screenshot: load test results before enabling the gate]

With --feature-gates service.noopTracerProvider

[screenshot: load test results with the gate enabled]

This seems different from what you're seeing, @arielvalentin. Can you confirm that the feature gate is enabled?

arielvalentin commented 2 months ago

Yes. We currently run it with --feature-gates service.noopTracerProvider

You can see in the profiles I attached that the noopTracer is used:

[profile screenshot showing noopTracer frames]

arielvalentin commented 2 months ago

Here are the profiles for 104: profiles-104.tar.gz

codeboten commented 2 months ago

Thanks @arielvalentin, I missed that when I looked at your attached profiles a couple of days ago. Good to know that using the no-op tracer provider at least reduces the overall memory consumption, even if it's still higher than before the change that added tracing to OTTL.

arielvalentin commented 2 months ago

No worries @codeboten

Something I may not have mentioned already: my colleague @antonio had observed the latency regression as early as 105.

We may downgrade to 105 and collect profiles there as well.

antonio commented 2 months ago

Something I may not have mentioned already: my colleague @antonio had observed the latency regression as early as 105.

That's correct. Notice how request duration went up significantly in the graph below when I tested 105 back in July.

[graph: request duration increase while testing 105]

Investigating this further, this might be related to https://github.com/open-telemetry/opentelemetry-collector-contrib/pull/33508

The time when this PR was merged does indeed align with when we started seeing the issue.

BinaryFissionGames commented 2 months ago

Just adding what I've found here real quick:

I've done a quick benchmark to see the difference between an actual tracer implementation, a noop tracer, and all tracing removed from OTTL, via the transform processor. You can see the benchmark here. All it does is run the transform processor with a batch of 100 logs and a simple set operation (which seems like a pretty normal workload).

These are the results:

goos: darwin
goarch: arm64
pkg: github.com/open-telemetry/opentelemetry-collector-contrib/processor/transformprocessor
cpu: Apple M3 Pro
                 │ no-tracing.txt │           noop-tracer.txt            │           test-tracer.txt            │
                 │     sec/op     │    sec/op     vs base                │    sec/op     vs base                │
SetLogRecords-12      50.29µ ± 1%   120.55µ ± 4%  +139.70% (p=0.002 n=6)   426.21µ ± 5%  +747.43% (p=0.002 n=6)

                 │ no-tracing.txt │            noop-tracer.txt            │            test-tracer.txt            │
                 │      B/op      │     B/op       vs base                │     B/op       vs base                │
SetLogRecords-12     96.21Ki ± 0%   227.45Ki ± 0%  +136.42% (p=0.002 n=6)   758.38Ki ± 1%  +688.28% (p=0.002 n=6)

                 │ no-tracing.txt │           noop-tracer.txt           │           test-tracer.txt           │
                 │   allocs/op    │  allocs/op   vs base                │  allocs/op   vs base                │
SetLogRecords-12      1.402k ± 0%   3.802k ± 0%  +171.18% (p=0.002 n=6)   9.802k ± 0%  +599.14% (p=0.002 n=6)

You can see that even the noop tracer inflates these benchmarks by quite a bit. And it lines up with what I'm seeing in both the issue I posted and here: the overhead of tracing each individual evaluation of the StatementSequence seems like overkill.

Of course, if you do more processing in a single ottl statement, you'd probably get numbers that look a little less striking, so take the exact numbers here with a grain of salt. I don't think the scenario is all too unlikely, though.

It feels like there should not be tracing in StatementSequence at all, but maybe a few metric counters instead. The overhead of the tracing just doesn't seem worth it.
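
To make that concrete outside the transform processor, here is a minimal standalone benchmark sketch; the span name and loop shape are illustrative only, not the OTTL internals. Even the no-op tracer's Start returns a fresh context value per call, which matches the context allocations showing up in the profiles above.

```go
package tracingbench

import (
	"context"
	"testing"

	"go.opentelemetry.io/otel/trace/noop"
)

// BenchmarkNoopSpanPerRecord approximates the pattern under discussion:
// one no-op span started per statement evaluation across a batch of 100
// records. Names here are illustrative.
func BenchmarkNoopSpanPerRecord(b *testing.B) {
	tracer := noop.NewTracerProvider().Tracer("ottl")
	ctx := context.Background()
	b.ReportAllocs()
	for i := 0; i < b.N; i++ {
		for record := 0; record < 100; record++ {
			// Even a no-op Start wraps ctx in a new context value
			// and returns a span that must be ended.
			_, span := tracer.Start(ctx, "statement.execute")
			span.End()
		}
	}
}
```

Dropped into a `_test.go` file and run with `go test -bench BenchmarkNoopSpanPerRecord -benchmem`, it shows the allocs/op contributed by the spans alone.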

arielvalentin commented 2 months ago

I have got some more for you here, from v0.107.4: profiles-20240828.tar.gz

cc: @songy23

TylerHelmuth commented 2 months ago

I am going to submit a PR to revert the ottl tracing stuff.

iblancasa commented 2 months ago

I am going to submit a PR to revert the ottl tracing stuff.

How about placing it behind a feature gate?

arielvalentin commented 2 months ago

Thanks again for investigating. I am on the path to downgrade to 0.104.0 and will be eagerly awaiting the next release.

Once it's out, I will deploy to a subset of our canary instances and report back profiles and latency info.

arielvalentin commented 2 months ago

@bvsvas have you all had a chance to generate profiles?

IIRC you all were not using OTTL, so the latency you reported would not be related to the transform processors, right?

This is the snippet you posted in Slack:

service:
  extensions: [apmauth, pprof]
  pipelines:
    traces:
      receivers: [otlp/withauth]
      processors: [probabilistic_sampler]
      exporters: [customexporter]
    metrics:
      receivers: [otlp/withauth, prometheus]
      processors: [cumulativetodelta, resource]
      exporters: [customexporter]

arielvalentin commented 2 months ago

👋🏼 Is there a release schedule/project board handy I can subscribe to?

codeboten commented 2 months ago

The release schedule is documented here: https://github.com/open-telemetry/opentelemetry-collector/blob/main/docs/release.md#release-schedule