open-telemetry / opentelemetry-collector

OpenTelemetry Collector
https://opentelemetry.io
Apache License 2.0

otelcol can't get cpu limit and may cause performance issues. #4459

Open awx-fuyuanchu opened 2 years ago

awx-fuyuanchu commented 2 years ago

Describe the bug

2021-11-19T03:39:27.622Z info service/collector.go:230 Starting otelcontribcol... {"Version": "v0.37.1", "NumCPU": 4}

Steps to reproduce

Limit the CPU resource to 1 with K8S node has 4 CPUs.

What did you expect to see?

otelcol could get the CPU limit.

What did you see instead?

otelcol get the CPU number of the node.

What version did you use? Version: v0.37.1

What config did you use?

Environment

GKE v1.19.14-gke.1900

Additional context

So far, the CPU count is only used in a limited set of features, such as the batch processor and the converter. It could become a risk in the future.

bogdandrutu commented 2 years ago

That is just a log message. The CPU limitation comes from k8s itself, which throttles the process when it gets to the limit. We don't do anything with "NumCPU" except print it. If you believe it is confusing, we can remove that message.

morigs commented 2 years ago

It's not a problem, of course, but it is actually used here

bogdandrutu commented 2 years ago

@morigs interesting. We could have a long debate here: usually the operation executed after batching is an I/O op (not that CPU intensive). If we limit that to 1 core (in your example), we will probably never be able to hit even 0.7 cores.

So not sure what is the best in this case.

morigs commented 2 years ago

@bogdandrutu In the case of the batch processor it's not a problem, it's just a channel size. The real issue is the number of threads. Processors as well as exporters can perform CPU-intensive tasks (complex sampling, serialization, etc.), so they will try to utilize as many cores as possible. This will lead to throttling (which is a bad thing). IMO there are two solutions:

  1. Document how to use otel-collector in K8S (setting correct limits and GOMAXPROCS). And probably fix (if not already) this issue in otel operator.
  2. Use something like this
Serpent6877 commented 1 year ago

I am curious about the same potential issue. We use opentelemetry collector as a sidecar on GKE. We use 0.48.0 and I see this in the logs

service/collector.go:252 Starting otelcol-contrib... {"Version": "0.48.0", "NumCPU": 16}

which is the number of virtual CPUs. We allocate 1 to 4 CPUs depending on the deployment. So for the 1-CPU pods, are we potentially having issues? We handle pretty high volumes of traffic.

gebn commented 1 year ago

Prometheus has a currently-experimental --enable-feature auto-gomaxprocs flag which triggers uber-go/automaxprocs and has worked really well for us.
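For context, automaxprocs derives GOMAXPROCS from the container's CPU quota, flooring the fractional value with a minimum of 1. A minimal sketch of that policy (the function name is illustrative, not the library's actual API):

```go
package main

import "fmt"

// gomaxprocsFloor sketches the automaxprocs default policy:
// floor the fractional CPU quota, but never go below 1.
// Illustrative only, not code from the library.
func gomaxprocsFloor(cpuQuota float64) int {
	n := int(cpuQuota) // truncation equals floor for positive values
	if n < 1 {
		return 1
	}
	return n
}

func main() {
	for _, q := range []float64{0.5, 1.0, 1.5, 4.0} {
		fmt.Printf("quota %.1f -> GOMAXPROCS %d\n", q, gomaxprocsFloor(q))
	}
}
```

With a 1-core limit on a 16-core node, this would set GOMAXPROCS to 1 instead of the default 16, avoiding the throttling described above.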

jpkrohling commented 1 year ago

Document how to use otel-collector in K8S (setting correct limits and GOMAXPROCS). And probably fix (if not already) this issue in otel operator.

I'm in favor of giving this a try. @open-telemetry/operator-approvers , what do you think?

morigs commented 1 year ago

Should this be implemented as a core feature enabled by default? Or as an extension?

pavolloffay commented 1 year ago

I'm in favor of giving this a try. https://github.com/orgs/open-telemetry/teams/operator-approvers , what do you think?

Agree on improving this. What changes are proposed for the operator? Should the operator set GOMAXPROCS?

jpkrohling commented 1 year ago

Should the operator set GOMAXPROCS?

Yes, I think it would be a good start.

frzifus commented 1 year ago

How would a cpu limit of 0.9 or 1.1 then be reflected? Is it just GOMAXPROCS=1?

jpkrohling commented 1 year ago

I would round it up: 0.9 becomes 1, 1.1 becomes 2.
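The round-up policy described above can be sketched in Go (an illustrative function, not an agreed implementation):

```go
package main

import (
	"fmt"
	"math"
)

// gomaxprocsCeil rounds a fractional CPU limit up to the next
// whole core, as proposed above: 0.9 becomes 1, 1.1 becomes 2.
func gomaxprocsCeil(cpuLimit float64) int {
	n := int(math.Ceil(cpuLimit))
	if n < 1 {
		return 1
	}
	return n
}

func main() {
	fmt.Println(gomaxprocsCeil(0.9), gomaxprocsCeil(1.1)) // 1 2
}
```

Note this differs from the automaxprocs default, which floors: for a 1.1-core limit, rounding up gives 2 threads (risking some throttling), while flooring gives 1 (leaving 0.1 cores unused).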

edwintye commented 1 year ago

I accidentally stumbled onto the same problem, as we were experiencing a lot of throttling during a spike of traffic. It seems to me that there already exists a mechanism to set GOMAXPROCS by adding

env:
  - name: GOMAXPROCS
    valueFrom:
      resourceFieldRef:
        containerName: otc-container
        resource: limits.cpu

to either the CR for the operator or directly into the deployment. Is there a scenario where using the native k8s round-up mechanism is not as good?

max-frank commented 1 month ago

Is there a scenario where using the roundup mechanism of native k8s is not as good?

Any container environment other than k8s that does not easily support a mechanism like resourceFieldRef, e.g., GCP Cloud Run.
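In environments without resourceFieldRef, one portable option is to read the cgroup v2 cpu.max file (whose contents are "&lt;quota&gt; &lt;period&gt;" in microseconds, or "max &lt;period&gt;" when unlimited) at startup and derive GOMAXPROCS from it. A hedged sketch under that assumption; the function is illustrative, and real code would also need to handle cgroup v1:

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// parseCPUMax parses the contents of /sys/fs/cgroup/cpu.max and
// returns the CPU limit in cores, or 0 if no limit is set.
// Illustrative sketch only; cgroup v1 uses different files.
func parseCPUMax(content string) (float64, error) {
	fields := strings.Fields(content)
	if len(fields) != 2 {
		return 0, fmt.Errorf("unexpected cpu.max format: %q", content)
	}
	if fields[0] == "max" {
		return 0, nil // no quota configured
	}
	quota, err := strconv.ParseFloat(fields[0], 64)
	if err != nil {
		return 0, err
	}
	period, err := strconv.ParseFloat(fields[1], 64)
	if err != nil {
		return 0, err
	}
	return quota / period, nil
}

func main() {
	limit, _ := parseCPUMax("150000 100000")
	fmt.Println(limit) // 1.5
}
```

This is essentially what uber-go/automaxprocs does internally, which is why it works in any Linux container runtime, not just k8s.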