open-telemetry / opentelemetry-helm-charts

OpenTelemetry Helm Charts
https://opentelemetry.io
Apache License 2.0

Add an option to turn off pod webhook #1115

Open pureklkl opened 5 months ago

pureklkl commented 5 months ago

The admission webhook that monitors pod creation can place a performance burden on the k8s API server. Until that performance problem is resolved, please add an option to turn the webhook off.

The webhook in question: https://github.com/open-telemetry/opentelemetry-helm-charts/blob/opentelemetry-operator-0.53.0/charts/opentelemetry-operator/templates/admission-webhooks/operator-webhook.yaml#L128

When a significant number of pods are created at the same time, the load brings down the OpenTelemetry operator and then the k8s API server, which killed our entire cluster. Even with instrumentation turned off, the webhook still imposes an unnecessary cost on the k8s API server. Unlike the webhooks for the instrumentation and OpenTelemetry collector resources, the performance cost of the pod webhook scales with cluster size.

The only options that provide similar functionality are the namespace and object selectors, but they are not as ideal as simply removing this webhook when it is not used.

swiatekm commented 5 months ago

Could you go into some more detail about the problem you ran into? How many Pods were you creating, how many resources was the operator using, and what happened to the API Server as a result?

jcdauchy commented 5 months ago

A newbie question: what is the purpose of the webhook? And what would be the potential problem with disabling it? Thanks in advance.

swiatekm commented 5 months ago

The purpose of the webhook is to make changes to Pods before they're created. For the operator this involves injecting autoinstrumentation or sidecar Otel Collector containers. As Pods are immutable, this can only happen when they're created. The API Server calls the webhook on every Pod creation, and the operator gets to decide if it wants to make any changes to the Pod manifest.

Webhooks can also be used for other things, like doing validation, but for Pods specifically, the operator only does what I outlined above.

It's not invalid to want to disable them, but this is fairly advanced customization of the operator, and can get you in trouble if you don't fully understand what you're doing. As a proper solution to @pureklkl's problem, I'd prefer to optimize the webhook so it can easily cope with large bursts of requests.
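To make the mutating-webhook flow described above concrete, here is a minimal Python sketch of what such a handler does with one Pod creation request. It is not the operator's actual code (the operator is written in Go). The annotation key, container name, and image are hypothetical placeholders chosen for illustration. Only the `AdmissionReview` request/response shape (`uid`, `allowed`, `patchType`, base64-encoded `patch`) follows the Kubernetes admission API.

```python
import base64
import json

# Hypothetical annotation key, used only for this illustration.
INJECT_ANNOTATION = "example.io/inject-sidecar"


def review_pod(admission_review: dict) -> dict:
    """Build an AdmissionReview response for one Pod CREATE request.

    The API server sends one such request per Pod creation; the webhook
    either allows the Pod unchanged or returns a JSON Patch to apply.
    """
    request = admission_review["request"]
    pod = request["object"]
    annotations = pod.get("metadata", {}).get("annotations") or {}

    # The response must echo the request uid; the Pod is always allowed.
    response = {"uid": request["uid"], "allowed": True}

    if annotations.get(INJECT_ANNOTATION) == "true":
        # Hypothetical sidecar definition appended to the container list.
        sidecar = {"name": "otel-sidecar", "image": "example.invalid/collector:latest"}
        patch = [{"op": "add", "path": "/spec/containers/-", "value": sidecar}]
        response["patchType"] = "JSONPatch"
        response["patch"] = base64.b64encode(json.dumps(patch).encode()).decode()

    return {
        "apiVersion": "admission.k8s.io/v1",
        "kind": "AdmissionReview",
        "response": response,
    }
```

Note that even a Pod without the annotation still costs one round trip from the API server to the webhook, which is why the load grows with the Pod creation rate regardless of whether anything is injected.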

vivekkumarchaurasia123 commented 1 month ago

I encountered a similar issue with the operator running in my cluster, which caused delays in creating new resources. Upon investigation, I traced the problem back to the operator. Deleting the operator pod resolved the issue, restoring normal operation. As a workaround, I have excluded the admission webhook for all namespaces except the one housing the operator.

I'm still experiencing delays in resource creation within the namespace that hosts the operator. This namespace includes services like nginx, target allocator, otel collector, and others.
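For readers unfamiliar with how a namespace selector limits a webhook to a single namespace, the following Python sketch mimics the label-selector matching the API server performs before calling a webhook. It is an illustration under stated assumptions, not the chart's configuration: the namespace names are hypothetical, and the selector is written as a plain dict. The `kubernetes.io/metadata.name` label is the one Kubernetes sets automatically on every namespace.

```python
def selector_matches(selector: dict, labels: dict) -> bool:
    """Return True if a namespace's labels satisfy a Kubernetes label selector."""
    for key, value in (selector.get("matchLabels") or {}).items():
        if labels.get(key) != value:
            return False
    for expr in selector.get("matchExpressions") or []:
        key, op = expr["key"], expr["operator"]
        values = expr.get("values") or []
        if op == "In":
            ok = key in labels and labels[key] in values
        elif op == "NotIn":
            # A missing key also satisfies NotIn.
            ok = key not in labels or labels[key] not in values
        elif op == "Exists":
            ok = key in labels
        elif op == "DoesNotExist":
            ok = key not in labels
        else:
            raise ValueError(f"unknown operator: {op}")
        if not ok:
            return False
    return True


# Hypothetical selector: call the webhook only for the operator's namespace.
only_operator_ns = {
    "matchExpressions": [
        {
            "key": "kubernetes.io/metadata.name",
            "operator": "In",
            "values": ["otel-operator"],  # hypothetical namespace name
        }
    ]
}

# Pods created in "otel-operator" still go through the webhook; others skip it.
in_scope = selector_matches(only_operator_ns, {"kubernetes.io/metadata.name": "otel-operator"})
out_of_scope = selector_matches(only_operator_ns, {"kubernetes.io/metadata.name": "default"})
```

With a selector like this, every Pod created in the operator's namespace is still sent to the webhook, which is consistent with delays persisting in that one namespace.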

Could you please clarify the following questions for me?