Open pureklkl opened 5 months ago
Could you go into some more detail about the problem you ran into? How many Pods were you creating, how many resources was the operator using, and what happened to the API Server as a result?
A newbie question: what is the purpose of the webhook? And what would be the potential problems with disabling it? Thanks in advance.
The purpose of the webhook is to make changes to Pods before they're created. For the operator this involves injecting autoinstrumentation or sidecar Otel Collector containers. As Pods are immutable, this can only happen when they're created. The API Server calls the webhook on every Pod creation, and the operator gets to decide if it wants to make any changes to the Pod manifest.
Webhooks can also be used for other things, like doing validation, but for Pods specifically, the operator only does what I outlined above.
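To make the flow described above concrete, here is a minimal sketch of the decision a mutating webhook makes for each Pod creation. It is not the operator's actual code: the annotation key `example.com/inject-sidecar`, the container name, and the image are all hypothetical, and the Pod is modelled as a plain dict rather than a real Kubernetes object.

```python
import json

# Hypothetical opt-in annotation, used here for illustration only.
SIDECAR_ANNOTATION = "example.com/inject-sidecar"


def mutate_pod(pod: dict) -> list:
    """Return a JSON Patch (RFC 6902) for the Pod, or [] to leave it unchanged."""
    annotations = pod.get("metadata", {}).get("annotations") or {}
    if annotations.get(SIDECAR_ANNOTATION) != "true":
        # Nothing to inject, but the API Server still had to make this call.
        return []
    # Hypothetical sidecar definition appended to the Pod's container list.
    sidecar = {"name": "collector-sidecar", "image": "example/collector:latest"}
    return [{"op": "add", "path": "/spec/containers/-", "value": sidecar}]


if __name__ == "__main__":
    plain = {"metadata": {"name": "web"}, "spec": {"containers": [{"name": "app"}]}}
    opted_in = {
        "metadata": {"name": "api", "annotations": {SIDECAR_ANNOTATION: "true"}},
        "spec": {"containers": [{"name": "app"}]},
    }
    for pod in (plain, opted_in):
        print(pod["metadata"]["name"], json.dumps(mutate_pod(pod)))
```

Note that the Pod without the annotation still goes through the function. That per-Pod round trip is the cost being discussed in this issue, even when no changes are made.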
It's not invalid to want to disable them, but this is fairly advanced customization of the operator, and can get you in trouble if you don't fully understand what you're doing. As a proper solution to @pureklkl's problem, I'd prefer to optimize the webhook so it can easily cope with large bursts of requests.
I encountered a similar issue with the operator running in my cluster, which caused delays in creating new resources. Upon investigation, I traced the problem back to the operator. Deleting the operator pod resolved the issue and restored normal operation. As a workaround, I have excluded the admission webhook for all namespaces except the one housing the operator.
I'm still experiencing delays in resource creation within the namespace that hosts the operator. This namespace includes services like nginx, target allocator, otel collector, and others.
Could you please clarify the following questions for me?
The admission webhook that monitors Pod creation may put a performance burden on the k8s API server. Until the performance problem is resolved, please add an option to turn it off.
The problematic webhook: https://github.com/open-telemetry/opentelemetry-helm-charts/blob/opentelemetry-operator-0.53.0/charts/opentelemetry-operator/templates/admission-webhooks/operator-webhook.yaml#L128
When a significant number of pods is created at the same time, the performance burden brings down the opentelemetry operator and subsequently the k8s API server, killing the entire cluster. Even with instrumentation turned off, the webhook still imposes unnecessary cost on the k8s API server. Unlike the instrumentation and opentelemetry collector resource webhooks, the performance cost of the pod webhook scales with the cluster size.
The only options that provide a similar function are the namespace/object selectors, but they are not as ideal as simply removing this webhook when it is not used.
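For readers unfamiliar with how a namespace selector narrows the set of Pods that reach the webhook, the following is a rough sketch of Kubernetes label-selector matching as I understand it (`matchLabels` plus `matchExpressions` with `In`, `NotIn`, `Exists`, `DoesNotExist`). The function name `selector_matches` is made up for this example, and the namespace name `opentelemetry-operator-system` is an assumption about where the operator is installed.

```python
def selector_matches(selector: dict, labels: dict) -> bool:
    """Return True if the labels satisfy every requirement in the selector."""
    # matchLabels: every listed key must be present with exactly that value.
    for key, value in (selector.get("matchLabels") or {}).items():
        if labels.get(key) != value:
            return False
    # matchExpressions: all expressions are ANDed together.
    for expr in selector.get("matchExpressions") or []:
        key, op = expr["key"], expr["operator"]
        values = expr.get("values") or []
        if op == "In":
            ok = key in labels and labels[key] in values
        elif op == "NotIn":
            ok = key not in labels or labels[key] not in values
        elif op == "Exists":
            ok = key in labels
        elif op == "DoesNotExist":
            ok = key not in labels
        else:
            raise ValueError(f"unknown operator: {op}")
        if not ok:
            return False
    return True  # an empty selector matches everything


if __name__ == "__main__":
    # Only call the webhook for Pods in the (assumed) operator namespace.
    only_operator_ns = {
        "matchExpressions": [{
            "key": "kubernetes.io/metadata.name",
            "operator": "In",
            "values": ["opentelemetry-operator-system"],
        }]
    }
    for ns in ("default", "opentelemetry-operator-system"):
        labels = {"kubernetes.io/metadata.name": ns}
        print(ns, selector_matches(only_operator_ns, labels))
```

With a selector like this the API Server skips the webhook call for every other namespace, but the webhook registration itself remains in place, which is the difference from removing it outright.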