open-telemetry / opentelemetry-operator

Kubernetes Operator for OpenTelemetry Collector
Apache License 2.0

OpenTelemetry webhook modifies orphaned pods in an invalid way #1514

Closed: andrewdinunzio closed this issue 7 months ago

andrewdinunzio commented 1 year ago

Using the OpenTelemetry Operator Helm chart, version 0.18.3.

This will be a bit difficult to describe or reproduce, but I wanted to open this issue to record what I've found in the meantime.

Yesterday we saw a huge number of pods stuck in the Terminating phase. The root cause is still somewhat unknown, but we figure one of two things happened:

We have finalizers on Pods for record-keeping purposes that our own operator removes once its work is complete. However, our operator was unable to remove the finalizer. It reported errors like:

"The Pod "{PODNAME}" is invalid: spec.initContainers: Forbidden: pod updates may not add or remove containers"

We also tried to remove the finalizer manually, both via kubectl patch, which failed with the same error, and via kubectl edit, which rejected the pod as invalid for the same reason.
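For illustration, here is a minimal client-go sketch (with a hypothetical finalizer name and package, not our operator's actual code) of removing a Pod finalizer with a JSON patch. Note that a patch is still an UPDATE as far as admission is concerned, so it also passes through the mutating webhooks:

// Package finalizers is a sketch, not production code.
package finalizers

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/types"
	"k8s.io/client-go/kubernetes"
)

// removeFinalizer strips a single finalizer from a Pod via a JSON patch.
func removeFinalizer(ctx context.Context, cs kubernetes.Interface, ns, name, finalizer string) error {
	pod, err := cs.CoreV1().Pods(ns).Get(ctx, name, metav1.GetOptions{})
	if err != nil {
		return err
	}
	for i, f := range pod.Finalizers {
		if f != finalizer {
			continue
		}
		// Remove only the matching finalizer entry, by index.
		patch := []byte(fmt.Sprintf(`[{"op":"remove","path":"/metadata/finalizers/%d"}]`, i))
		_, err = cs.CoreV1().Pods(ns).Patch(ctx, name, types.JSONPatchType, patch, metav1.PatchOptions{})
		return err
	}
	return nil // finalizer not present, nothing to do
}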

So here's where I think there could be an issue with the opentelemetry webhook:

Our pod already has an init container, so I tried to remove it as part of the kubectl edit and was met with a different 'invalid pod' error. In this case, there was an init container entry with duplicate keys: two name fields, one with the name of our init container and one that appeared to have been injected by the OpenTelemetry webhook.

Seeing this, I modified the webhook service to point to a bogus port to effectively disable it. After that, I was able to successfully remove the finalizer.

Some other notes: I'll try to add them as I find them, but this is the gist of it.

svmaris commented 1 year ago

We're running into a similar issue with Jobs. When we auto-instrument them using the instrumentation.opentelemetry.io/inject-sdk: "true" annotation, the Pods are stuck on Terminating indefinitely after the Job completes.

Manually removing the batch.kubernetes.io/job-tracking finalizer is not allowed. Only after deleting the opentelemetry-operator Mutating/ValidatingWebhookConfigurations am I able to remove the finalizer and delete the stuck Pod.

Not sure what's happening here, but it seems like the Webhook is preventing the Job Controller from removing the finalizer and finishing the Job.

edude03 commented 1 year ago

Looks like the same issue(s) as #1541, #1417, and #1366.

I'm having the same issue. One thing I noticed, which you can also see in #1541, is that the webhook attempts to re-inject variables that already exist. For example:

            Name: "OTEL_RESOURCE_ATTRIBUTES",                                                                                                                             
            Value: strings.Join({                                                                                                                                         
              ... // 131 identical bytes                                                                                                                                  
              "TEL_RESOURCE_ATTRIBUTES_NODE_NAME),k8s.pod.name=$(OTEL_RESOURCE_",                                                                                         
              "ATTRIBUTES_POD_NAME),k8s.replicaset.name=xxxxx",                                                                                         
+             ",k8s.node.name=$(OTEL_RESOURCE_ATTRIBUTES_NODE_NAME),k8s.pod.nam",                                                                                         
+             "e=$(OTEL_RESOURCE_ATTRIBUTES_POD_NAME),k8s.pod.uid=xxxx",                                                                                                                                                                                                                  
            }, ""),                             

Notice that k8s.pod.name already exists on the first line but is being added again on the third. My guess is that the check for whether the pod was already injected fails to match for some reason, so the injection runs again.

What I don't understand, however, is what is trying to add these variables and why. As far as I can tell, the variables are added via a MutatingWebhook, so there shouldn't be any updates to the pod after creation 🤔
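For what it's worth, here is the kind of idempotency check I would expect, as a rough sketch (mergeResourceAttributes is a hypothetical helper, not the operator's actual function): attribute keys that are already present in OTEL_RESOURCE_ATTRIBUTES are skipped, so running the mutation a second time against the same pod leaves it unchanged.

package instrumentation

import (
	"sort"
	"strings"

	corev1 "k8s.io/api/core/v1"
)

// mergeResourceAttributes appends only the resource attributes that are not
// already present, so re-running the webhook on an existing pod does not
// duplicate keys like k8s.pod.name.
func mergeResourceAttributes(container *corev1.Container, attrs map[string]string) {
	idx := -1
	for i, env := range container.Env {
		if env.Name == "OTEL_RESOURCE_ATTRIBUTES" {
			idx = i
			break
		}
	}

	existing := ""
	if idx >= 0 {
		existing = container.Env[idx].Value
	}

	// Sort the keys so the merged value is deterministic across invocations.
	keys := make([]string, 0, len(attrs))
	for k := range attrs {
		keys = append(keys, k)
	}
	sort.Strings(keys)

	parts := []string{}
	if existing != "" {
		parts = append(parts, existing)
	}
	for _, k := range keys {
		if strings.Contains(existing, k+"=") {
			// Already injected on a previous pass; skip instead of appending again.
			continue
		}
		parts = append(parts, k+"="+attrs[k])
	}

	merged := corev1.EnvVar{Name: "OTEL_RESOURCE_ATTRIBUTES", Value: strings.Join(parts, ",")}
	if idx >= 0 {
		container.Env[idx] = merged
	} else {
		container.Env = append(container.Env, merged)
	}
}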

andrewdinunzio commented 1 year ago

I feel like I'm getting close to hitting this in the debugger using Telepresence. In case it helps anyone else look into this, here are the changes I made to intercept the webhook; with them I was able to hit breakpoints in the webhook code:

diff --git a/Makefile b/Makefile
index febd731..fad5ec6 100644
--- a/Makefile
+++ b/Makefile
@@ -13,6 +13,7 @@ AUTO_INSTRUMENTATION_DOTNET_VERSION ?= "$(shell grep -v '\#' versions.txt | grep
 AUTO_INSTRUMENTATION_APACHE_HTTPD_VERSION ?= "$(shell grep -v '\#' versions.txt | grep autoinstrumentation-apache-httpd | awk -F= '{print $$2}')"
 LD_FLAGS ?= "-X ${VERSION_PKG}.version=${VERSION} -X ${VERSION_PKG}.buildDate=${VERSION_DATE} -X ${VERSION_PKG}.otelCol=${OTELCOL_VERSION} -X ${VERSION_PKG}.targetAllocator=${TARGETALLOCATOR_VERSION} -X ${VERSION_PKG}.operatorOpAMPBridge=${OPERATOR_OPAMP_BRIDGE_VERSION} -X ${VERSION_PKG}.autoInstrumentationJava=${AUTO_INSTRUMENTATION_JAVA_VERSION} -X ${VERSION_PKG}.autoInstrumentationNodeJS=${AUTO_INSTRUMENTATION_NODEJS_VERSION} -X ${VERSION_PKG}.autoInstrumentationPython=${AUTO_INSTRUMENTATION_PYTHON_VERSION} -X ${VERSION_PKG}.autoInstrumentationDotNet=${AUTO_INSTRUMENTATION_DOTNET_VERSION}"
 ARCH ?= $(shell go env GOARCH)
+SHELL := /bin/bash

 # Image URL to use all building/pushing image targets
 IMG_PREFIX ?= ghcr.io/${USER}/opentelemetry-operator
@@ -45,7 +46,7 @@ GOBIN=$(shell go env GOBIN)
 endif

 # by default, do not run the manager with webhooks enabled. This only affects local runs, not the build or in-cluster deployments.
-ENABLE_WEBHOOKS ?= false
+ENABLE_WEBHOOKS ?= true

@@ -101,10 +102,24 @@ test: generate fmt vet ensure-generate-is-noop envtest
 manager: generate fmt vet
        go build -o bin/manager main.go

+.PHONY: copy-certs
+copy-certs:
+       @[ -d /tmp/k8s-webhook-server/serving-certs/ ] || { \
+       mkdir -p /tmp/k8s-webhook-server/serving-certs/;\
+       kubectl get secret monitoring-stack-aaaa-opentelemetry-operator-controller-manager-service-cert -o yaml | yq '.data["tls.crt"]' | base64 --decode > /tmp/k8s-webhook-server/serving-certs/tls.crt;\
+       kubectl get secret monitoring-stack-aaaa-opentelemetry-operator-controller-manager-service-cert -o yaml | yq '.data["tls.key"]' | base64 --decode > /tmp/k8s-webhook-server/serving-certs/tls.key;\
+       kubectl get secret monitoring-stack-aaaa-opentelemetry-operator-controller-manager-service-cert -o yaml | yq '.data["ca.crt"]' | base64 --decode > /tmp/k8s-webhook-server/serving-certs/ca.crt;\
+       }
+
 # Run against the configured Kubernetes cluster in ~/.kube/config
 .PHONY: run
-run: generate fmt vet manifests
-       ENABLE_WEBHOOKS=$(ENABLE_WEBHOOKS) go run -ldflags ${LD_FLAGS} ./main.go --zap-devel
+run: generate fmt vet manifests copy-certs
+       go build -o bin/otel-operator -gcflags='all=-N -l' -ldflags ${LD_FLAGS} ./main.go
+       ENABLE_WEBHOOKS=$(ENABLE_WEBHOOKS) telepresence intercept monitoring-stack-aaaa-opentelemetry-operator --port 9443:443 -- dlv exec bin/otel-operator --headless --listen=:2345 --log --api-version=2 --accept-multiclient --continue --only-same-user=false -- --zap-devel --webhook-port=9443 --metrics-addr=0.0.0.0:8080 --enable-leader-election --health-probe-addr=:8081 --collector-image=otel/opentelemetry-collector-contrib:0.71.0

Also had to add this to main.go (put whatever namespace the otel operator is in):

+               LeaderElectionNamespace: "monitoring",
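For context, a sketch of where that field lands, assuming ctrl is sigs.k8s.io/controller-runtime; the surrounding fields are illustrative, only LeaderElectionNamespace is the actual addition:

	// In main.go, when constructing the manager (illustrative values):
	mgr, err := ctrl.NewManager(ctrl.GetConfigOrDie(), ctrl.Options{
		LeaderElection:          true,
		LeaderElectionID:        "example-operator-lock", // hypothetical ID
		LeaderElectionNamespace: "monitoring",            // namespace the operator runs in
	})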

Also added this launch config to .vscode/launch.json:

{
    // Use IntelliSense to learn about possible attributes.
    // Hover to view descriptions of existing attributes.
    // For more information, visit: https://go.microsoft.com/fwlink/?linkid=830387
    "version": "0.2.0",
    "configurations": [
        {
            "name": "Debug",
            "type": "go",
            "debugAdapter": "dlv-dap",
            "request": "attach",
            "mode": "remote",
            "port": 2345,
            "host": "127.0.0.1"
        }
    ]
}

Started debugging with:

telepresence helm install # install telepresence into the cluster
kubectl config set-context --current --namespace {otel-namespace}
make run # start local instance with debugger and connect telepresence
# run the "Debug" launch configuration from the Debug tab

Note that when intercepting with Telepresence you need to use the name of the Deployment, not the Service (monitoring-stack-aaaa-opentelemetry-operator in my case). Telepresence figures out the corresponding Service from the ports, etc.

GarrettGeorge commented 1 year ago

(Feel free to remove this if deemed too out of scope or off topic for this issue)

Have any upstream issues been found related to this? Although I'm not using this operator, I have an Open Policy Agent mutating webhook which exhibits the same behavior described here.

Our mutation operates on .spec.containers[*].image and .spec.initContainers[*].image to ensure all images come from our internal registry. In our "bad pod", the mutation doesn't seem to have ever been applied, and when the webhook pod receives the request it tries to update the image and fails for reasons similar to those above:

# pods "cronjob-***" was not valid:
 # * spec: Forbidden: pod updates may not change fields other than `spec.containers[*].image`, `spec.initContainers[*].image`, `spec.activeDeadlineSeconds`, `spec.tolerations` (only additions to existing tolerations) or `spec.terminationGracePeriodSeconds` (allow it to be set to 1 if it was previously negative)
 #   core.PodSpec{
 #     ... // 11 identical fields
 #     NodeName:        "aks-nodepool1-***",
 #     SecurityContext: &{},
 #     ImagePullSecrets: []core.LocalObjectReference{
 # -     {Name: "private-registry1"},
 # +     {Name: "private-registry2"},
 # -     {Name: "private-registry3"},
 #     },
 #     Hostname:  "",
 #     Subdomain: "",
 #     ... // 15 identical fields
 #   }

Even if you try to change something the API says you should be able to change, it complains. I'm not sure whether this is necessarily an upstream issue, but it's definitely not unique to opentelemetry-operator.

kristian-schjelderup commented 9 months ago

It seems the mutating webhook doesn't account for the lifecycle of the resource it receives. The modifications to the Pod definition should be made on the initial Pod CREATE, not on any later updates.

Based on this, my take would be that the MutatingWebhookConfiguration should only trigger on CREATE operations for Pod resources, not on both CREATE and UPDATE as it currently does.
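To illustrate (a sketch under my own assumptions, not the operator's actual code or markers): with a controller-runtime webhook, the operations the generated MutatingWebhookConfiguration triggers on come from the kubebuilder marker, and the handler itself can also short-circuit anything that is not a CREATE.

package webhook

import (
	"context"

	admissionv1 "k8s.io/api/admission/v1"
	"sigs.k8s.io/controller-runtime/pkg/webhook/admission"
)

// The path and name below are illustrative; verbs=create is the point, so the
// generated configuration would only match CREATE operations on pods.
//
//+kubebuilder:webhook:path=/mutate-v1-pod,mutating=true,failurePolicy=ignore,sideEffects=None,groups="",resources=pods,verbs=create,versions=v1,name=mpod.example.io,admissionReviewVersions=v1
type podMutator struct{}

// Handle additionally ignores anything that is not a CREATE, so later updates
// (for example finalizer removal) are never touched by the mutator.
func (m *podMutator) Handle(ctx context.Context, req admission.Request) admission.Response {
	if req.Operation != admissionv1.Create {
		return admission.Allowed("pod updates are not mutated")
	}
	// ... the usual sidecar / auto-instrumentation injection would go here ...
	return admission.Allowed("nothing to inject")
}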

andrewdinunzio commented 8 months ago

Something like this? https://github.com/open-telemetry/opentelemetry-operator/pull/2584