indeedeng / harbor-container-webhook

mutating webhook which rewrites container images to use a Harbor proxy cache
Apache License 2.0

0.4.0 blocks termination of pods created by jobs #20

Closed z0rc closed 1 year ago

z0rc commented 1 year ago

I updated harbor-container-webhook from 0.3.5 to 0.4.0 in one of my clusters. Since the upgrade I've observed strange behavior where pods created by jobs weren't able to complete termination. The pods were stuck in the Terminating state, and there was no way to delete them even with --force.

Initially I suspected that this might be related to finalizers in some way, since pods created by jobs have them. I tried editing the affected pods to unset/nullify the finalizers, but these edits didn't take hold; the finalizers were still present after applying them.

So I switched to investigating admission webhooks. There I found that the harbor-container-webhook logs contained a new kind of event:

2023-03-24T23:24:40Z    DEBUG   controller-runtime.webhook.webhooks wrote response  {"webhook": "/webhook-v1-pod", "code": 400, "reason": "", "UID": "7183f127-a1d2-486a-89e1-e74bd353dc8b", "allowed": false}

These events closely correlated with the pod termination events and my edit attempts.

I reverted to 0.3.5, which fixed the issue for me; the affected terminating pods were automatically cleaned up within minutes.

I'm happy to continue debugging this, but I need some guidance, as right now the debug logs provide non-descriptive messages.
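
As far as I can tell, the mechanism would be: removing the job's finalizer is itself a pod UPDATE that passes through mutating admission, and a response of allowed: false (the code 400 entries above) makes the API server deny that update, so the finalizer stays and the pod never leaves Terminating. Below is a rough controller-runtime sketch of that pattern; the names are illustrative and not the actual harbor-container-webhook code.

// Illustrative sketch (not the actual harbor-container-webhook source): a
// controller-runtime admission handler whose error path returns an errored
// response. admission.Errored produces the kind of
// {"code": 400, ..., "allowed": false} response seen in the logs, and the
// API server then denies the request it was reviewing.
package webhooksketch

import (
	"context"
	"encoding/json"
	"errors"
	"net/http"

	corev1 "k8s.io/api/core/v1"
	"sigs.k8s.io/controller-runtime/pkg/webhook/admission"
)

type podMutator struct{}

func (m *podMutator) Handle(ctx context.Context, req admission.Request) admission.Response {
	pod := &corev1.Pod{}
	if err := json.Unmarshal(req.Object.Raw, pod); err != nil {
		return admission.Errored(http.StatusBadRequest, err)
	}

	// Any failure on this path ends up blocking the request under review,
	// including the UPDATE that would remove a job's finalizer from a
	// terminating pod.
	if err := lookupSomething(pod); err != nil {
		return admission.Errored(http.StatusBadRequest, err)
	}

	return admission.Allowed("nothing to rewrite")
}

// lookupSomething is a stand-in for whatever per-pod information the real
// handler gathers before building its patch.
func lookupSomething(pod *corev1.Pod) error {
	if pod.Spec.NodeName == "" {
		return errors.New("pod has no node assigned yet")
	}
	return nil
}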

cnmcavoy commented 1 year ago

I fixed a small bug in the node lookup and cut a release, v0.4.1, although without more logs from your side I'm not sure it's the same issue as the one here.

If you want to test out v0.4.1, please report back and let me know whether it works as expected.

z0rc commented 1 year ago

Thanks.

I installed 0.4.1 on the same cluster and let it run for a couple of days. So far there have been no issues with job pods stuck in termination.

z0rc commented 1 year ago

Reopening as I see the same issue again.

Here are sample logs:

2023-04-20T17:43:14Z    DEBUG    controller-runtime.webhook.webhooks    received request    {"webhook": "/webhook-v1-pod", "UID": "62d7cc45-128d-4c0b-ad12-4450fcde6080", "kind": "/v1, Kind=Pod", "resource": {"group":"","version":"v1","resource":"pods"}}
2023-04-20T17:43:15Z    ERROR    mutator    rejected patching pod 4fd9e7b7-9f2b-4304-99a9-b48a022a3bdf, failed to lookup node arch or os    {"error": "failed to lookup node ip-10-11-151-117.eu-central-1.compute.internal: nodes \"ip-10-11-151-117.eu-central-1.compute.internal\" not found"}
indeed.com/devops-incubation/harbor-container-webhook/internal/webhook.(*PodContainerProxier).Handle
    /workspace/internal/webhook/mutate.go:70
sigs.k8s.io/controller-runtime/pkg/webhook/admission.(*Webhook).Handle
    /go/pkg/mod/sigs.k8s.io/controller-runtime@v0.14.5/pkg/webhook/admission/webhook.go:169
sigs.k8s.io/controller-runtime/pkg/webhook/admission.(*Webhook).ServeHTTP
    /go/pkg/mod/sigs.k8s.io/controller-runtime@v0.14.5/pkg/webhook/admission/http.go:98
github.com/prometheus/client_golang/prometheus/promhttp.InstrumentHandlerInFlight.func1
    /go/pkg/mod/github.com/prometheus/client_golang@v1.14.0/prometheus/promhttp/instrument_server.go:60
net/http.HandlerFunc.ServeHTTP
    /usr/local/go/src/net/http/server.go:2122
github.com/prometheus/client_golang/prometheus/promhttp.InstrumentHandlerCounter.func1
    /go/pkg/mod/github.com/prometheus/client_golang@v1.14.0/prometheus/promhttp/instrument_server.go:146
net/http.HandlerFunc.ServeHTTP
    /usr/local/go/src/net/http/server.go:2122
github.com/prometheus/client_golang/prometheus/promhttp.InstrumentHandlerDuration.func2
    /go/pkg/mod/github.com/prometheus/client_golang@v1.14.0/prometheus/promhttp/instrument_server.go:108
net/http.HandlerFunc.ServeHTTP
    /usr/local/go/src/net/http/server.go:2122
net/http.(*ServeMux).ServeHTTP
    /usr/local/go/src/net/http/server.go:2500
net/http.serverHandler.ServeHTTP
    /usr/local/go/src/net/http/server.go:2936
net/http.(*conn).serve
    /usr/local/go/src/net/http/server.go:1995
2023-04-20T17:43:15Z    DEBUG    controller-runtime.webhook.webhooks    wrote response    {"webhook": "/webhook-v1-pod", "code": 400, "reason": "", "UID": "62d7cc45-128d-4c0b-ad12-4450fcde6080", "allowed": false}

The pods stuck in the Terminating state were running on a node that had already been removed from the cluster.

As far as I can tell, this is caused by the node leaving the cluster before the drain process could complete and gracefully terminate all pods on it. I'm running the cluster on spot instances, so this kind of event is expected and actually harmless.
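
A rough client-go sketch of how that correlation can be confirmed, assuming an already-configured kubernetes.Interface client; the helper name is illustrative and not part of the webhook.

// Illustrative helper: list pods that are terminating but scheduled to a node
// that no longer exists in the cluster, e.g. after a spot-instance interruption.
package webhooksketch

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

func orphanedTerminatingPods(ctx context.Context, client kubernetes.Interface) ([]string, error) {
	nodes, err := client.CoreV1().Nodes().List(ctx, metav1.ListOptions{})
	if err != nil {
		return nil, err
	}
	existing := make(map[string]bool, len(nodes.Items))
	for _, n := range nodes.Items {
		existing[n.Name] = true
	}

	pods, err := client.CoreV1().Pods(metav1.NamespaceAll).List(ctx, metav1.ListOptions{})
	if err != nil {
		return nil, err
	}
	var stuck []string
	for _, p := range pods.Items {
		// A deletion timestamp plus a vanished node matches the situation
		// described above.
		if p.DeletionTimestamp != nil && p.Spec.NodeName != "" && !existing[p.Spec.NodeName] {
			stuck = append(stuck, fmt.Sprintf("%s/%s (was on %s)", p.Namespace, p.Name, p.Spec.NodeName))
		}
	}
	return stuck, nil
}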

cnmcavoy commented 1 year ago

I refactored the code path to fail open and default to the webhook's arch and OS in this situation, logging the issue as a warning. I pushed the new release as v0.4.2 -- hopefully that's the end of this bug!
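
A minimal sketch of what fail-open can look like here, assuming a client-go node lookup; the identifiers are illustrative and not the actual v0.4.2 code. On a lookup failure the handler logs a warning and falls back to the OS/architecture the webhook binary itself runs on, instead of returning an errored response that blocks pod updates.

// Illustrative sketch of a fail-open node lookup (not the actual v0.4.2 code).
package webhooksketch

import (
	"context"
	"log"
	"runtime"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// nodeOSArch returns the OS and architecture of the named node, defaulting to
// the webhook's own runtime values when the node is already gone (for example
// after a spot-instance interruption), rather than rejecting the request.
func nodeOSArch(ctx context.Context, client kubernetes.Interface, nodeName string) (os, arch string) {
	node, err := client.CoreV1().Nodes().Get(ctx, nodeName, metav1.GetOptions{})
	if err != nil {
		// Fail open: warn and use defaults instead of returning a 400 that
		// would block pod termination.
		log.Printf("warning: failed to look up node %s, defaulting arch/os: %v", nodeName, err)
		return runtime.GOOS, runtime.GOARCH
	}
	return node.Status.NodeInfo.OperatingSystem, node.Status.NodeInfo.Architecture
}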