OperatorSDK issue after restarting neon-cluster-operator?

jefflill commented 1 year ago

It looks like the OperatorSDK may be having problems reestablishing webhooks after the operator is restarted.

I restarted neon-cluster-operator after setting LOG_LEVEL=trace while trying to debug the performance issue. The API Server immediately showed fairly high CPU usage, and it appears to be unable to send webhook requests to the new neon-cluster-operator pod (the neon-acme OpenAPI errors from #1847 are intermixed in the log as well):

{"ts":1692403829344.051,"caller":"openapi/controller.go:116","msg":"loading OpenAPI spec for \"v1alpha1.acme.neoncloud.io\" failed with: OpenAPI spec does not exist\n"}
{"ts":1692403829344.0842,"caller":"openapi/controller.go:129","msg":"OpenAPI AggregationController: action for item v1alpha1.acme.neoncloud.io: Rate Limited Requeue.\n","v":0}
{"ts":1692403836935.4392,"caller":"mutating/dispatcher.go:180","msg":"Failed calling webhook, failing open deployment-policy.neonkube.io: failed calling webhook \"deployment-policy.neonkube.io\": failed to call webhook: Post \"https://neon-cluster-operator.neon-system.svc:443/apps/v1/deployments/deploymentwebhook/mutate?timeout=5s\": dial tcp 10.253.74.44:443: connect: connection refused\n","v":0}
{"ts":1692403836935.4954,"caller":"mutating/dispatcher.go:184","msg":"failed calling webhook \"deployment-policy.neonkube.io\": failed to call webhook: Post \"https://neon-cluster-operator.neon-system.svc:443/apps/v1/deployments/deploymentwebhook/mutate?timeout=5s\": dial tcp 10.253.74.44:443: connect: connection refused\n"}
{"ts":1692403846959.5413,"caller":"mutating/dispatcher.go:180","msg":"Failed calling webhook, failing open deployment-policy.neonkube.io: failed calling webhook \"deployment-policy.neonkube.io\": failed to call webhook: Post \"https://neon-cluster-operator.neon-system.svc:443/apps/v1/deployments/deploymentwebhook/mutate?timeout=5s\": dial tcp 10.253.74.44:443: connect: connection refused\n","v":0}
{"ts":1692403846959.5977,"caller":"mutating/dispatcher.go:184","msg":"failed calling webhook \"deployment-policy.neonkube.io\": failed to call webhook: Post \"https://neon-cluster-operator.neon-system.svc:443/apps/v1/deployments/deploymentwebhook/mutate?timeout=5s\": dial tcp 10.253.74.44:443: connect: connection refused\n"}
{"ts":1692403856987.9785,"caller":"mutating/dispatcher.go:180","msg":"Failed calling webhook, failing open deployment-policy.neonkube.io: failed calling webhook \"deployment-policy.neonkube.io\": failed to call webhook: Post \"https://neon-cluster-operator.neon-system.svc:443/apps/v1/deployments/deploymentwebhook/mutate?timeout=5s\": dial tcp 10.253.74.44:443: connect: connection refused\n","v":0}
{"ts":1692403856988.0146,"caller":"mutating/dispatcher.go:184","msg":"failed calling webhook \"deployment-policy.neonkube.io\": failed to call webhook: Post \"https://neon-cluster-operator.neon-system.svc:443/apps/v1/deployments/deploymentwebhook/mutate?timeout=5s\": dial tcp 10.253.74.44:443: connect: connection refused\n"}
{"ts":1692403867015.4243,"caller":"mutating/dispatcher.go:180","msg":"Failed calling webhook, failing open deployment-policy.neonkube.io: failed calling webhook \"deployment-policy.neonkube.io\": failed to call webhook: Post \"https://neon-cluster-operator.neon-system.svc:443/apps/v1/deployments/deploymentwebhook/mutate?timeout=5s\": dial tcp 10.253.74.44:443: connect: connection refused\n","v":0}
{"ts":1692403867015.4587,"caller":"mutating/dispatcher.go:184","msg":"failed calling webhook \"deployment-policy.neonkube.io\": failed to call webhook: Post \"https://neon-cluster-operator.neon-system.svc:443/apps/v1/deployments/deploymentwebhook/mutate?timeout=5s\": dial tcp 10.253.74.44:443: connect: connection refused\n"}
{"ts":1692403877039.0762,"caller":"mutating/dispatcher.go:180","msg":"Failed calling webhook, failing open deployment-policy.neonkube.io: failed calling webhook \"deployment-policy.neonkube.io\": failed to call webhook: Post \"https://neon-cluster-operator.neon-system.svc:443/apps/v1/deployments/deploymentwebhook/mutate?timeout=5s\": dial tcp 10.253.74.44:443: connect: connection refused\n","v":0}
{"ts":1692403877039.1125,"caller":"mutating/dispatcher.go:184","msg":"failed calling webhook \"deployment-policy.neonkube.io\": failed to call webhook: Post \"https://neon-cluster-operator.neon-system.svc:443/apps/v1/deployments/deploymentwebhook/mutate?timeout=5s\": dial tcp 10.253.74.44:443: connect: connection refused\n"}
{"ts":1692403887063.5889,"caller":"mutating/dispatcher.go:180","msg":"Failed calling webhook, failing open deployment-policy.neonkube.io: failed calling webhook \"deployment-policy.neonkube.io\": failed to call webhook: Post \"https://neon-cluster-operator.neon-system.svc:443/apps/v1/deployments/deploymentwebhook/mutate?timeout=5s\": dial tcp 10.253.74.44:443: connect: connection refused\n","v":0}
{"ts":1692403887063.643,"caller":"mutating/dispatcher.go:184","msg":"failed calling webhook \"deployment-policy.neonkube.io\": failed to call webhook: Post \"https://neon-cluster-operator.neon-system.svc:443/apps/v1/deployments/deploymentwebhook/mutate?timeout=5s\": dial tcp 10.253.74.44:443: connect: connection refused\n"}
{"ts":1692403897103.4758,"caller":"mutating/dispatcher.go:180","msg":"Failed calling webhook, failing open deployment-policy.neonkube.io: failed calling webhook \"deployment-policy.neonkube.io\": failed to call webhook: Post \"https://neon-cluster-operator.neon-system.svc:443/apps/v1/deployments/deploymentwebhook/mutate?timeout=5s\": dial tcp 10.253.74.44:443: connect: connection refused\n","v":0}
{"ts":1692403897103.534,"caller":"mutating/dispatcher.go:184","msg":"failed calling webhook \"deployment-policy.neonkube.io\": failed to call webhook: Post \"https://neon-cluster-operator.neon-system.svc:443/apps/v1/deployments/deploymentwebhook/mutate?timeout=5s\": dial tcp 10.253.74.44:443: connect: connection refused\n"}
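For anyone triaging similar output, here is a minimal sketch of how the excerpt could be summarized rather than read line by line. It assumes that `ts` is milliseconds since the Unix epoch (inferred from the magnitude of the values, not confirmed anywhere in this thread); the helper names `webhook_failures` and `gaps_seconds` are made up for illustration. The sample lines are copied from the excerpt above.

```python
import json

# Four lines copied verbatim from the log excerpt above:
# one OpenAPI aggregation line and three "failing open" webhook lines.
SAMPLE = r"""
{"ts":1692403829344.0842,"caller":"openapi/controller.go:129","msg":"OpenAPI AggregationController: action for item v1alpha1.acme.neoncloud.io: Rate Limited Requeue.\n","v":0}
{"ts":1692403836935.4392,"caller":"mutating/dispatcher.go:180","msg":"Failed calling webhook, failing open deployment-policy.neonkube.io: failed calling webhook \"deployment-policy.neonkube.io\": failed to call webhook: Post \"https://neon-cluster-operator.neon-system.svc:443/apps/v1/deployments/deploymentwebhook/mutate?timeout=5s\": dial tcp 10.253.74.44:443: connect: connection refused\n","v":0}
{"ts":1692403846959.5413,"caller":"mutating/dispatcher.go:180","msg":"Failed calling webhook, failing open deployment-policy.neonkube.io: failed calling webhook \"deployment-policy.neonkube.io\": failed to call webhook: Post \"https://neon-cluster-operator.neon-system.svc:443/apps/v1/deployments/deploymentwebhook/mutate?timeout=5s\": dial tcp 10.253.74.44:443: connect: connection refused\n","v":0}
{"ts":1692403856987.9785,"caller":"mutating/dispatcher.go:180","msg":"Failed calling webhook, failing open deployment-policy.neonkube.io: failed calling webhook \"deployment-policy.neonkube.io\": failed to call webhook: Post \"https://neon-cluster-operator.neon-system.svc:443/apps/v1/deployments/deploymentwebhook/mutate?timeout=5s\": dial tcp 10.253.74.44:443: connect: connection refused\n","v":0}
"""


def webhook_failures(log_text):
    """Return (timestamp_seconds, message) for each 'failing open' webhook line."""
    hits = []
    for line in log_text.splitlines():
        line = line.strip()
        if not line:
            continue
        entry = json.loads(line)
        is_dispatcher = entry["caller"].startswith("mutating/dispatcher.go")
        if is_dispatcher and "failing open" in entry["msg"]:
            # Assumption: "ts" is in milliseconds since the Unix epoch.
            hits.append((entry["ts"] / 1000.0, entry["msg"].strip()))
    return hits


def gaps_seconds(hits):
    """Seconds between consecutive failures, rounded to one decimal place."""
    times = [t for t, _ in hits]
    return [round(later - earlier, 1) for earlier, later in zip(times, times[1:])]


if __name__ == "__main__":
    failures = webhook_failures(SAMPLE)
    print(len(failures), "webhook failures; gaps (s):", gaps_seconds(failures))
```

On the sample lines the failures are spaced about ten seconds apart, which looks more like a steady periodic caller hitting the webhook than a tight retry loop, though the excerpt alone does not say what that caller is.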

marcusbooyah commented 1 year ago

Was this a single-node cluster? Did the errors not go away once the operator had started up?

marcusbooyah commented 1 year ago

I don't think this is an issue.

jefflill commented 1 year ago

Yeah, it was probably a single-node cluster. This is an example of the sort of thing I've been seeing in the logs that seemed a bit weird, so I'm creating issues for them.

...not sure it's a problem either.
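If someone wants to confirm that the errors are transient, one option is to poll the webhook endpoint until it accepts TCP connections again. The following is only a sketch: the helper names are hypothetical, it checks TCP reachability and nothing about TLS or the webhook handler itself, and the service name from the log would only resolve from inside the cluster.

```python
import socket
import time


def accepts_connections(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers "connection refused", timeouts, and name-resolution failures.
        return False


def wait_until_listening(host, port, attempts=30, delay=1.0):
    """Poll host:port up to `attempts` times; return True once it accepts a connection."""
    for _ in range(attempts):
        if accepts_connections(host, port):
            return True
        time.sleep(delay)
    return False
```

From a pod inside the cluster, `wait_until_listening("neon-cluster-operator.neon-system.svc", 443)` returning `True` shortly after the restart would be consistent with the "connection refused" lines being nothing more than the window while the only operator pod was down.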

nforgeio/neonKUBE #1852