kubeflow / spark-operator

Kubernetes operator for managing the lifecycle of Apache Spark applications on Kubernetes.
Apache License 2.0

-webhook-fail-on-error can't be configured with the helm chart #1907

Closed Dlougach closed 3 weeks ago

Dlougach commented 8 months ago

There is a pull request addressing this issue (#1685) but nobody has reviewed it.

Dlougach commented 8 months ago

A workaround for the time being is to use a Helm post-render script that looks like:

yq '. |= (select(.kind == "Deployment") | .spec.template.spec.containers[0].args += "-webhook-fail-on-error=true")' -
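
For anyone who wants to wire this up end to end, here is a minimal sketch of a post-renderer wrapper around that command (the script, release, and chart names are illustrative; it assumes mikefarah's yq v4 is on the PATH):

#!/usr/bin/env bash
# post-renderer.sh: Helm pipes the rendered manifests to stdin and expects the
# patched manifests on stdout.
set -euo pipefail

# The command from the workaround above: append the flag to the first container
# of the operator Deployment.
yq '. |= (select(.kind == "Deployment") | .spec.template.spec.containers[0].args += "-webhook-fail-on-error=true")' -

Then invoke it with something like:

chmod +x post-renderer.sh
helm upgrade --install spark-operator spark-operator/spark-operator --post-renderer ./post-renderer.sh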
domenicbove commented 7 months ago

Hi @Dlougach, thanks for this fix! Can you describe what the -webhook-fail-on-error=true arg does?

For context, on my deployment it looks like the webhook certs expire after some time, and after that the spark operator creates invalid driver pods (missing PVC mount) and my driver pods start to fail. The spark-operator pod does not crash, though; it just logs TLS errors.

Rolling the spark-operator deployment seems to resolve the issue, and my solution was to create a simple k8s CronJob that runs kubectl rollout restart. But if this argument essentially restarts the pod, that would be way better! LMK!
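
For reference, the kind of CronJob I mean is roughly the following (the schedule, image, names, and ServiceAccount are illustrative; the ServiceAccount needs RBAC permission to patch the deployment):

apiVersion: batch/v1
kind: CronJob
metadata:
  name: spark-operator-restart              # illustrative name
spec:
  schedule: "0 3 * * *"                     # e.g. restart nightly
  jobTemplate:
    spec:
      template:
        spec:
          serviceAccountName: spark-operator-restarter   # needs "patch" on deployments
          restartPolicy: Never
          containers:
            - name: kubectl
              image: bitnami/kubectl:1.27   # any image with kubectl works
              command: ["kubectl", "rollout", "restart", "deployment/spark-operator"]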

Dlougach commented 7 months ago

Can you describe what the -webhook-fail-on-error=true arg does?

It makes sure that if the Kubernetes API server can't reach the webhook, pod creation will fail (the default behaviour is that the pod runs as is - in the configuration it was created with, without the webhook's mutations).
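
If I read the operator correctly, this corresponds to the failurePolicy field on the MutatingWebhookConfiguration the operator manages; a rough sketch (names are illustrative, and required fields such as clientConfig, rules, sideEffects, and admissionReviewVersions are omitted):

apiVersion: admissionregistration.k8s.io/v1
kind: MutatingWebhookConfiguration
metadata:
  name: spark-operator-webhook-config      # illustrative; created by the operator
webhooks:
  - name: webhook.sparkoperator.k8s.io     # illustrative
    failurePolicy: Fail                    # with -webhook-fail-on-error=true
    # failurePolicy: Ignore                # default: pods are admitted without the mutations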

domenicbove commented 7 months ago

Hmmm, that sounds helpful. In my case, though, where the webhook wasn't working and the pods were getting created incorrectly, what I really need is for the cert-regeneration script to run. I'd hope this could be self-healing. Any thoughts?

schaffino commented 7 months ago

@domenicbove, if it's any help, I had a similar issue, and raising a PR here doesn't seem like it would do any good. Rather than fork the chart, I was able to patch the value into the deployment, since we're using helmfile, which has Kustomize support.
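
A rough sketch of what that patch looks like, assuming helmfile's kustomize-backed jsonPatches support (release, namespace, and chart references are illustrative):

# helmfile.yaml (excerpt)
releases:
  - name: spark-operator
    namespace: spark-operator              # illustrative
    chart: spark-operator/spark-operator
    jsonPatches:
      - target:
          group: apps
          version: v1
          kind: Deployment
          name: spark-operator             # the Deployment rendered by the chart
        patch:
          - op: add
            path: /spec/template/spec/containers/0/args/-
            value: "-webhook-fail-on-error=true"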

Our operator stopped working for a couple of reasons.

We had several deployments of the operator in different namespaces, and they were all overwriting the same webhook config. Previously running operators would start to fail whenever a new one was deployed. This was fixed by properly setting fullnameOverride so each webhook config was unique.
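
Concretely, each release got something like this in its values (the name itself is just an example):

# values.yaml for the release in namespace team-a (illustrative)
fullnameOverride: spark-operator-team-a    # unique per install, so each one
                                           # owns its own webhook config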

We also found that the Helm hooks were causing issues when we deployed, as they would get kicked off but the pods weren't rolled. This meant the running operator no longer had the correct certs and would start to fail. We fixed that by changing the hook values for init and cleanup to remove the pre-upgrade step, and we haven't had any issues since.
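
For us that was a values change along these lines; treat it as a sketch, since the exact keys depend on the chart version (ours exposed the hook annotations for the webhook init and cleanup jobs), so check the chart's values.yaml first:

# values.yaml (sketch; key names assume the chart's webhook.initAnnotations /
# webhook.cleanupAnnotations values)
webhook:
  enable: true
  initAnnotations:
    "helm.sh/hook": pre-install            # pre-upgrade removed
    "helm.sh/hook-weight": "50"
  cleanupAnnotations:
    "helm.sh/hook": pre-delete             # pre-upgrade removed
    "helm.sh/hook-weight": "-1"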

github-actions[bot] commented 1 month ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

github-actions[bot] commented 3 weeks ago

This issue has been automatically closed because it has not had recent activity. Please comment "/reopen" to reopen it.