We recently observed a race between admission webhook registration and deregistration when the Spark operator runs in HA mode. We believe it can also happen in standalone mode.
Our operator is configured with a replica count of 3 and a webhook config like so:
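The actual config was not included above; purely for illustration, an HA deployment of the operator via its Helm chart would set roughly the following values (the field names here are an assumption based on the spark-on-k8s-operator chart, not taken from the report):

```yaml
# Hypothetical Helm values for an HA spark-operator deployment (illustrative only)
replicaCount: 3

webhook:
  enable: true          # register the mutating admission webhook on startup
  port: 8080            # port the webhook server listens on

leaderElection:
  lockName: spark-operator-lock   # so only one replica reconciles at a time
```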
Upon upgrading the Spark operator, we noticed that webhook registration and deregistration happen within a narrow time window, so there is a high chance the webhook admission configuration gets removed when termination of the old ReplicaSet takes too long.
Aug 24 10:12:43 sparkoperator info I0824 17:12:43.711533 9 webhook.go:235] Webhook prod-sparkoperator-webhook-config deregistered
Aug 24 10:12:51 sparkoperator info I0824 17:12:51.544398 9 webhook.go:218] Starting the Spark admission webhook server
Aug 24 10:12:51 sparkoperator info I0824 17:12:51.613615 9 webhook.go:218] Starting the Spark admission webhook server
Aug 24 10:12:52 sparkoperator info I0824 17:12:52.313517 9 webhook.go:218] Starting the Spark admission webhook server
Aug 24 10:13:02 sparkoperator info I0824 17:13:02.262332 9 webhook.go:235] Webhook prod-sparkoperator-webhook-config deregistered
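The timeline in these logs can be reproduced with a small simulation (a sketch, not operator code; the replica names are hypothetical and the event order is taken from the log lines above). Each replica registers the webhook config by name on startup and deregisters it unconditionally on shutdown, so a slow-terminating old replica deletes the config the new replicas just created:

```python
# Stand-in for the API server's set of live MutatingWebhookConfiguration names.
registry = set()

NAME = "prod-sparkoperator-webhook-config"

# Events in the order they appear in the logs above:
events = [
    ("deregister", "old-replica-1"),   # 17:12:43  old pod shutting down
    ("register",   "new-replica-1"),   # 17:12:51
    ("register",   "new-replica-2"),   # 17:12:51
    ("register",   "new-replica-3"),   # 17:12:52
    ("deregister", "old-replica-2"),   # 17:13:02  slow-terminating old pod
]

for action, replica in events:
    if action == "register":
        registry.add(NAME)
    else:
        registry.discard(NAME)  # unconditional delete, keyed only by name

# The last slow-terminating old replica removed the config the new replicas
# had just re-created, so admission is silently disabled.
print(NAME in registry)  # → False
```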
This results in SparkApplications not receiving proper configuration such as ConfigMap mounting, and it is possibly related to other issues such as https://github.com/GoogleCloudPlatform/spark-on-k8s-operator/issues/1000
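One possible mitigation (a sketch, assuming deregistration can carry a UID precondition, as Kubernetes deletes can via `metav1.Preconditions`; `FakeAPIServer` here is a hypothetical stand-in, not operator code) is for each replica to remember the UID of the object it created and deregister only if the live object still has that UID. A slow-terminating old replica then cannot delete the config a new replica re-created:

```python
import uuid

class FakeAPIServer:
    """Stand-in for the API server: stores configs as name -> uid."""
    def __init__(self):
        self.configs = {}

    def create(self, name):
        uid = str(uuid.uuid4())  # every (re)created object gets a fresh UID
        self.configs[name] = uid
        return uid

    def delete(self, name, expected_uid):
        # Mirrors a delete with a UID precondition: it only succeeds if the
        # live object is still the exact one the caller created.
        if self.configs.get(name) == expected_uid:
            del self.configs[name]
            return True
        return False

api = FakeAPIServer()
NAME = "prod-sparkoperator-webhook-config"

old_uid = api.create(NAME)   # old replica registered the webhook
new_uid = api.create(NAME)   # upgrade: new replica re-creates it (new UID)

# The slow-terminating old replica deregisters on shutdown, but its UID
# precondition no longer matches, so the delete is a no-op.
removed = api.delete(NAME, old_uid)
print(removed, NAME in api.configs)  # → False True
```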