Race condition with webhook admission

We recently observed a race condition between webhook registration and deregistration when the Spark operator runs in HA mode. We think this could also happen in standalone mode.

Our operator is configured with a replica count of 3 and the following webhook flags:

      -enable-webhook=true
      -webhook-svc-namespace=stg-tools
      -webhook-port=8080
      -webhook-svc-name=stg-sparkoperator-webhook
      -webhook-config-name=stg-sparkoperator-webhook-config
      -webhook-namespace-selector=sparkoperator.webhook.selector=stg-data
      -leader-election=true
      -leader-election-lock-namespace=stg-tools
      -leader-election-lock-name=spark-operator-lock

When upgrading the Spark operator, we noticed that webhook registration and deregistration happen within a short time frame. There is a high chance that the webhook admission configuration ends up removed when termination of the old ReplicaSet takes too long: an old replica deregisters the webhook after the new replicas have already started.

      Aug 24 10:12:43 sparkoperator info I0824 17:12:43.711533       9 webhook.go:235] Webhook prod-sparkoperator-webhook-config deregistered
      Aug 24 10:12:51 sparkoperator info I0824 17:12:51.544398       9 webhook.go:218] Starting the Spark admission webhook server
      Aug 24 10:12:51 sparkoperator info I0824 17:12:51.613615       9 webhook.go:218] Starting the Spark admission webhook server
      Aug 24 10:12:52 sparkoperator info I0824 17:12:52.313517       9 webhook.go:218] Starting the Spark admission webhook server
      Aug 24 10:13:02 sparkoperator info I0824 17:13:02.262332       9 webhook.go:235] Webhook prod-sparkoperator-webhook-config deregistered

As a result, SparkApplications do not receive the proper configuration, such as ConfigMap mounts. This is possibly related to other issues such as https://github.com/GoogleCloudPlatform/spark-on-k8s-operator/issues/1000
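
To make the suspected interleaving concrete, here is a minimal Go sketch that replays the log timeline against a single shared flag standing in for the cluster-wide webhook configuration. It is a simulation, not operator code. Two things are assumptions on our side: that each "Starting the Spark admission webhook server" line also (re)creates the webhook configuration, and that the two "deregistered" lines come from old replicas shutting down (the pod labels `old-1`, `new-1`, etc. are hypothetical, since the logs do not identify pods).

```go
package main

import "fmt"

// event is one step in a simplified rollout timeline.
type event struct {
	at     string // timestamp copied from the log excerpt
	pod    string // hypothetical pod label; the logs do not identify pods
	action string // "register" or "deregister"
}

// replay applies the events in order to one shared flag that stands in for
// the webhook configuration object, and reports whether the configuration
// still exists after the last event.
func replay(initiallyRegistered bool, events []event) bool {
	registered := initiallyRegistered
	for _, e := range events {
		switch e.action {
		case "register":
			registered = true
		case "deregister":
			registered = false
		}
	}
	return registered
}

func main() {
	// Assumption: "Starting the Spark admission webhook server" implies a
	// registration, and the "deregistered" lines come from old replicas.
	timeline := []event{
		{"17:12:43", "old-1", "deregister"},
		{"17:12:51", "new-1", "register"},
		{"17:12:51", "new-2", "register"},
		{"17:12:52", "new-3", "register"},
		{"17:13:02", "old-2", "deregister"}, // slow termination wins the race
	}
	fmt.Println("webhook present after rollout:", replay(true, timeline))
}
```

Under these assumptions the program prints `webhook present after rollout: false`: the last writer is a terminating old replica, so the new replicas keep running without a registered webhook.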

Race condition with webhook admission #1005