kubeflow / spark-operator

Kubernetes operator for managing the lifecycle of Apache Spark applications on Kubernetes.
Apache License 2.0

Race condition with webhook admission #1005

Open andizzle opened 4 years ago

andizzle commented 4 years ago

We recently observed a race condition between webhook registration and deregistration when the Spark operator runs in HA mode. We think this could also happen in standalone mode.

Our operator is configured with a replica count of 3 and the following webhook configuration:

      -enable-webhook=true
      -webhook-svc-namespace=stg-tools
      -webhook-port=8080
      -webhook-svc-name=stg-sparkoperator-webhook
      -webhook-config-name=stg-sparkoperator-webhook-config
      -webhook-namespace-selector=sparkoperator.webhook.selector=stg-data
      -leader-election=true
      -leader-election-lock-namespace=stg-tools
      -leader-election-lock-name=spark-operator-lock

When upgrading the Spark operator, we noticed that webhook registration and deregistration happen within a short time frame. There is a high chance of the admission webhook being removed when termination of the old ReplicaSet takes too long:

Aug 24 10:12:43 sparkoperator info I0824 17:12:43.711533       9 webhook.go:235] Webhook prod-sparkoperator-webhook-config deregistered
Aug 24 10:12:51 sparkoperator info I0824 17:12:51.544398       9 webhook.go:218] Starting the Spark admission webhook server
Aug 24 10:12:51 sparkoperator info I0824 17:12:51.613615       9 webhook.go:218] Starting the Spark admission webhook server
Aug 24 10:12:52 sparkoperator info I0824 17:12:52.313517       9 webhook.go:218] Starting the Spark admission webhook server
Aug 24 10:13:02 sparkoperator info I0824 17:13:02.262332       9 webhook.go:235] Webhook prod-sparkoperator-webhook-config deregistered
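
The ordering in these log lines can be checked mechanically. The following is only a rough sketch: it assumes that the "Starting the Spark admission webhook server" line marks the point where a new replica has its webhook in place (the log excerpt does not show an explicit registration message), and the helper names `parse_events` and `deregistered_after_last_start` are made up for illustration.

```python
import re

# Log excerpt copied from the report above.
LOG = """\
Aug 24 10:12:43 sparkoperator info I0824 17:12:43.711533       9 webhook.go:235] Webhook prod-sparkoperator-webhook-config deregistered
Aug 24 10:12:51 sparkoperator info I0824 17:12:51.544398       9 webhook.go:218] Starting the Spark admission webhook server
Aug 24 10:12:51 sparkoperator info I0824 17:12:51.613615       9 webhook.go:218] Starting the Spark admission webhook server
Aug 24 10:12:52 sparkoperator info I0824 17:12:52.313517       9 webhook.go:218] Starting the Spark admission webhook server
Aug 24 10:13:02 sparkoperator info I0824 17:13:02.262332       9 webhook.go:235] Webhook prod-sparkoperator-webhook-config deregistered
"""

# Captures the HH:MM:SS.micros timestamp and the message after "file:line] ".
LINE_RE = re.compile(r"I\d{4} (\d{2}:\d{2}:\d{2}\.\d+)\s+\d+ \S+\] (.*)")


def classify(message):
    """Map a log message to 'start', 'deregister', or None."""
    if message.endswith("deregistered"):
        return "deregister"
    if message.startswith("Starting the Spark admission webhook server"):
        # Assumption: server start is used as a proxy for registration.
        return "start"
    return None


def parse_events(log_text):
    """Return (timestamp, kind) tuples sorted by timestamp."""
    events = []
    for line in log_text.splitlines():
        match = LINE_RE.search(line)
        if not match:
            continue
        kind = classify(match.group(2))
        if kind:
            events.append((match.group(1), kind))
    # Zero-padded times within a single day sort correctly as strings.
    return sorted(events)


def deregistered_after_last_start(events):
    """True if any deregistration is logged after the latest server start."""
    starts = [t for t, kind in events if kind == "start"]
    if not starts:
        return False
    last_start = max(starts)
    return any(kind == "deregister" and t > last_start for t, kind in events)


if __name__ == "__main__":
    print(deregistered_after_last_start(parse_events(LOG)))
```

On the excerpt above, the final deregistration at 17:13:02 comes after all three server starts, which is the pattern that leaves the cluster without the webhook configuration.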

This results in SparkApplications not receiving the expected configuration, such as ConfigMap mounts. It is possibly related to other issues such as https://github.com/GoogleCloudPlatform/spark-on-k8s-operator/issues/1000
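
To make the ordering problem concrete, here is a toy model, not the operator's actual code. It assumes all replicas share a single webhook configuration object (as the shared `-webhook-config-name` suggests), that each replica registers it on startup and deregisters it on shutdown, and that the last write wins. The function name and the timings are hypothetical.

```python
def final_webhook_state(events):
    """Replay (time_seconds, action) events in time order.

    Returns True if the shared webhook configuration exists at the end.
    Assumption: registration and deregistration act on one shared object,
    and whichever happens last determines the final state.
    """
    registered = False
    for _, action in sorted(events):
        if action == "register":
            registered = True
        elif action == "deregister":
            registered = False
    return registered


# Old replicas terminate before the new ones come up: webhook survives.
fast_termination = [(0, "deregister"), (8, "register")]

# One old replica terminates slowly and deregisters after the new
# replicas have registered: webhook is gone although new pods are running.
slow_termination = [(0, "deregister"), (8, "register"), (19, "deregister")]

if __name__ == "__main__":
    print("fast:", final_webhook_state(fast_termination))
    print("slow:", final_webhook_state(slow_termination))
```

In the slow case the new replicas keep running, but nothing registers the webhook again, so pods created afterwards are not mutated.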

github-actions[bot] commented 2 weeks ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.