kubeflow / spark-operator

Kubernetes operator for managing the lifecycle of Apache Spark applications on Kubernetes.

Intermittent Sidecar Injection Failure for SparkApplication Resources #1920

Open potlurip opened 7 months ago

potlurip commented 7 months ago

I am experiencing intermittent failures with the sidecar container injection for SparkApplication resources managed by the Spark Operator. While the sidecar containers are successfully injected and created in approximately 3 out of 5 instances, there are cases where the injection fails without clear errors in the logs or events related to the sidecar creation process.

Environment

Steps to Reproduce

  1. Deploy the Spark Operator with mutating admission webhooks enabled.
  2. Create a SparkApplication manifest including specifications for a sidecar container (see the example manifest after this list).
  3. Apply the SparkApplication manifest using kubectl apply -f sparkapp.yaml.
  4. Observe the creation of Spark application pods and the intermittent absence of the specified sidecar containers.
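
For reference, a minimal sketch of such a manifest is below. The application details, images, and the log-forwarder sidecar are placeholders; the `sidecars` fields follow the v1beta2 SparkApplication CRD, and it is the operator's mutating webhook that applies them to the driver and executor pods at admission time.

```yaml
# Minimal SparkApplication sketch with a sidecar on the driver and executors.
# Names, images, and resource settings are illustrative placeholders.
apiVersion: sparkoperator.k8s.io/v1beta2
kind: SparkApplication
metadata:
  name: spark-pi
  namespace: default
spec:
  type: Scala
  mode: cluster
  image: apache/spark:3.5.0
  mainClass: org.apache.spark.examples.SparkPi
  mainApplicationFile: local:///opt/spark/examples/jars/spark-examples_2.12-3.5.0.jar
  sparkVersion: "3.5.0"
  driver:
    cores: 1
    memory: 512m
    serviceAccount: spark
    sidecars:                      # injected into the driver pod by the mutating webhook
      - name: log-forwarder        # placeholder sidecar
        image: busybox:1.36
        command: ["sh", "-c", "tail -f /dev/null"]
  executor:
    instances: 2
    cores: 1
    memory: 512m
    sidecars:                      # injected into each executor pod by the mutating webhook
      - name: log-forwarder
        image: busybox:1.36
        command: ["sh", "-c", "tail -f /dev/null"]
```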

Expected Behavior

Every applied SparkApplication should consistently result in Spark application pods created with the specified sidecar containers injected.

Actual Behavior

Sidecar containers are only being injected into the Spark application pods in approximately 3 out of 5 attempts. The failures do not coincide with clear errors in the Spark Operator logs, Kubernetes events, or mutating webhook configurations.

Troubleshooting Steps Undertaken

I am seeking guidance on further troubleshooting steps or configurations I might have overlooked. Additionally, any insights into known issues, workarounds, or fixes would be greatly appreciated.

jkleckner commented 7 months ago

Do the sidecar containers not even show up, or are they just not functioning?

I don't know if this will help, but if you can use k8s 1.28 or 1.29, there is a new sidecar lifecycle feature (on by default in 1.29) [1] that might help orchestrate startup and shutdown. I have not been able to try it yet, but I am interested in it for proper graceful termination of pods. We see issues with autoscalers (not in Spark) where a SIGKILL of the istio container brings down the whole pod before the specified graceful termination period elapses.

Check out feature gates PodReadyToStartContainersCondition and SidecarContainers [2].

[1] https://kubernetes.io/docs/concepts/workloads/pods/sidecar-containers/#sidecar-containers-and-pod-lifecycle

[2] https://kubernetes.io/docs/reference/command-line-tools-reference/feature-gates/
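
For context, a minimal sketch of what that feature looks like in a plain pod spec, based on the Kubernetes documentation linked above (names and images are placeholders): a sidecar is declared as an init container with restartPolicy: Always, so it starts before the main containers and keeps running alongside them.

```yaml
# Native sidecar sketch (Kubernetes 1.28+, SidecarContainers feature gate;
# enabled by default in 1.29). Names and images are placeholders.
apiVersion: v1
kind: Pod
metadata:
  name: native-sidecar-demo
spec:
  initContainers:
    - name: log-forwarder          # restartable init container = native sidecar
      image: busybox:1.36
      restartPolicy: Always        # this field is what marks it as a sidecar
      command: ["sh", "-c", "tail -f /dev/null"]
  containers:
    - name: main
      image: busybox:1.36
      command: ["sh", "-c", "echo main work done"]
```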

potlurip commented 7 months ago

Thank you for your response, @jkleckner.

After reviewing the API server logs, I identified an intermittent error that occurs in the instances where sidecar creation fails:

Failed calling webhook, failing open webhook.sparkoperator.k8s.io: failed calling webhook "webhook.sparkoperator.k8s.io": failed to call webhook: Post "https://spark-operator-webhook.spark-operator.svc:443/webhook?timeout=30s": tls: failed to verify certificate: x509: certificate is valid for metrics-server.kube-system.svc, not spark-operator-webhook.spark-operator.svc

It looks like the issue lies in the TLS certificate verification process. The error indicates that the TLS certificate presented by the webhook server was not valid for spark-operator-webhook.spark-operator.svc, the name the Kubernetes API server used to connect; instead, the certificate is valid for metrics-server.kube-system.svc, a different service within the cluster. Since the message also says "failing open", the API server admits the pod without mutation whenever the webhook call fails, which explains why the sidecar is silently missing rather than the pod creation being rejected.

Given that this error occurs only sometimes, and the sidecar is successfully created most of the time, the configuration and certificates are likely correct. I'm confused about the underlying cause. It doesn't seem to be a straightforward TLS misconfiguration, as the issue isn't consistent. I'm still troubleshooting to find the root cause of the issue.
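
For reference, the TLS name the API server verifies comes from the service reference in the webhook configuration (<name>.<namespace>.svc), so the serving certificate's SANs have to cover that name. A partial sketch of the relevant fields is below; the object and service names follow common chart defaults and are assumptions, while the path and port are taken from the error above.

```yaml
# Excerpt of the MutatingWebhookConfiguration fields relevant to this error.
# Object/service names follow common Helm chart defaults and are assumptions;
# compare against `kubectl get mutatingwebhookconfiguration -o yaml`.
apiVersion: admissionregistration.k8s.io/v1
kind: MutatingWebhookConfiguration
metadata:
  name: spark-operator-webhook-config     # assumed name
webhooks:
  - name: webhook.sparkoperator.k8s.io
    failurePolicy: Ignore                  # "failing open": pods are admitted unmutated on error
    clientConfig:
      caBundle: <base64-encoded CA>        # must verify the webhook's serving certificate
      service:
        name: spark-operator-webhook       # the serving cert's SANs must include
        namespace: spark-operator          #   spark-operator-webhook.spark-operator.svc
        path: /webhook
        port: 443
```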

YanivKunda commented 2 months ago

Is it possible this might get resolved by https://github.com/kubeflow/spark-operator/pull/2083?