kubeflow / spark-operator

Kubernetes operator for managing the lifecycle of Apache Spark applications on Kubernetes.

Intermittent Sidecar Injection Failure for SparkApplication Resources #1920

Open potlurip opened 7 months ago

potlurip commented 7 months ago

I am experiencing intermittent failures with the sidecar container injection for SparkApplication resources managed by the Spark Operator. While the sidecar containers are successfully injected and created in approximately 3 out of 5 instances, there are cases where the injection fails without clear errors in the logs or events related to the sidecar creation process.

Environment

Steps to Reproduce

  1. Deploy the Spark Operator with mutating admission webhooks enabled.
  2. Create a SparkApplication manifest including specifications for a sidecar container (see the example manifest after this list).
  3. Apply the SparkApplication manifest using kubectl apply -f sparkapp.yaml.
  4. Observe the creation of Spark application pods and the intermittent absence of the specified sidecar containers.
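
For reference, a minimal sketch of such a manifest is below. The application details, images, and the log-forwarder sidecar are placeholders; the `sidecars` fields follow the v1beta2 SparkApplication CRD, and it is the operator's mutating webhook that applies them to the driver and executor pods at admission time.

```yaml
# Minimal SparkApplication sketch with a sidecar on the driver and executors.
# Names, images, and resource settings are illustrative placeholders.
apiVersion: sparkoperator.k8s.io/v1beta2
kind: SparkApplication
metadata:
  name: spark-pi
  namespace: default
spec:
  type: Scala
  mode: cluster
  image: apache/spark:3.5.0
  mainClass: org.apache.spark.examples.SparkPi
  mainApplicationFile: local:///opt/spark/examples/jars/spark-examples_2.12-3.5.0.jar
  sparkVersion: "3.5.0"
  driver:
    cores: 1
    memory: 512m
    serviceAccount: spark
    sidecars:                      # injected into the driver pod by the mutating webhook
      - name: log-forwarder        # placeholder sidecar
        image: busybox:1.36
        command: ["sh", "-c", "tail -f /dev/null"]
  executor:
    instances: 2
    cores: 1
    memory: 512m
    sidecars:                      # injected into each executor pod by the mutating webhook
      - name: log-forwarder
        image: busybox:1.36
        command: ["sh", "-c", "tail -f /dev/null"]
```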

Expected Behavior

Every applied SparkApplication should consistently result in Spark application pods created with the specified sidecar containers injected.

Actual Behavior

Sidecar containers are only being injected into the Spark application pods in approximately 3 out of 5 attempts. The failures do not coincide with clear errors in the Spark Operator logs, Kubernetes events, or mutating webhook configurations.

Troubleshooting Steps Undertaken

I am seeking guidance on further troubleshooting steps or configurations I might have overlooked. Additionally, any insights into known issues, workarounds, or fixes would be greatly appreciated.

jkleckner commented 7 months ago

Do the sidecar containers not even show up, or are they just not functioning?

I don't know if this will help, but if you can use k8s 1.28 or 1.29, there is a new sidecar lifecycle feature (on by default in 1.29) [1] that might help orchestrate startup and shutdown. I have not been able to try it yet, but I am interested in it for proper graceful termination of pods. We see issues with autoscalers (not in Spark) where a SIGKILL of the istio container brings down the whole pod before the specified graceful termination period elapses.

Check out feature gates PodReadyToStartContainersCondition and SidecarContainers [2].

[1] https://kubernetes.io/docs/concepts/workloads/pods/sidecar-containers/#sidecar-containers-and-pod-lifecycle

[2] https://kubernetes.io/docs/reference/command-line-tools-reference/feature-gates/
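
For context, a minimal sketch of what that feature looks like in a plain pod spec, based on the Kubernetes documentation linked above (names and images are placeholders): a sidecar is declared as an init container with restartPolicy: Always, so it starts before the main containers and keeps running alongside them.

```yaml
# Native sidecar sketch (Kubernetes 1.28+, SidecarContainers feature gate;
# enabled by default in 1.29). Names and images are placeholders.
apiVersion: v1
kind: Pod
metadata:
  name: native-sidecar-demo
spec:
  initContainers:
    - name: log-forwarder          # restartable init container = native sidecar
      image: busybox:1.36
      restartPolicy: Always        # this field is what marks it as a sidecar
      command: ["sh", "-c", "tail -f /dev/null"]
  containers:
    - name: main
      image: busybox:1.36
      command: ["sh", "-c", "echo main work done"]
```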

potlurip commented 7 months ago

Thank you for your response, @jkleckner.

After reviewing the API server logs, I identified an intermittent error that occurs in the instances where sidecar creation fails:

Failed calling webhook, failing open webhook.sparkoperator.k8s.io: failed calling webhook "webhook.sparkoperator.k8s.io": failed to call webhook: Post "https://spark-operator-webhook.spark-operator.svc:443/webhook?timeout=30s": tls: failed to verify certificate: x509: certificate is valid for metrics-server.kube-system.svc, not spark-operator-webhook.spark-operator.svc

It looks like the issue lies in the TLS certificate verification process. The error indicates that the TLS certificate presented by the webhook server was not valid for spark-operator-webhook.spark-operator.svc, the name the Kubernetes API server used to connect; instead, the certificate is valid for metrics-server.kube-system.svc, a different service within the cluster. Since the message also says "failing open", the API server admits the pod without mutation whenever the webhook call fails, which explains why the sidecar is silently missing rather than the pod creation being rejected.

Given that this error occurs only sometimes, and the sidecar is successfully created most of the time, the configuration and certificates are likely correct. I'm confused about the underlying cause. It doesn't seem to be a straightforward TLS misconfiguration, as the issue isn't consistent. I'm still troubleshooting to find the root cause of the issue.
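
For reference, the TLS name the API server verifies comes from the service reference in the webhook configuration (<name>.<namespace>.svc), so the serving certificate's SANs have to cover that name. A partial sketch of the relevant fields is below; the object and service names follow common chart defaults and are assumptions, while the path and port are taken from the error above.

```yaml
# Excerpt of the MutatingWebhookConfiguration fields relevant to this error.
# Object/service names follow common Helm chart defaults and are assumptions;
# compare against `kubectl get mutatingwebhookconfiguration -o yaml`.
apiVersion: admissionregistration.k8s.io/v1
kind: MutatingWebhookConfiguration
metadata:
  name: spark-operator-webhook-config     # assumed name
webhooks:
  - name: webhook.sparkoperator.k8s.io
    failurePolicy: Ignore                  # "failing open": pods are admitted unmutated on error
    clientConfig:
      caBundle: <base64-encoded CA>        # must verify the webhook's serving certificate
      service:
        name: spark-operator-webhook       # the serving cert's SANs must include
        namespace: spark-operator          #   spark-operator-webhook.spark-operator.svc
        path: /webhook
        port: 443
```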

YanivKunda commented 2 months ago

Is it possible this might get resolved by https://github.com/kubeflow/spark-operator/pull/2083?