Open potlurip opened 7 months ago
Do the other pods not even show up or are they just not functioning?
I don't know if this will help, but if you can use k8s 1.28 or 1.29, there is a new sidecar lifecycle feature (on by default in 1.29) [1] that might help orchestrate startup and shutdown. I have not been able to try it yet, but am interested for proper graceful termination for pods. We experience issues with autoscalers (not in Spark) where the SIGKILL of the istio container brings down the whole pod before the specified graceful termination period.
Check out the feature gates `PodReadyToStartContainersCondition` and `SidecarContainers` [2].
[2] https://kubernetes.io/docs/reference/command-line-tools-reference/feature-gates/
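For reference, with the `SidecarContainers` feature enabled, a sidecar is declared as an init container with `restartPolicy: Always`, so the kubelet starts it before the main containers and stops it after they exit. A minimal sketch (the container names and images are placeholders, not anything Spark-specific):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: native-sidecar-demo
spec:
  initContainers:
    - name: log-forwarder            # hypothetical sidecar
      image: fluent/fluent-bit:2.2
      restartPolicy: Always          # marks this init container as a sidecar (k8s >= 1.28)
  containers:
    - name: main
      image: busybox:1.36
      command: ["sh", "-c", "echo working; sleep 3600"]
```

Because the kubelet tears the sidecar down only after the main containers terminate, this should avoid the SIGKILL-before-grace-period problem described above.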
Thank you for your response, @jkleckner.
After reviewing the API server logs, I identified an intermittent error that occurs in the instances when sidecar creation fails:
```
Failed calling webhook, failing open webhook.sparkoperator.k8s.io: failed calling webhook "webhook.sparkoperator.k8s.io": failed to call webhook: Post "https://spark-operator-webhook.spark-operator.svc:443/webhook?timeout=30s": tls: failed to verify certificate: x509: certificate is valid for metrics-server.kube-system.svc, not spark-operator-webhook.spark-operator.svc
```
Looks like the issue lies in TLS certificate verification: the certificate presented by the webhook server was not valid for `spark-operator-webhook.spark-operator.svc`, the name the Kubernetes API server tried to connect to. Instead, the certificate is valid for `metrics-server.kube-system.svc`, a different service within the cluster.
Given that this error occurs only sometimes, and the sidecar is successfully created most of the time, the configuration and certificates are likely correct. I'm confused about the underlying cause. It doesn't seem to be a straightforward TLS misconfiguration, as the issue isn't consistent. I'm still troubleshooting to find the root cause of the issue.
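One way to narrow this down is to check which certificate the webhook Service is actually serving at the moment a failure happens. A sketch, assuming `openssl` is available locally and that port 443 is the Service port (both assumptions):

```shell
# Forward the webhook Service locally (local port 8443 is arbitrary)
kubectl -n spark-operator port-forward svc/spark-operator-webhook 8443:443 &

# Print the subject and SANs of the certificate actually served; if it
# mentions metrics-server.kube-system.svc, the Service is reaching the
# wrong backend or a stale certificate
openssl s_client -connect localhost:8443 \
  -servername spark-operator-webhook.spark-operator.svc </dev/null 2>/dev/null \
  | openssl x509 -noout -subject -ext subjectAltName
```

Since the failure is intermittent, it is also worth checking that the Service's Endpoints only select the operator's webhook pods (`kubectl -n spark-operator get endpoints spark-operator-webhook -o wide`); a mismatch there would explain why only some calls hit the wrong certificate.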
Is it possible this might get resolved by https://github.com/kubeflow/spark-operator/pull/2083?
I am experiencing intermittent failures with sidecar container injection for `SparkApplication` resources managed by the Spark Operator. While the sidecar containers are successfully injected and created in approximately 3 out of 5 instances, there are cases where the injection fails without clear errors in the logs or events related to the sidecar creation process.

Environment
Steps to Reproduce
1. Create a `SparkApplication` manifest including specifications for a sidecar container.
2. Apply the `SparkApplication` manifest using `kubectl apply -f sparkapp.yaml`.

Expected Behavior
Every `SparkApplication` resource should result in Spark application pods with the specified sidecar containers injected consistently.

Actual Behavior
Sidecar containers are only being injected into the Spark application pods in approximately 3 out of 5 attempts. The failures do not coincide with clear errors in the Spark Operator logs, Kubernetes events, or mutating webhook configurations.
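For context, the sidecar in question is specified under the driver/executor pod spec of the manifest. A trimmed sketch of the shape (the names, images, and versions are placeholders):

```yaml
apiVersion: sparkoperator.k8s.io/v1beta2
kind: SparkApplication
metadata:
  name: spark-app-with-sidecar
  namespace: default
spec:
  type: Scala
  mode: cluster
  image: spark:3.5.0
  mainClass: org.example.Main                  # placeholder
  mainApplicationFile: local:///opt/spark/app.jar
  driver:
    cores: 1
    sidecars:                                  # injected by the mutating webhook
      - name: log-forwarder                    # hypothetical sidecar
        image: fluent/fluent-bit:2.2
  executor:
    instances: 2
    sidecars:
      - name: log-forwarder
        image: fluent/fluent-bit:2.2
```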
Troubleshooting Steps Undertaken
- Verified that the `SparkApplication` manifests are correctly formatted and consistent across successful and unsuccessful attempts.

I am seeking guidance on further troubleshooting steps or configurations I might have overlooked. Additionally, any insights into known issues, workarounds, or fixes would be greatly appreciated.
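One more thing worth checking: the API server log mentions "failing open", which suggests the webhook's `failurePolicy` is `Ignore`. With `Ignore`, a failed webhook call silently skips injection instead of rejecting the pod, which would match "no clear errors". A sketch (the configuration name below is an assumption; list first to find the real one):

```shell
# Find the operator's mutating webhook configuration
kubectl get mutatingwebhookconfigurations

# Inspect its failurePolicy per webhook: "Ignore" means injection is
# silently skipped when the webhook call fails; "Fail" would instead
# surface the error on pod creation
kubectl get mutatingwebhookconfiguration spark-operator-webhook \
  -o jsonpath='{range .webhooks[*]}{.name}{"\t"}{.failurePolicy}{"\n"}{end}'
```

Switching the policy to `Fail` (if acceptable for the cluster) would at least turn the silent 2-in-5 failures into visible errors while the TLS issue is being tracked down.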