kubeflow / spark-operator

Kubernetes operator for managing the lifecycle of Apache Spark applications on Kubernetes.
Apache License 2.0

[BUG] Report: ScheduledSparkApplication Fails to Execute After Helm Chart Upgrade #2288

Open TheDevilDan opened 2 weeks ago

TheDevilDan commented 2 weeks ago

Description: I deploy my ScheduledSparkApplication and SparkApplication resources with a custom Helm chart. When I upgrade the chart and change spec.schedule (the execution time or date) of a ScheduledSparkApplication, the new schedule is correctly shown in Rancher. However, the ScheduledSparkApplication does not run when the new scheduled time arrives.

To work around this, I have to manually delete the existing ScheduledSparkApplication and then run the Helm upgrade again to recreate it. After this manual step, the application runs as expected at the new scheduled time.

apiVersion: "sparkoperator.k8s.io/v1beta2"
kind: ScheduledSparkApplication
metadata:
  name: collect-{{ .nameCollector }}
  namespace: {{ $.Release.Namespace }}
spec:
  schedule: {{ .timeScheduled | quote }}
  concurrencyPolicy: Forbid
  template:
    type: Java
[......................................................]

In my values:
  timeScheduled: "0 7 * * *"


Steps to Reproduce

  1. Deploy a ScheduledSparkApplication using a custom Helm chart.
  2. Upgrade the Helm chart and change the spec.schedule (the execution time or date).
  3. Check that the new time is correctly reflected in Rancher (UI shows the updated schedule).
  4. Wait for the scheduled execution time to arrive.
  5. Observe that the Spark scheduled application does not execute at the scheduled time.
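To check step 5 outside the Rancher UI, the resource's status can be read directly. The following is a minimal Python sketch and not part of the original report. It assumes `kubectl` is configured for the cluster, that the v1beta2 ScheduledSparkApplication exposes `spec.schedule`, `status.nextRun`, `status.lastRun` and `status.scheduleState`, and that the operator evaluates the cron expression in UTC so that `status.nextRun` can be compared against the hour and minute of the schedule. The function names, the resource name `collect-example` and the namespace `spark` are hypothetical.

```python
import json
import subprocess
from datetime import datetime


def get_cmd(name, namespace):
    """Build the kubectl command that dumps the resource as JSON."""
    return ["kubectl", "get", "scheduledsparkapplication", name,
            "-n", namespace, "-o", "json"]


def fetch(name, namespace):
    """Run kubectl and return the parsed resource (needs cluster access)."""
    result = subprocess.run(get_cmd(name, namespace), check=True,
                            capture_output=True, text=True)
    return json.loads(result.stdout)


def summarize(obj):
    """Extract the fields relevant to scheduling; missing ones become None."""
    spec = obj.get("spec", {})
    status = obj.get("status", {})
    return {
        "schedule": spec.get("schedule"),
        "nextRun": status.get("nextRun"),
        "lastRun": status.get("lastRun"),
        "scheduleState": status.get("scheduleState"),
    }


def next_run_matches(obj, hour, minute):
    """Return True if status.nextRun falls on the given UTC hour and minute.

    Assumption: nextRun is an RFC 3339 timestamp such as
    "2024-11-05T07:00:00Z" and the operator schedules in UTC.
    """
    next_run = obj.get("status", {}).get("nextRun")
    if not next_run:
        return False
    parsed = datetime.fromisoformat(next_run.replace("Z", "+00:00"))
    return (parsed.hour, parsed.minute) == (hour, minute)
```

For the schedule "0 7 * * *", calling `next_run_matches(fetch("collect-example", "spark"), 7, 0)` after the Helm upgrade would show whether the operator recomputed the next run. If `spec.schedule` shows the new value while `status.nextRun` still reflects the old one, that would be consistent with the behavior described above.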

Expected Behavior

The Spark scheduled application should execute at the new time after upgrading the Helm chart and modifying the schedule, without the need to manually delete and recreate the scheduled Spark application.

Actual Behavior

The Spark scheduled application does not execute at the updated time unless the existing scheduled application is deleted and recreated through a Helm upgrade.

Environment Details

  - Kubernetes version: 1.28
  - Spark-operator version: 2.0.2
  - Helm chart version: spark-operator-2.0.2
  - Rancher version: 2.9.2

Possible Cause

It seems that the Helm upgrade correctly updates the schedule on the resource (the new value is visible in Rancher), but the scheduled Spark application does not receive the trigger or update it needs to run the job at the newly set time.

Workaround

The current workaround is to manually delete the existing scheduled Spark applications and perform the Helm upgrade again. This recreates the scheduled applications with the correct execution time, allowing them to run as expected.
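The delete-then-upgrade workaround can be scripted. This is a hedged Python sketch rather than the exact commands used in the report: it assumes `kubectl` and `helm` are on the PATH and point at the right cluster, and the function names, release name, chart path, values file and resource name are all hypothetical placeholders. By default it only prints the commands (dry run).

```python
import subprocess


def workaround_commands(name, namespace, release, chart, values_file):
    """Return the two commands of the workaround, in order."""
    return [
        # 1. Delete the existing ScheduledSparkApplication.
        ["kubectl", "delete", "scheduledsparkapplication", name,
         "-n", namespace, "--ignore-not-found"],
        # 2. Run the Helm upgrade again so the chart recreates it.
        ["helm", "upgrade", release, chart,
         "-n", namespace, "-f", values_file],
    ]


def run(commands, dry_run=True):
    """Print each command; execute it only when dry_run is False."""
    for cmd in commands:
        print(" ".join(cmd))
        if not dry_run:
            subprocess.run(cmd, check=True)
```

For example, `run(workaround_commands("collect-example", "spark", "my-release", "./my-chart", "values.yaml"))` prints both commands without touching the cluster. Passing `dry_run=False` executes them and stops at the first failure.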

TheDevilDan commented 1 week ago

Could this be related to https://github.com/kubeflow/spark-operator/issues/2285, since you have to remove the resources and upgrade the Helm chart again to recreate them (whatever their status)?