kubeflow / spark-operator

Kubernetes operator for managing the lifecycle of Apache Spark applications on Kubernetes.
Apache License 2.0

ScheduledSparkApplication stopped creating new SparkApplication instances after failure #787

Open duongnt opened 4 years ago

duongnt commented 4 years ago

I have a ScheduledSparkApplication instance that was running well. However, at some point it stopped creating new SparkApplication instances. Here is a relevant excerpt from the log:

I 2020-02-03T04:43:42.570749Z Next run of ScheduledSparkApplication spark-operator/spark-app is due, creating a new SparkApplication instance 
I 2020-02-03T04:43:42.597566Z SparkApplication spark-operator/spark-app-1580705022570653626 was added, enqueueing it for submission 
I 2020-02-03T04:43:42.597779Z Starting processing key: "spark-operator/spark-app-1580705022570653626" 
I 2020-02-03T04:43:42.598127Z Event(v1.ObjectReference{Kind:"SparkApplication", Namespace:"spark-operator", Name:"spark-app-1580705022570653626", UID:"c12a53b7-463f-11ea-b43e-42010a4c005b", APIVersion:"sparkoperator.k8s.io/v1beta2", ResourceVersion:"116694103", FieldPath:""}): type: 'Normal' reason: 'SparkApplicationAdded' SparkApplication spark-app-1580705022570653626 was added, enqueuing it for submission 
I 2020-02-03T04:43:42.598367Z spark-submit arguments: [/opt/spark/bin/spark-submit --class sparkApp] 
I 2020-02-03T04:43:42.625613Z Syncing ScheduledSparkApplication spark-operator/spark-app 
I 2020-02-03T04:43:46.286911Z Pod spark-app-1580705022570653626-driver in namespace spark-operator is subject to mutation 
I 2020-02-03T04:43:46.299988Z Pod spark-app-1580705022570653626-driver added in namespace spark-operator. 
I 2020-02-03T04:43:46.319535Z Pod spark-app-1580705022570653626-driver updated in namespace spark-operator. 
I 2020-02-03T04:43:46.341239Z Pod spark-app-1580705022570653626-driver updated in namespace spark-operator. 
I 2020-02-03T04:43:46.857410Z Pod spark-app-1580705022570653626-driver updated in namespace spark-operator. 
I 2020-02-03T04:43:46.865382Z Pod spark-app-1580705022570653626-driver deleted in namespace spark-operator. 
W 2020-02-03T04:43:46.960530Z trying to resubmit an already submitted SparkApplication spark-operator/spark-app-1580705022570653626 
I 2020-02-03T04:43:46.960574Z Trying to update SparkApplication spark-operator/spark-app-1580705022570653626, from: [{  0001-01-01 00:00:00 +0000 UTC 0001-01-01 00:00:00 +0000 UTC { 0    } { } map[] 0 0}] to [{  0001-01-01 00:00:00 +0000 UTC 0001-01-01 00:00:00 +0000 UTC { 0    } { } map[] 0 0}] 
I 2020-02-03T04:43:46.960830Z Ending processing key: "spark-operator/spark-app-1580705022570653626" 
I 2020-02-03T04:44:12.570804Z Syncing ScheduledSparkApplication spark-operator/spark-app 
I 2020-02-03T04:44:42.571031Z Syncing ScheduledSparkApplication spark-operator/spark-app 

The status of ScheduledSparkApplication looks like this:

status:
  lastRun: "2020-02-03T04:43:42Z"
  lastRunName: spark-app-1580705022570653626
  nextRun: "2020-02-03T04:48:42Z"
  pastFailedRunNames:
  - spark-app-1580686722390076571
  pastSuccessfulRunNames:
  - spark-app-1580704722567671035
  - spark-app-1580704422565140917
  - spark-app-1580704122562581879
  - spark-app-1580703822559473339
  - spark-app-1580703522556632453
  - spark-app-1580703222553982770
  - spark-app-1580702922551117600
  - spark-app-1580702622548623812
  - spark-app-1580702322546336470
  - spark-app-1580702022543189697
  scheduleState: Scheduled

So it looks like the run spark-app-1580705022570653626 is considered neither failed nor successful, and that blocks the next run from starting. I'm not sure why spark-app-1580705022570653626 wasn't able to start in the first place either, but it definitely left the ScheduledSparkApplication in a bad state.
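
In case anyone else gets stuck in the same state, the only manual unblocking step I can think of is sketched below. It assumes the controller treats a missing last run as finished on its next sync, which I have not verified against the controller code, so take it as a guess rather than a fix:

> kubectl -n spark-operator get sparkapplication spark-app-1580705022570653626 -o yaml
> kubectl -n spark-operator delete sparkapplication spark-app-1580705022570653626

The first command is just to confirm that the run's .status is still empty before deleting it; after the delete, the ScheduledSparkApplication should (hopefully) create a fresh run on its next sync.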

danielfree commented 4 years ago

We have seen similar situations with ScheduledSparkApplication; the new driver pod did not come up due to errors like these:

Unable to mount volumes for pod "app-1583110801067267339-driver_spark-app(27f42810-5c21-11ea-8dc4-525400048875)": timeout expired waiting for volumes to attach or mount for pod "spark-app"/"app-1583110801067267339-driver". list of unmounted volumes=[spark-conf-volume]. list of unattached volumes=[spark-local-dir-1 spark-conf-volume spark-spark-token-bmdlg]

MountVolume.SetUp failed for volume "spark-conf-volume" : configmaps "app-1583083800800289009-1583083803653-driver-conf-map" not found
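
For anyone hitting the same errors, here are a couple of checks that should show whether the driver conf map is really gone (the namespace and names below are copied from the errors above; whether the missing conf map is the root cause or just a symptom is still unclear to us):

> kubectl -n spark-app get events --field-selector involvedObject.name=app-1583110801067267339-driver
> kubectl -n spark-app get configmaps | grep driver-conf-map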

gouthamssc commented 4 years ago

@danielfree we are also facing the same issue. Did you find a way to make it work?

novotl commented 4 years ago

We are facing a similar issue, and we had no failures. It seems to me like the link between the driver pod and the SparkApplication is broken, since they end up in an inconsistent state. Anyway, here is our ScheduledSparkApplication that stopped scheduling:

> kubectl -n spark describe scheduledsparkapplications some-spark-app
...
Status:
  Last Run:       2020-04-23T04:47:45Z
  Last Run Name:  some-spark-app-1587617265304602851
  Next Run:       2020-04-23T05:47:45Z
  Past Successful Run Names:
    some-spark-app-1587613665284846569
  Schedule State:  Scheduled

The snapshot above was taken well past 2020-04-23T05:47:45Z (the time of the next run, which never happened). Previously, there were two successful runs.

> kubectl -n spark get pods
some-spark-app-1587613665284846569-driver   0/1     Completed   0          6h32m
some-spark-app-1587617265304602851-driver   0/1     Completed   0          5h30m

To my understanding, the ScheduledSparkApplication spawns the SparkApplication, which in turn spawns the driver and executor pods (a quick ownerReferences check is sketched after the next output). This is where something weird is happening. The SparkApplication from the first run is:

> kubectl -n spark describe sparkapplication some-spark-app-1587613665284846569
...
Status:
  Application State:
    State:  COMPLETED
  Driver Info:
    Pod Name:             some-spark-app-1587613665284846569-driver
    Web UI Address:       172.20.142.172:4040
    Web UI Port:          4040
    Web UI Service Name:  some-spark-app-1587613665284846569-ui-svc
  Execution Attempts:     1
  Executor State:
    some-spark-app-1587613665284846569-1587613666789-exec-1:  FAILED
    some-spark-app-1587613665284846569-1587613666789-exec-2:  FAILED
  Last Submission Attempt Time:                                       2020-04-23T03:47:49Z
  Spark Application Id:                                               spark-c9f66b4148fd49a692d64b90652d7ada
  Submission Attempts:                                                1
  Submission ID:                                                      b3aca58d-f1ad-4e54-89e1-1ba11f779cda
  Termination Time:                                                   2020-04-23T03:48:27Z
Events:                                                               <none>
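
(Aside: to double-check that ownership chain, the ownerReferences should point back one level at each step. This assumes the operator actually sets them, both on the SparkApplication it creates for each run and, via the webhook, on the driver pod; I have not confirmed that in the code.)

> kubectl -n spark get sparkapplication some-spark-app-1587613665284846569 -o jsonpath='{.metadata.ownerReferences[*].kind}'
> kubectl -n spark get pod some-spark-app-1587613665284846569-driver -o jsonpath='{.metadata.ownerReferences[*].kind}'

If the chain is intact, these should print ScheduledSparkApplication and SparkApplication respectively.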

So we lost some executors, but it successfully recovered and completed. The second run, however,

> kubectl -n spark describe sparkapplication some-spark-app-1587617265304602851
...
Status:
  Application State:
    State:  PENDING_RERUN
  Driver Info:
  Execution Attempts:            1
  Last Submission Attempt Time:  <nil>
  Submission ID:                 893baefb-3f16-45f4-9c07-3c23e359a7f4
  Termination Time:              2020-04-23T04:48:54Z
Events:                          <none>

is stuck in the PENDING_RERUN state. And since we're using concurrencyPolicy: Forbid, the next run never gets scheduled.
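
For context, this is roughly where that policy sits in our spec (heavily trimmed; only the scheduling-related fields are shown, and the hourly cron expression is just an illustration matching the one-hour gap between the runs above). Switching to Allow or Replace might sidestep the stuck PENDING_RERUN run, but that is a guess on our side, not something we have verified:

apiVersion: sparkoperator.k8s.io/v1beta2
kind: ScheduledSparkApplication
metadata:
  name: some-spark-app
  namespace: spark
spec:
  schedule: "0 * * * *"       # illustrative hourly schedule
  concurrencyPolicy: Forbid   # what we run today; Allow or Replace may avoid the deadlock
  template:
    ...                       # the usual SparkApplication spec (driver, executor, image, ...)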

EliranTurgeman commented 1 year ago

Hi team, we have the same issue with version "v1beta2-1.3.3-3.1.1" (Helm chart 1.1.6). Does anyone know whether this issue was fixed in the latest version?

maitreyavvm commented 3 months ago

We are facing this issue with v1beta2-1.4.6-3.5.0. Are there any workarounds?

github-actions[bot] commented 1 week ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.