duongnt opened this issue 4 years ago
We have seen similar situations with ScheduledSparkApplication: the new driver pod did not come up due to errors like these:
Unable to mount volumes for pod "app-1583110801067267339-driver_spark-app(27f42810-5c21-11ea-8dc4-525400048875)": timeout expired waiting for volumes to attach or mount for pod "spark-app"/"app-1583110801067267339-driver". list of unmounted volumes=[spark-conf-volume]. list of unattached volumes=[spark-local-dir-1 spark-conf-volume spark-spark-token-bmdlg]
MountVolume.SetUp failed for volume "spark-conf-volume" : configmaps "app-1583083800800289009-1583083803653-driver-conf-map" not found
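For anyone debugging the same symptom, two quick checks (a sketch; the namespace and object names below are taken verbatim from the error messages above, adjust for your cluster) are to confirm whether the driver config map actually exists and to look at the driver pod's events:
> kubectl -n spark-app get configmap app-1583083800800289009-1583083803653-driver-conf-map
> kubectl -n spark-app describe pod app-1583110801067267339-driver
If the config map is already gone while the new driver pod still references it, that matches the "not found" error above.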
@danielfree we are also facing the same issue. Did you find a way to make it work?
We are facing a similar issue, and we had no failures. It looks like the link between the driver pod and the SparkApplication breaks, leaving them in an inconsistent state. Anyway, here is our ScheduledSparkApplication that stopped scheduling:
> kubectl -n spark describe scheduledsparkapplications some-spark-app
...
Status:
  Last Run:       2020-04-23T04:47:45Z
  Last Run Name:  some-spark-app-1587617265304602851
  Next Run:       2020-04-23T05:47:45Z
  Past Successful Run Names:
    some-spark-app-1587613665284846569
  Schedule State:  Scheduled
The snapshot above was taken well past 2020-04-23T05:47:45Z (the time of the next run, which never happened). Previously, there were 2 successful runs.
> kubectl -n spark get pods
NAME                                         READY   STATUS      RESTARTS   AGE
some-spark-app-1587613665284846569-driver    0/1     Completed   0          6h32m
some-spark-app-1587617265304602851-driver    0/1     Completed   0          5h30m
To my understanding, the ScheduledSparkApplication spawns a SparkApplication, which in turn spawns the driver and executor pods. This is where something weird is happening. The SparkApplication from the first run is:
> kubectl -n spark describe sparkapplication some-spark-app-1587613665284846569
...
Status:
  Application State:
    State:  COMPLETED
  Driver Info:
    Pod Name:             some-spark-app-1587613665284846569-driver
    Web UI Address:       172.20.142.172:4040
    Web UI Port:          4040
    Web UI Service Name:  some-spark-app-1587613665284846569-ui-svc
  Execution Attempts:  1
  Executor State:
    some-spark-app-1587613665284846569-1587613666789-exec-1:  FAILED
    some-spark-app-1587613665284846569-1587613666789-exec-2:  FAILED
  Last Submission Attempt Time:  2020-04-23T03:47:49Z
  Spark Application Id:          spark-c9f66b4148fd49a692d64b90652d7ada
  Submission Attempts:           1
  Submission ID:                 b3aca58d-f1ad-4e54-89e1-1ba11f779cda
  Termination Time:              2020-04-23T03:48:27Z
Events:  <none>
So we lost some executors, but the application successfully recovered and completed. The second run, however,
> kubectl -n spark describe sparkapplication some-spark-app-1587617265304602851
...
Status:
  Application State:
    State:  PENDING_RERUN
  Driver Info:
  Execution Attempts:            1
  Last Submission Attempt Time:  <nil>
  Submission ID:                 893baefb-3f16-45f4-9c07-3c23e359a7f4
  Termination Time:              2020-04-23T04:48:54Z
Events:  <none>
is stuck in the PENDING_RERUN state. And since we're using concurrencyPolicy: Forbid, the next run is never scheduled.
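For context, concurrencyPolicy is set on the ScheduledSparkApplication spec. Below is a minimal illustrative manifest, not our actual one (the name, namespace, schedule, image, paths and service account are placeholders), just to show where the relevant fields sit:

apiVersion: sparkoperator.k8s.io/v1beta2
kind: ScheduledSparkApplication
metadata:
  name: some-spark-app
  namespace: spark
spec:
  schedule: "@hourly"           # next run time is derived from this cron expression
  concurrencyPolicy: Forbid     # do not start the next run while the previous one is not terminal
  successfulRunHistoryLimit: 3
  failedRunHistoryLimit: 3
  template:                     # an ordinary SparkApplication spec used for every run
    type: Scala
    mode: cluster
    image: gcr.io/spark-operator/spark:v3.1.1
    sparkVersion: "3.1.1"
    mainClass: org.apache.spark.examples.SparkPi
    mainApplicationFile: local:///opt/spark/examples/jars/spark-examples_2.12-3.1.1.jar
    restartPolicy:
      type: Never
    driver:
      cores: 1
      memory: 512m
      serviceAccount: spark
    executor:
      cores: 1
      instances: 2
      memory: 512m

With Forbid, the controller will not start the next run while the previous SparkApplication is not in a terminal state, which is why a run stuck in PENDING_RERUN blocks the schedule indefinitely.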
Hi team, we have the same issue with version "v1beta2-1.3.3-3.1.1" (Helm chart 1.1.6). Does anyone know whether this was fixed in the latest version?
We are facing this issue with v1beta2-1.4.6-3.5.0. Any workarounds?
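One workaround that may help (a sketch only, not verified against every operator version mentioned here; the namespace and run name are taken from the example earlier in this thread) is to delete the SparkApplication that is stuck in PENDING_RERUN so the controller is free to create the next scheduled run:
> kubectl -n spark get sparkapplications
> kubectl -n spark delete sparkapplication some-spark-app-1587617265304602851
Whether the ScheduledSparkApplication then resumes on the next tick depends on the operator version, so treat this as a mitigation rather than a fix.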
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
I have a ScheduledSparkApplication instance that was running well. However, at some point it stopped creating new SparkApplication instances. Here are some relevant excerpts from the log:
The status of the ScheduledSparkApplication looks like this:
So it looks like the run
spark-app-1580705022570653626
is considered neither failed nor successful, and that blocks the next run from being started? I'm not sure why spark-app-1580705022570653626
wasn't able to start either, but it definitely left the ScheduledSparkApplication deployment in a bad state.
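If someone hits the same state, one way to see what the controller thinks of the stuck run (a sketch; <namespace> is a placeholder and the field path assumes the v1beta2 CRD) is to read its applicationState directly:
> kubectl -n <namespace> get sparkapplication spark-app-1580705022570653626 -o jsonpath='{.status.applicationState.state}'
> kubectl -n <namespace> get sparkapplication spark-app-1580705022570653626 -o yaml
If the state is empty or non-terminal even though no driver pod exists, the parent ScheduledSparkApplication keeps waiting on it, which matches the blocking behaviour described above.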