Executor pods are stuck in ERROR status, and the following messages appear in the driver pod log:
2021-12-07T15:14:13.614 [Timer-0hread] WARN org.apache.spark.scheduler.TaskSchedulerImpl - Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources
2021-12-07T15:18:12.301 [OkHttp https://kubernetes.default.svc/...hread] WARN io.fabric8.kubernetes.client.dsl.internal.WatchConnectionManager - Exec Failure
javax.net.ssl.SSLException: Connection reset
at sun.security.ssl.Alert.createSSLException(Alert.java:127)
at sun.security.ssl.TransportContext.fatal(TransportContext.java:324)
at sun.security.ssl.TransportContext.fatal(TransportContext.java:267)
at sun.security.ssl.TransportContext.fatal(TransportContext.java:262)
at sun.security.ssl.SSLSocketImpl.handleException(SSLSocketImpl.java:1563)
at sun.security.ssl.SSLSocketImpl.access$400(SSLSocketImpl.java:73)
at sun.security.ssl.SSLSocketImpl$AppInputStream.read(SSLSocketImpl.java:973)
at okio.Okio$2.read(Okio.java:139)
at okio.AsyncTimeout$2.read(AsyncTimeout.java:237)
at okio.RealBufferedSource.request(RealBufferedSource.java:67)
at okio.RealBufferedSource.require(RealBufferedSource.java:60)
at okio.RealBufferedSource.readByte(RealBufferedSource.java:73)
at okhttp3.internal.ws.WebSocketReader.readHeader(WebSocketReader.java:113)
at okhttp3.internal.ws.WebSocketReader.processNextFrame(WebSocketReader.java:97)
at okhttp3.internal.ws.RealWebSocket.loopReader(RealWebSocket.java:262)
at okhttp3.internal.ws.RealWebSocket$2.onResponse(RealWebSocket.java:201)
at okhttp3.RealCall$AsyncCall.execute(RealCall.java:141)
at okhttp3.internal.NamedRunnable.run(NamedRunnable.java:32)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Suppressed: java.net.SocketException: Broken pipe (Write failed)
at java.net.SocketOutputStream.socketWrite0(Native Method)
at java.net.SocketOutputStream.socketWrite(SocketOutputStream.java:111)
at java.net.SocketOutputStream.write(SocketOutputStream.java:155)
at sun.security.ssl.SSLSocketOutputRecord.encodeAlert(SSLSocketOutputRecord.java:81)
at sun.security.ssl.TransportContext.fatal(TransportContext.java:355)
... 19 more
Caused by: java.net.SocketException: Connection reset
at java.net.SocketInputStream.read(SocketInputStream.java:210)
at java.net.SocketInputStream.read(SocketInputStream.java:141)
at sun.security.ssl.SSLSocketInputRecord.read(SSLSocketInputRecord.java:464)
at sun.security.ssl.SSLSocketInputRecord.bytesInCompletePacket(SSLSocketInputRecord.java:68)
at sun.security.ssl.SSLSocketImpl.readApplicationRecord(SSLSocketImpl.java:1341)
at sun.security.ssl.SSLSocketImpl.access$300(SSLSocketImpl.java:73)
at sun.security.ssl.SSLSocketImpl$AppInputStream.read(SSLSocketImpl.java:957)
... 14 more
2021-12-07T15:18:13.347 [OkHttp https://kubernetes.default.svc/...hread] WARN org.apache.spark.scheduler.cluster.k8s.ExecutorPodsWatchSnapshotSource - Kubernetes client has been closed (this is expected if the application is shutting down.)
io.fabric8.kubernetes.client.KubernetesClientException: too old resource version: 1574634 (1575608)
at io.fabric8.kubernetes.client.dsl.internal.WatchConnectionManager$1.onMessage(WatchConnectionManager.java:259)
at okhttp3.internal.ws.RealWebSocket.onReadMessage(RealWebSocket.java:307)
at okhttp3.internal.ws.WebSocketReader.readMessageFrame(WebSocketReader.java:222)
at okhttp3.internal.ws.WebSocketReader.processNextFrame(WebSocketReader.java:101)
at okhttp3.internal.ws.RealWebSocket.loopReader(RealWebSocket.java:262)
at okhttp3.internal.ws.RealWebSocket$2.onResponse(RealWebSocket.java:201)
at okhttp3.RealCall$AsyncCall.execute(RealCall.java:141)
at okhttp3.internal.NamedRunnable.run(NamedRunnable.java:32)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
I think in this case the driver pod should be killed automatically and the SparkApplication rescheduled.
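To make the idea a bit more concrete, here is a minimal sketch of how an external watchdog might recognise this state from the driver log text. It assumes the two messages in the log above (the closed Kubernetes client / "too old resource version" and the "Initial job has not accepted any resources" warning) are reliable markers, which may not hold in every setup. The function names and marker lists are hypothetical, and the actual kill/resubmit step is deliberately left out.

```python
# Hypothetical markers copied from the driver log in this report.
# Assumption: seeing one of these means the executor pod watch is gone.
WATCH_CLOSED_MARKERS = (
    "Kubernetes client has been closed",
    "too old resource version",
)

# Assumption: this warning means the job is not getting any executors.
STARVED_MARKER = "Initial job has not accepted any resources"


def driver_watch_is_broken(log_text: str) -> bool:
    """Return True if the log contains any marker of a closed pod watch."""
    return any(marker in log_text for marker in WATCH_CLOSED_MARKERS)


def should_restart_driver(log_text: str) -> bool:
    """Heuristic: the watch was closed AND the job is starved of executors."""
    return driver_watch_is_broken(log_text) and STARVED_MARKER in log_text
```

Requiring both markers is a deliberate choice in this sketch: the "Kubernetes client has been closed" message is also logged during a normal shutdown, so on its own it would be a poor trigger.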
Hi Expert,
This might be an issue similar to https://github.com/GoogleCloudPlatform/spark-on-k8s-operator/issues/895. I sometimes hit it when multiple SparkApplications are being scheduled at the same time; the symptoms and the driver pod log are given above.
Thanks, Flo