kubeflow / spark-operator

Kubernetes operator for managing the lifecycle of Apache Spark applications on Kubernetes.
Apache License 2.0

Executor pod in Error status won't automatically get recreated #1428

Open FloraZhang opened 2 years ago

FloraZhang commented 2 years ago

Hi experts,

This might be an issue similar to https://github.com/GoogleCloudPlatform/spark-on-k8s-operator/issues/895. I sometimes hit it when multiple SparkApplications are scheduled at the same time:

Executor pods get stuck in ERROR status, and the following messages appear in the driver pod log:

2021-12-07T15:14:13.614 [Timer-0hread]  WARN org.apache.spark.scheduler.TaskSchedulerImpl - Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources
2021-12-07T15:18:12.301 [OkHttp https://kubernetes.default.svc/...hread]  WARN io.fabric8.kubernetes.client.dsl.internal.WatchConnectionManager - Exec Failure
javax.net.ssl.SSLException: Connection reset
    at sun.security.ssl.Alert.createSSLException(Alert.java:127)
    at sun.security.ssl.TransportContext.fatal(TransportContext.java:324)
    at sun.security.ssl.TransportContext.fatal(TransportContext.java:267)
    at sun.security.ssl.TransportContext.fatal(TransportContext.java:262)
    at sun.security.ssl.SSLSocketImpl.handleException(SSLSocketImpl.java:1563)
    at sun.security.ssl.SSLSocketImpl.access$400(SSLSocketImpl.java:73)
    at sun.security.ssl.SSLSocketImpl$AppInputStream.read(SSLSocketImpl.java:973)
    at okio.Okio$2.read(Okio.java:139)
    at okio.AsyncTimeout$2.read(AsyncTimeout.java:237)
    at okio.RealBufferedSource.request(RealBufferedSource.java:67)
    at okio.RealBufferedSource.require(RealBufferedSource.java:60)
    at okio.RealBufferedSource.readByte(RealBufferedSource.java:73)
    at okhttp3.internal.ws.WebSocketReader.readHeader(WebSocketReader.java:113)
    at okhttp3.internal.ws.WebSocketReader.processNextFrame(WebSocketReader.java:97)
    at okhttp3.internal.ws.RealWebSocket.loopReader(RealWebSocket.java:262)
    at okhttp3.internal.ws.RealWebSocket$2.onResponse(RealWebSocket.java:201)
    at okhttp3.RealCall$AsyncCall.execute(RealCall.java:141)
    at okhttp3.internal.NamedRunnable.run(NamedRunnable.java:32)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)
    Suppressed: java.net.SocketException: Broken pipe (Write failed)
        at java.net.SocketOutputStream.socketWrite0(Native Method)
        at java.net.SocketOutputStream.socketWrite(SocketOutputStream.java:111)
        at java.net.SocketOutputStream.write(SocketOutputStream.java:155)
        at sun.security.ssl.SSLSocketOutputRecord.encodeAlert(SSLSocketOutputRecord.java:81)
        at sun.security.ssl.TransportContext.fatal(TransportContext.java:355)
        ... 19 more
Caused by: java.net.SocketException: Connection reset
    at java.net.SocketInputStream.read(SocketInputStream.java:210)
    at java.net.SocketInputStream.read(SocketInputStream.java:141)
    at sun.security.ssl.SSLSocketInputRecord.read(SSLSocketInputRecord.java:464)
    at sun.security.ssl.SSLSocketInputRecord.bytesInCompletePacket(SSLSocketInputRecord.java:68)
    at sun.security.ssl.SSLSocketImpl.readApplicationRecord(SSLSocketImpl.java:1341)
    at sun.security.ssl.SSLSocketImpl.access$300(SSLSocketImpl.java:73)
    at sun.security.ssl.SSLSocketImpl$AppInputStream.read(SSLSocketImpl.java:957)
    ... 14 more
2021-12-07T15:18:13.347 [OkHttp https://kubernetes.default.svc/...hread]  WARN org.apache.spark.scheduler.cluster.k8s.ExecutorPodsWatchSnapshotSource - Kubernetes client has been closed (this is expected if the application is shutting down.)
io.fabric8.kubernetes.client.KubernetesClientException: too old resource version: 1574634 (1575608)
    at io.fabric8.kubernetes.client.dsl.internal.WatchConnectionManager$1.onMessage(WatchConnectionManager.java:259)
    at okhttp3.internal.ws.RealWebSocket.onReadMessage(RealWebSocket.java:307)
    at okhttp3.internal.ws.WebSocketReader.readMessageFrame(WebSocketReader.java:222)
    at okhttp3.internal.ws.WebSocketReader.processNextFrame(WebSocketReader.java:101)
    at okhttp3.internal.ws.RealWebSocket.loopReader(RealWebSocket.java:262)
    at okhttp3.internal.ws.RealWebSocket$2.onResponse(RealWebSocket.java:201)
    at okhttp3.RealCall$AsyncCall.execute(RealCall.java:141)
    at okhttp3.internal.NamedRunnable.run(NamedRunnable.java:32)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)

I think in this case the driver pod needs to be killed automatically and the SparkApplication re-scheduled.
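For the re-scheduling part, the operator's `SparkApplication` spec already supports automatic restarts through `restartPolicy`. A minimal sketch is below; the application name is hypothetical and the retry counts and intervals are illustrative assumptions, not recommendations:

```yaml
# Sketch: SparkApplication with automatic retry on failure.
# restartPolicy is part of the spark-operator v1beta2 API; the
# specific counts/intervals below are illustrative only.
apiVersion: "sparkoperator.k8s.io/v1beta2"
kind: SparkApplication
metadata:
  name: my-app          # hypothetical name
spec:
  restartPolicy:
    type: OnFailure                     # restart the driver when it fails
    onFailureRetries: 3                 # give up after 3 failed runs
    onFailureRetryInterval: 10          # seconds between failure retries
    onSubmissionFailureRetries: 5       # retries if submission itself fails
    onSubmissionFailureRetryInterval: 20
```

Note that this only triggers when the driver itself terminates in a failed state; in the scenario above the driver keeps running and just logs "Initial job has not accepted any resources", so the stuck driver would still need to be killed first for the restart to kick in.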

Thanks, Flo

github-actions[bot] commented 6 days ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.