devender-yadav opened 2 years ago
Relevant discussion - https://issues.apache.org/jira/browse/SPARK-33349
We're seeing the exact same issue with PySpark v3.2.1. The streaming jobs just stall instead of the driver exiting so the job can restart.
22/03/28 19:57:19 WARN ExecutorPodsWatchSnapshotSource: Kubernetes client has been closed (this is expected if the application is shutting down.)
io.fabric8.kubernetes.client.WatcherException: too old resource version: 1499049025 (1499196141)
at io.fabric8.kubernetes.client.dsl.internal.WatchConnectionManager$TypedWatcherWebSocketListener.onMessage(WatchConnectionManager.java:103)
at okhttp3.internal.ws.RealWebSocket.onReadMessage(RealWebSocket.java:323)
at okhttp3.internal.ws.WebSocketReader.readMessageFrame(WebSocketReader.java:219)
at okhttp3.internal.ws.WebSocketReader.processNextFrame(WebSocketReader.java:105)
at okhttp3.internal.ws.RealWebSocket.loopReader(RealWebSocket.java:274)
at okhttp3.internal.ws.RealWebSocket$2.onResponse(RealWebSocket.java:214)
at okhttp3.RealCall$AsyncCall.execute(RealCall.java:203)
at okhttp3.internal.NamedRunnable.run(NamedRunnable.java:32)
at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
at java.base/java.lang.Thread.run(Unknown Source)
Caused by: io.fabric8.kubernetes.client.KubernetesClientException: too old resource version: 1499049025 (1499196141)
... 11 more
Update: this message usually appears 10-20 minutes after the preceding log entry, so it may be a red herring. Unfortunately there are no errors in the log before it, so there's nothing else obvious to share.
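For context: "too old resource version" is the Kubernetes API server answering a watch with HTTP 410 Gone because the client's cached resourceVersion has fallen out of the server's history window. A robust client reacts by re-listing and starting a fresh watch rather than dying, which is roughly what the SPARK-33349 fix does inside ExecutorPodsWatchSnapshotSource. A minimal sketch against the fabric8 client (the "spark" namespace and the demo sleep are illustrative, not from this thread):

```java
import io.fabric8.kubernetes.api.model.Pod;
import io.fabric8.kubernetes.client.DefaultKubernetesClient;
import io.fabric8.kubernetes.client.KubernetesClient;
import io.fabric8.kubernetes.client.Watcher;
import io.fabric8.kubernetes.client.WatcherException;

public class ResilientPodWatch {

    public static void main(String[] args) throws InterruptedException {
        KubernetesClient client = new DefaultKubernetesClient();
        watchPods(client, "spark"); // namespace is a placeholder
        Thread.sleep(Long.MAX_VALUE); // demo only: keep the JVM alive
    }

    // Start a pod watch that re-subscribes when the API server expires
    // our resourceVersion (HTTP 410 Gone -> "too old resource version").
    static void watchPods(KubernetesClient client, String namespace) {
        client.pods().inNamespace(namespace).watch(new Watcher<Pod>() {
            @Override
            public void eventReceived(Action action, Pod pod) {
                System.out.printf("%s %s%n", action, pod.getMetadata().getName());
            }

            @Override
            public void onClose(WatcherException cause) {
                // isHttpGone() is true when the close was caused by a 410.
                if (cause != null && cause.isHttpGone()) {
                    // Our cached resourceVersion fell out of the server's
                    // history window; start a fresh watch from "now".
                    watchPods(client, namespace);
                }
            }
        });
    }
}
```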
@cmoad what's your k8s client and server version?
Server version: v1.21.9 Client version: kubernetes-client-5.4.1.jar
Should be good based on the compatibility matrix: https://github.com/fabric8io/kubernetes-client#kubernetes-compatibility-matrix
Faced the same issue. Spark hangs forever right after the write-to-Parquet stage ends.
kubernetes-client-5.4.1 Server Version: version.Info{
Major:"1", Minor:"21+", GitVersion:"v1.21.5-eks-bc4871b", GitCommit:"5236faf39f1b7a7dabea8df12726f25608131aa9"
FWIW, we now believe this error was a red herring. We found the true cause to be a silent OOM on an executor; this error showed up a while later in the driver logs. Anyone else seeing this should look carefully for errors ~20-30 seconds earlier.
Yep, I had OOM issues that preceded the above error. Bumping the client version to 5.5.0 helped to surface it: no app hang, just the usual OOM crash.
Server version: v1.20.15-eks-84b4fe6, client version: kubernetes-client-5.4.1.jar. We are getting this issue intermittently.
2022/08/28 19:51:20 INFO SparkTBinaryFrontendService: Client protocol version: HIVE_CLI_SERVICE_PROTOCOL_V6
2022/08/28 19:51:20 INFO SparkSQLSessionManager: Opening session for hive@172.31.37.65
2022/08/28 19:51:20 WARN SparkSessionImpl: Cannot modify the value of a Spark config: spark.driver.memory
2022/08/28 19:51:20 INFO SparkSQLSessionManager: hive's session with SessionHandle [4465a33f-8ac6-4f0b-bccd-dd4702ee0b7b] is opened, current opening sessions 1
2022/08/28 20:42:34 WARN ExecutorPodsWatchSnapshotSource: Kubernetes client has been closed (this is expected if the application is shutting down.)
io.fabric8.kubernetes.client.WatcherException: too old resource version: 61967434 (62009935)
at io.fabric8.kubernetes.client.dsl.internal.WatchConnectionManager$TypedWatcherWebSocketListener.onMessage(WatchConnectionManager.java:103)
at okhttp3.internal.ws.RealWebSocket.onReadMessage(RealWebSocket.java:323)
at okhttp3.internal.ws.WebSocketReader.readMessageFrame(WebSocketReader.java:219)
at okhttp3.internal.ws.WebSocketReader.processNextFrame(WebSocketReader.java:105)
at okhttp3.internal.ws.RealWebSocket.loopReader(RealWebSocket.java:274)
at okhttp3.internal.ws.RealWebSocket$2.onResponse(RealWebSocket.java:214)
at okhttp3.RealCall$AsyncCall.execute(RealCall.java:203)
at okhttp3.internal.NamedRunnable.run(NamedRunnable.java:32)
at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
at java.base/java.lang.Thread.run(Unknown Source)
Caused by: io.fabric8.kubernetes.client.KubernetesClientException: too old resource version: 61967434 (62009935)
... 11 more
Have you set the allowWatchBookmarks param in your watch options? https://kubernetes.io/docs/reference/using-api/api-concepts/
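For anyone unfamiliar with that suggestion: watch bookmarks make the API server periodically send its current resourceVersion, so a long-lived watch is less likely to expire into a 410. In the fabric8 client this can be opted into via ListOptions; a hedged sketch (I haven't verified this overload against kubernetes-client 5.4.1 specifically, and the "spark" namespace is illustrative):

```java
import io.fabric8.kubernetes.api.model.ListOptionsBuilder;
import io.fabric8.kubernetes.api.model.Pod;
import io.fabric8.kubernetes.client.DefaultKubernetesClient;
import io.fabric8.kubernetes.client.KubernetesClient;
import io.fabric8.kubernetes.client.Watcher;
import io.fabric8.kubernetes.client.WatcherException;

public class BookmarkWatch {
    public static void main(String[] args) {
        try (KubernetesClient client = new DefaultKubernetesClient()) {
            client.pods().inNamespace("spark").watch(
                // Bookmark events periodically refresh the watcher's
                // resourceVersion so it is less likely to go stale.
                new ListOptionsBuilder().withAllowWatchBookmarks(true).build(),
                new Watcher<Pod>() {
                    @Override
                    public void eventReceived(Action action, Pod pod) {
                        System.out.printf("%s %s%n", action, pod.getMetadata().getName());
                    }

                    @Override
                    public void onClose(WatcherException cause) {
                        System.err.println("watch closed: " + cause);
                    }
                });
            Thread.sleep(60_000); // demo only: observe events for a minute
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }
}
```

Note that Spark's own executor-pod watcher is created internally, so this only helps if you control the watch yourself; it doesn't change what the Spark driver does.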
Hello,
I have a similar situation with my spark application, here are the relevant logs:
22/10/20 11:10:32 INFO BlockManagerMasterEndpoint: Registering block manager 10.144.4.100:46847 with 2.2 GiB RAM, BlockManagerId(1, 10.144.4.100, 46847, None)
22/10/20 11:10:33 INFO SharedState: Setting hive.metastore.warehouse.dir ('null') to the value of spark.sql.warehouse.dir.
22/10/20 11:10:33 INFO SharedState: Warehouse path is 'file:/opt/spark/work-dir/spark-warehouse'.
22/10/20 11:44:46 WARN ExecutorPodsWatchSnapshotSource: Kubernetes client has been closed (this is expected if the application is shutting down.)
io.fabric8.kubernetes.client.WatcherException: too old resource version: 58557205 (58564779)
at io.fabric8.kubernetes.client.dsl.internal.WatchConnectionManager$TypedWatcherWebSocketListener.onMessage(WatchConnectionManager.java:103)
at okhttp3.internal.ws.RealWebSocket.onReadMessage(RealWebSocket.java:323)
at okhttp3.internal.ws.WebSocketReader.readMessageFrame(WebSocketReader.java:219)
at okhttp3.internal.ws.WebSocketReader.processNextFrame(WebSocketReader.java:105)
at okhttp3.internal.ws.RealWebSocket.loopReader(RealWebSocket.java:274)
at okhttp3.internal.ws.RealWebSocket$2.onResponse(RealWebSocket.java:214)
at okhttp3.RealCall$AsyncCall.execute(RealCall.java:203)
at okhttp3.internal.NamedRunnable.run(NamedRunnable.java:32)
at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
at java.base/java.lang.Thread.run(Unknown Source)
Caused by: io.fabric8.kubernetes.client.KubernetesClientException: too old resource version: 58557205 (58564779)
... 11 more
22/10/20 12:01:37 INFO CodeGenerator: Code generated in 516.484966 ms
I don't have any other exception or error message in my logs, driver or executors. Is there something I can try?
Thank you.
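Given the earlier reports that a silent executor OOM precedes this error, one concrete thing to check is whether any executor container was OOMKilled. A sketch using the fabric8 client directly (the "spark" namespace is an assumption; `kubectl describe pod` exposes the same Last State / Reason fields):

```java
import io.fabric8.kubernetes.client.DefaultKubernetesClient;
import io.fabric8.kubernetes.client.KubernetesClient;

public class FindOomKilledExecutors {
    public static void main(String[] args) {
        try (KubernetesClient client = new DefaultKubernetesClient()) {
            client.pods().inNamespace("spark").list().getItems().forEach(pod -> {
                if (pod.getStatus() == null
                        || pod.getStatus().getContainerStatuses() == null) {
                    return; // pod not scheduled yet, nothing to inspect
                }
                pod.getStatus().getContainerStatuses().forEach(cs -> {
                    // A container killed by the kernel OOM killer reports
                    // reason "OOMKilled" in its last terminated state.
                    if (cs.getLastState() != null
                            && cs.getLastState().getTerminated() != null
                            && "OOMKilled".equals(
                                cs.getLastState().getTerminated().getReason())) {
                        System.out.println(pod.getMetadata().getName()
                                + " container " + cs.getName() + " was OOMKilled");
                    }
                });
            });
        }
    }
}
```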
Hello all, I am also facing the same issue. Does anyone have a workaround for it?
I would recommend we close this issue. Several of us have found that this error message is downstream of the true, critical failure.
@cmoad what is the solution for this?
If you are using GCS: after upgrading to 3.3.0 I no longer see the "too old resource version" error, and I discovered that the "hanging" behavior was actually Spark repairing a bunch of directories in my bucket. https://groups.google.com/g/cloud-dataproc-discuss/c/JKcimdnskJc recommends setting "fs.gs.implicit.dir.repair.enable" to false.
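To spell that out: `fs.gs.implicit.dir.repair.enable` is a Hadoop-level setting of the GCS connector, so from Spark you can pass it with the `spark.hadoop.` prefix. A minimal sketch (the app name and `gs://my-bucket/out` path are placeholders, and it assumes the GCS connector is on the classpath):

```java
import org.apache.spark.sql.SparkSession;

public class DisableGcsDirRepair {
    public static void main(String[] args) {
        // The "spark.hadoop." prefix forwards the key into the Hadoop
        // configuration that the GCS connector reads.
        SparkSession spark = SparkSession.builder()
                .appName("gcs-no-implicit-dir-repair")
                .config("spark.hadoop.fs.gs.implicit.dir.repair.enable", "false")
                .getOrCreate();

        // Hypothetical write to show the setting in effect end to end.
        spark.range(10).write().mode("overwrite").parquet("gs://my-bucket/out");
        spark.stop();
    }
}
```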
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
Image used: gcr.io/spark-operator/spark:v3.1.1
Kubernetes client jar: kubernetes-client-4.12.0.jar
We are getting this issue intermittently.
Relevant logs:
Any pointers on how to fix this?