kubeflow / spark-operator

Kubernetes operator for managing the lifecycle of Apache Spark applications on Kubernetes.

KubernetesClientException: too old resource version #1498

Open · devender-yadav opened this issue 2 years ago

devender-yadav commented 2 years ago

Image used: gcr.io/spark-operator/spark:v3.1.1
Kubernetes client jar: kubernetes-client-4.12.0.jar

We are getting this issue intermittently.

Relevant Logs:

io.fabric8.kubernetes.client.KubernetesClientException: too old resource version
    at io.fabric8.kubernetes.client.dsl.internal.WatchConnectionManager$1.onMessage(WatchConnectionManager.java:258)
    at okhttp3.internal.ws.RealWebSocket.onReadMessage(RealWebSocket.java:323)
    at okhttp3.internal.ws.WebSocketReader.readMessageFrame(WebSocketReader.java:219)
    at okhttp3.internal.ws.WebSocketReader.processNextFrame(WebSocketReader.java:105)
    at okhttp3.internal.ws.RealWebSocket.loopReader(RealWebSocket.java:274)
    at okhttp3.internal.ws.RealWebSocket$2.onResponse(RealWebSocket.java:214)
    at okhttp3.RealCall$AsyncCall.execute(RealCall.java:203)
    at okhttp3.internal.NamedRunnable.run(NamedRunnable.java:32)
    at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
    at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
    at java.base/java.lang.Thread.run(Unknown Source)

Any pointers on how to fix this?

devender-yadav commented 2 years ago

Relevant discussion - https://issues.apache.org/jira/browse/SPARK-33349
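
For context: the API server only retains a limited window of resource versions, so a watch that falls behind (or reconnects with a stale resourceVersion) gets HTTP 410 Gone, which the fabric8 client surfaces as "too old resource version". The standard client-side recovery is to discard the stale version and start a fresh list+watch. A minimal sketch against the fabric8 5.x API (the re-watch handler below is illustrative, not Spark's or the operator's actual code):

    import io.fabric8.kubernetes.api.model.Pod;
    import io.fabric8.kubernetes.client.DefaultKubernetesClient;
    import io.fabric8.kubernetes.client.KubernetesClient;
    import io.fabric8.kubernetes.client.Watcher;
    import io.fabric8.kubernetes.client.WatcherException;

    public class ResilientPodWatch {

        static void watchPods(KubernetesClient client, String ns) {
            client.pods().inNamespace(ns).watch(new Watcher<Pod>() {
                @Override
                public void eventReceived(Action action, Pod pod) {
                    System.out.printf("%s %s%n", action, pod.getMetadata().getName());
                }

                @Override
                public void onClose(WatcherException cause) {
                    // HTTP 410 ("too old resource version") lands here: the cached
                    // resourceVersion has been compacted away, so start a brand-new
                    // watch instead of resuming from the stale version.
                    if (cause != null && cause.isHttpGone()) {
                        watchPods(client, ns);
                    }
                }
            });
        }

        public static void main(String[] args) throws InterruptedException {
            try (KubernetesClient client = new DefaultKubernetesClient()) {
                watchPods(client, "default");
                Thread.sleep(60_000); // keep the demo alive long enough to see events
            }
        }
    }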

cmoad commented 2 years ago

We're seeing the exact same issue with pyspark v3.2.1. The streaming jobs just stall instead of the driver exiting so the job can be restarted.

22/03/28 19:57:19 WARN ExecutorPodsWatchSnapshotSource: Kubernetes client has been closed (this is expected if the application is shutting down.)
io.fabric8.kubernetes.client.WatcherException: too old resource version: 1499049025 (1499196141)
    at io.fabric8.kubernetes.client.dsl.internal.WatchConnectionManager$TypedWatcherWebSocketListener.onMessage(WatchConnectionManager.java:103)
    at okhttp3.internal.ws.RealWebSocket.onReadMessage(RealWebSocket.java:323)
    at okhttp3.internal.ws.WebSocketReader.readMessageFrame(WebSocketReader.java:219)
    at okhttp3.internal.ws.WebSocketReader.processNextFrame(WebSocketReader.java:105)
    at okhttp3.internal.ws.RealWebSocket.loopReader(RealWebSocket.java:274)
    at okhttp3.internal.ws.RealWebSocket$2.onResponse(RealWebSocket.java:214)
    at okhttp3.RealCall$AsyncCall.execute(RealCall.java:203)
    at okhttp3.internal.NamedRunnable.run(NamedRunnable.java:32)
    at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
    at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
    at java.base/java.lang.Thread.run(Unknown Source)
Caused by: io.fabric8.kubernetes.client.KubernetesClientException: too old resource version: 1499049025 (1499196141)
    ... 11 more

Update: this message usually appears 10-20 minutes after the log item right before it, so it may be a red herring. Unfortunately there are no errors in the log before it, so there's nothing else obvious to share.

devender-yadav commented 2 years ago

@cmoad what are your k8s client and server versions?

cmoad commented 2 years ago

Server version: v1.21.9
Client version: kubernetes-client-5.4.1.jar

Should be good based on the compatibility matrix: https://github.com/fabric8io/kubernetes-client#kubernetes-compatibility-matrix

devrivne commented 2 years ago

Faced the same issue. Spark hangs forever right after the write-to-Parquet stage ends.

Client version: kubernetes-client-5.4.1
Server version: version.Info{Major:"1", Minor:"21+", GitVersion:"v1.21.5-eks-bc4871b", GitCommit:"5236faf39f1b7a7dabea8df12726f25608131aa9"}

cmoad commented 2 years ago

FWIW, we now believe this error was a red herring. We found the true cause to be a quiet OOM on an executor; this error appeared a while later in the driver logs. Anyone else seeing this should look carefully for errors occurring ~20-30 seconds before it.

devrivne commented 2 years ago

Yep, I had OOM issues that preceded the above error. Bumping the client version up to 5.5.0 helped surface it: no app hanging, just the usual OOM crash.

qshian commented 2 years ago

Server version: v1.20.15-eks-84b4fe6
Client version: kubernetes-client-5.4.1.jar

We are getting this issue intermittently.


2022/08/28 19:51:20 INFO SparkTBinaryFrontendService: Client protocol version: HIVE_CLI_SERVICE_PROTOCOL_V6
2022/08/28 19:51:20 INFO SparkSQLSessionManager: Opening session for hive@172.31.37.65
2022/08/28 19:51:20 WARN SparkSessionImpl: Cannot modify the value of a Spark config: spark.driver.memory
2022/08/28 19:51:20 INFO SparkSQLSessionManager: hive's session with SessionHandle [4465a33f-8ac6-4f0b-bccd-dd4702ee0b7b] is opened, current opening sessions 1
2022/08/28 20:42:34 WARN ExecutorPodsWatchSnapshotSource: Kubernetes client has been closed (this is expected if the application is shutting down.)
io.fabric8.kubernetes.client.WatcherException: too old resource version: 61967434 (62009935)
    at io.fabric8.kubernetes.client.dsl.internal.WatchConnectionManager$TypedWatcherWebSocketListener.onMessage(WatchConnectionManager.java:103)
    at okhttp3.internal.ws.RealWebSocket.onReadMessage(RealWebSocket.java:323)
    at okhttp3.internal.ws.WebSocketReader.readMessageFrame(WebSocketReader.java:219)
    at okhttp3.internal.ws.WebSocketReader.processNextFrame(WebSocketReader.java:105)
    at okhttp3.internal.ws.RealWebSocket.loopReader(RealWebSocket.java:274)
    at okhttp3.internal.ws.RealWebSocket$2.onResponse(RealWebSocket.java:214)
    at okhttp3.RealCall$AsyncCall.execute(RealCall.java:203)
    at okhttp3.internal.NamedRunnable.run(NamedRunnable.java:32)
    at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
    at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
    at java.base/java.lang.Thread.run(Unknown Source)
Caused by: io.fabric8.kubernetes.client.KubernetesClientException: too old resource version: 61967434 (62009935)
    ... 11 more

wang007 commented 2 years ago

Have you set the AllowWatchBookmarks param in the watch options? https://kubernetes.io/docs/reference/using-api/api-concepts/
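
For anyone unfamiliar with that option: bookmark events let the API server periodically push the newest resourceVersion to the client, so a watch that reconnects is much less likely to resume from a compacted (stale) version. A hedged sketch of enabling it with the fabric8 client (whether Spark's ExecutorPodsWatchSnapshotSource exposes this knob is a separate question):

    import io.fabric8.kubernetes.api.model.ListOptionsBuilder;
    import io.fabric8.kubernetes.api.model.Pod;
    import io.fabric8.kubernetes.client.DefaultKubernetesClient;
    import io.fabric8.kubernetes.client.KubernetesClient;
    import io.fabric8.kubernetes.client.Watcher;
    import io.fabric8.kubernetes.client.WatcherException;

    public class BookmarkWatch {
        public static void main(String[] args) throws InterruptedException {
            try (KubernetesClient client = new DefaultKubernetesClient()) {
                client.pods().inNamespace("default").watch(
                        // BOOKMARK events carry only a fresh resourceVersion, which the
                        // client can use when it has to re-establish the watch.
                        new ListOptionsBuilder().withAllowWatchBookmarks(true).build(),
                        new Watcher<Pod>() {
                            @Override
                            public void eventReceived(Action action, Pod pod) {
                                System.out.printf("%s %s%n", action, pod.getMetadata().getName());
                            }

                            @Override
                            public void onClose(WatcherException cause) {
                                System.err.println("watch closed: " + cause);
                            }
                        });
                Thread.sleep(60_000); // keep the watch open for the demo
            }
        }
    }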

pedro93 commented 1 year ago

Hello,

I have a similar situation with my Spark application; here are the relevant logs:

22/10/20 11:10:32 INFO BlockManagerMasterEndpoint: Registering block manager 10.144.4.100:46847 with 2.2 GiB RAM, BlockManagerId(1, 10.144.4.100, 46847, None)
22/10/20 11:10:33 INFO SharedState: Setting hive.metastore.warehouse.dir ('null') to the value of spark.sql.warehouse.dir.
22/10/20 11:10:33 INFO SharedState: Warehouse path is 'file:/opt/spark/work-dir/spark-warehouse'.
22/10/20 11:44:46 WARN ExecutorPodsWatchSnapshotSource: Kubernetes client has been closed (this is expected if the application is shutting down.)
io.fabric8.kubernetes.client.WatcherException: too old resource version: 58557205 (58564779)
    at io.fabric8.kubernetes.client.dsl.internal.WatchConnectionManager$TypedWatcherWebSocketListener.onMessage(WatchConnectionManager.java:103)
    at okhttp3.internal.ws.RealWebSocket.onReadMessage(RealWebSocket.java:323)
    at okhttp3.internal.ws.WebSocketReader.readMessageFrame(WebSocketReader.java:219)
    at okhttp3.internal.ws.WebSocketReader.processNextFrame(WebSocketReader.java:105)
    at okhttp3.internal.ws.RealWebSocket.loopReader(RealWebSocket.java:274)
    at okhttp3.internal.ws.RealWebSocket$2.onResponse(RealWebSocket.java:214)
    at okhttp3.RealCall$AsyncCall.execute(RealCall.java:203)
    at okhttp3.internal.NamedRunnable.run(NamedRunnable.java:32)
    at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
    at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
    at java.base/java.lang.Thread.run(Unknown Source)
Caused by: io.fabric8.kubernetes.client.KubernetesClientException: too old resource version: 58557205 (58564779)
    ... 11 more
22/10/20 12:01:37 INFO CodeGenerator: Code generated in 516.484966 ms

I don't have any other exception or error message in my logs, driver or executors. Is there something I can try?

Thank you.

sbbagal13 commented 1 year ago

Hello all, I am also facing the same issue. Does anyone have a workaround for it?

cmoad commented 1 year ago

I would recommend we close this issue. Several of us have found this error message to be downstream of the true, critical failure.

sbbagal13 commented 1 year ago

@cmoad what is the solution for this?

noahshpak commented 1 year ago

If you are using GCS: after upgrading to 3.3.0 I no longer see the "too old resource version" error, and I found that the "hanging" behavior was actually Spark repairing a bunch of directories in my bucket. https://groups.google.com/g/cloud-dataproc-discuss/c/JKcimdnskJc recommends setting "fs.gs.implicit.dir.repair.enable" to false.
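
For anyone who wants to try that: it is a GCS-connector Hadoop property, so it can be passed through Spark's spark.hadoop.* config prefix. A minimal sketch (the property name comes from the linked thread; the session setup and bucket path are illustrative):

    import org.apache.spark.sql.SparkSession;

    public class GcsRepairOff {
        public static void main(String[] args) {
            SparkSession spark = SparkSession.builder()
                    .appName("gcs-repair-off")
                    // Forwarded to the Hadoop conf as fs.gs.implicit.dir.repair.enable;
                    // same effect as --conf spark.hadoop.fs.gs.implicit.dir.repair.enable=false
                    .config("spark.hadoop.fs.gs.implicit.dir.repair.enable", "false")
                    .getOrCreate();

            // hypothetical bucket path, just to exercise a GCS write
            spark.range(10).write().mode("overwrite").parquet("gs://my-bucket/tmp/demo");
            spark.stop();
        }
    }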

github-actions[bot] commented 1 week ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.