kubeflow / spark-operator

Kubernetes operator for managing the lifecycle of Apache Spark applications on Kubernetes.
Apache License 2.0

Driver pod is running, but executors did not start: Caused by: java.net.SocketTimeoutException: timeout #1561

Open · allenhaozi opened 2 years ago

allenhaozi commented 2 years ago
++ id -u
+ myuid=0
++ id -g
+ mygid=0
+ set +e
++ getent passwd 0
+ uidentry=root:x:0:0:root:/root:/bin/bash
+ set -e
+ '[' -z root:x:0:0:root:/root:/bin/bash ']'
+ '[' -z '' ']'
++ java -XshowSettings:properties -version
++ grep java.home
++ awk '{print $3}'
+ JAVA_HOME=/usr/lib/jvm/java-11-openjdk-amd64
+ SPARK_CLASSPATH=':/usr/local/spark/jars/*'
+ env
+ grep SPARK_JAVA_OPT_
+ sort -t_ -k4 -n
+ sed 's/[^=]*=\(.*\)/\1/g'
+ readarray -t SPARK_EXECUTOR_JAVA_OPTS
+ '[' -n '' ']'
+ '[' -z ']'
+ '[' -z ']'
+ '[' -n '' ']'
+ '[' -z ']'
+ '[' -z x ']'
+ SPARK_CLASSPATH='/opt/spark/conf::/usr/local/spark/jars/*'
+ case "$1" in
+ shift 1
+ CMD=("$SPARK_HOME/bin/spark-submit" --conf "spark.driver.bindAddress=$SPARK_DRIVER_BIND_ADDRESS" --deploy-mode client "$@")
+ exec /usr/bin/tini -s -- /usr/local/spark/bin/spark-submit --conf spark.driver.bindAddress=10.42.3.233 --deploy-mode client --properties-file /opt/spark/conf/spark.properties --class org.apache.spark.deploy.PythonRunner local:///home/ailake/work/src/LoadCustomer.py --assign_time 2022-06-15
WARNING: An illegal reflective access operation has occurred
WARNING: Illegal reflective access by org.apache.spark.unsafe.Platform (file:/usr/local/spark-3.2.1-bin-hadoop3.2/jars/spark-unsafe_2.12-3.2.1.jar) to constructor java.nio.DirectByteBuffer(long,int)
WARNING: Please consider reporting this to the maintainers of org.apache.spark.unsafe.Platform
WARNING: Use --illegal-access=warn to enable warnings of further illegal reflective access operations
WARNING: All illegal access operations will be denied in a future release
22/06/28 08:01:49 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
22/06/28 08:01:50 INFO SparkContext: Running Spark version 3.2.1
22/06/28 08:01:51 INFO ResourceUtils: ==============================================================
22/06/28 08:01:51 INFO ResourceUtils: No custom resources configured for spark.driver.
22/06/28 08:01:51 INFO ResourceUtils: ==============================================================
22/06/28 08:01:51 INFO SparkContext: Submitted application: MyApp
22/06/28 08:01:51 INFO ResourceProfile: Default ResourceProfile created, executor resources: Map(cores -> name: cores, amount: 8, script: , vendor: , memory -> name: memory, amount: 4000, script: , vendor: , offHeap -> name: offHeap, amount: 0, script: , vendor: ), task resources: Map(cpus -> name: cpus, amount: 1.0)
22/06/28 08:01:51 INFO ResourceProfile: Limiting resource is cpus at 8 tasks per executor
22/06/28 08:01:51 INFO ResourceProfileManager: Added ResourceProfile id: 0
22/06/28 08:01:51 INFO SecurityManager: Changing view acls to: root,ailake
22/06/28 08:01:51 INFO SecurityManager: Changing modify acls to: root,ailake
22/06/28 08:01:51 INFO SecurityManager: Changing view acls groups to: 
22/06/28 08:01:51 INFO SecurityManager: Changing modify acls groups to: 
22/06/28 08:01:51 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users  with view permissions: Set(root, ailake); groups with view permissions: Set(); users  with modify permissions: Set(root, ailake); groups with modify permissions: Set()
22/06/28 08:01:51 INFO Utils: Successfully started service 'sparkDriver' on port 7078.
22/06/28 08:01:51 INFO SparkEnv: Registering MapOutputTracker
22/06/28 08:01:51 INFO SparkEnv: Registering BlockManagerMaster
22/06/28 08:01:51 INFO BlockManagerMasterEndpoint: Using org.apache.spark.storage.DefaultTopologyMapper for getting topology information
22/06/28 08:01:51 INFO BlockManagerMasterEndpoint: BlockManagerMasterEndpoint up
22/06/28 08:01:51 INFO SparkEnv: Registering BlockManagerMasterHeartbeat
22/06/28 08:01:51 INFO DiskBlockManager: Created local directory at /var/data/spark-eb48620c-3ce4-485c-8b76-8efff63a2e7d/blockmgr-d9067949-8e03-49c1-86c3-4ad0596da84c
22/06/28 08:01:51 INFO MemoryStore: MemoryStore started with capacity 1020.0 MiB
22/06/28 08:01:51 INFO SparkEnv: Registering OutputCommitCoordinator
22/06/28 08:01:51 INFO Utils: Successfully started service 'SparkUI' on port 4040.
22/06/28 08:01:52 INFO SparkUI: Bound SparkUI to 0.0.0.0, and started at http://load-data-1656403191-6649623-95ec8781a9549c42-driver-svc.airflow.svc:4040
22/06/28 08:01:52 INFO SparkKubernetesClientFactory: Auto-configuring K8S client using current context from users K8S config file
22/06/28 08:01:53 INFO ExecutorPodsAllocator: Going to request 3 executors from Kubernetes for ResourceProfile Id: 0, target: 3, known: 0, sharedSlotFromPendingPods: 2147483647.
22/06/28 08:01:53 INFO BasicExecutorFeatureStep: Decommissioning not enabled, skipping shutdown script
22/06/28 08:01:53 INFO Utils: Successfully started service 'org.apache.spark.network.netty.NettyBlockTransferService' on port 7079.
22/06/28 08:01:53 INFO NettyBlockTransferService: Server created on load-data-1656403191-6649623-95ec8781a9549c42-driver-svc.airflow.svc:7079
22/06/28 08:01:53 INFO BlockManager: Using org.apache.spark.storage.RandomBlockReplicationPolicy for block replication policy
22/06/28 08:01:53 INFO BlockManagerMaster: Registering BlockManager BlockManagerId(driver, load-data-1656403191-6649623-95ec8781a9549c42-driver-svc.airflow.svc, 7079, None)
22/06/28 08:01:53 INFO BlockManagerMasterEndpoint: Registering block manager load-data-1656403191-6649623-95ec8781a9549c42-driver-svc.airflow.svc:7079 with 1020.0 MiB RAM, BlockManagerId(driver, load-data-1656403191-6649623-95ec8781a9549c42-driver-svc.airflow.svc, 7079, None)
22/06/28 08:01:53 INFO BlockManagerMaster: Registered BlockManager BlockManagerId(driver, load-data-1656403191-6649623-95ec8781a9549c42-driver-svc.airflow.svc, 7079, None)
22/06/28 08:01:53 INFO BlockManager: Initialized BlockManager: BlockManagerId(driver, load-data-1656403191-6649623-95ec8781a9549c42-driver-svc.airflow.svc, 7079, None)
22/06/28 08:02:03 WARN ExecutorPodsSnapshotsStoreImpl: Exception when notifying snapshot subscriber.
io.fabric8.kubernetes.client.KubernetesClientException: Operation: [create]  for kind: [Pod]  with name: [null]  in namespace: [airflow]  failed.
        at io.fabric8.kubernetes.client.KubernetesClientException.launderThrowable(KubernetesClientException.java:64)
        at io.fabric8.kubernetes.client.KubernetesClientException.launderThrowable(KubernetesClientException.java:72)
        at io.fabric8.kubernetes.client.dsl.base.BaseOperation.create(BaseOperation.java:380)
        at io.fabric8.kubernetes.client.dsl.base.BaseOperation.create(BaseOperation.java:86)
        at org.apache.spark.scheduler.cluster.k8s.ExecutorPodsAllocator.$anonfun$requestNewExecutors$1(ExecutorPodsAllocator.scala:400)
        at scala.collection.immutable.Range.foreach$mVc$sp(Range.scala:158)
        at org.apache.spark.scheduler.cluster.k8s.ExecutorPodsAllocator.requestNewExecutors(ExecutorPodsAllocator.scala:382)
        at org.apache.spark.scheduler.cluster.k8s.ExecutorPodsAllocator.$anonfun$onNewSnapshots$36(ExecutorPodsAllocator.scala:346)
        at org.apache.spark.scheduler.cluster.k8s.ExecutorPodsAllocator.$anonfun$onNewSnapshots$36$adapted(ExecutorPodsAllocator.scala:339)
        at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
        at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
        at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
        at org.apache.spark.scheduler.cluster.k8s.ExecutorPodsAllocator.onNewSnapshots(ExecutorPodsAllocator.scala:339)
        at org.apache.spark.scheduler.cluster.k8s.ExecutorPodsAllocator.$anonfun$start$3(ExecutorPodsAllocator.scala:117)
        at org.apache.spark.scheduler.cluster.k8s.ExecutorPodsAllocator.$anonfun$start$3$adapted(ExecutorPodsAllocator.scala:117)
        at org.apache.spark.scheduler.cluster.k8s.ExecutorPodsSnapshotsStoreImpl$SnapshotsSubscriber.org$apache$spark$scheduler$cluster$k8s$ExecutorPodsSnapshotsStoreImpl$SnapshotsSubscriber$$processSnapshotsInternal(ExecutorPodsSnapshotsStoreImpl.scala:138)
        at org.apache.spark.scheduler.cluster.k8s.ExecutorPodsSnapshotsStoreImpl$SnapshotsSubscriber.processSnapshots(ExecutorPodsSnapshotsStoreImpl.scala:126)
        at org.apache.spark.scheduler.cluster.k8s.ExecutorPodsSnapshotsStoreImpl.$anonfun$addSubscriber$1(ExecutorPodsSnapshotsStoreImpl.scala:81)
        at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515)
        at java.base/java.util.concurrent.FutureTask.runAndReset(FutureTask.java:305)
        at java.base/java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:305)
        at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
        at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
        at java.base/java.lang.Thread.run(Thread.java:829)
Caused by: java.net.SocketTimeoutException: timeout
        at okhttp3.internal.http2.Http2Stream$StreamTimeout.newTimeoutException(Http2Stream.java:672)
        at okhttp3.internal.http2.Http2Stream$StreamTimeout.exitAndThrowIfTimedOut(Http2Stream.java:680)
        at okhttp3.internal.http2.Http2Stream.takeHeaders(Http2Stream.java:153)
        at okhttp3.internal.http2.Http2Codec.readResponseHeaders(Http2Codec.java:131)
        at okhttp3.internal.http.CallServerInterceptor.intercept(CallServerInterceptor.java:88)
        at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:147)
        at okhttp3.internal.connection.ConnectInterceptor.intercept(ConnectInterceptor.java:45)
        at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:147)
        at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:121)
        at okhttp3.internal.cache.CacheInterceptor.intercept(CacheInterceptor.java:93)
        at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:147)
        at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:121)
        at okhttp3.internal.http.BridgeInterceptor.intercept(BridgeInterceptor.java:93)
        at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:147)
        at okhttp3.internal.http.RetryAndFollowUpInterceptor.intercept(RetryAndFollowUpInterceptor.java:127)
        at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:147)
        at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:121)
        at io.fabric8.kubernetes.client.utils.BackwardsCompatibilityInterceptor.intercept(BackwardsCompatibilityInterceptor.java:133)
        at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:147)
        at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:121)
        at io.fabric8.kubernetes.client.utils.TokenRefreshInterceptor.intercept(TokenRefreshInterceptor.java:42)
        at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:147)
        at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:121)
        at io.fabric8.kubernetes.client.utils.ImpersonatorInterceptor.intercept(ImpersonatorInterceptor.java:68)
        at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:147)
        at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:121)
        at io.fabric8.kubernetes.client.utils.HttpClientUtils.lambda$createApplicableInterceptors$6(HttpClientUtils.java:284)
        at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:147)
        at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:121)
        at okhttp3.RealCall.getResponseWithInterceptorChain(RealCall.java:257)
        at okhttp3.RealCall.execute(RealCall.java:93)
        at io.fabric8.kubernetes.client.dsl.base.OperationSupport.handleResponse(OperationSupport.java:541)
        at io.fabric8.kubernetes.client.dsl.base.OperationSupport.handleResponse(OperationSupport.java:504)
        at io.fabric8.kubernetes.client.dsl.base.OperationSupport.handleCreate(OperationSupport.java:292)
        at io.fabric8.kubernetes.client.dsl.base.BaseOperation.handleCreate(BaseOperation.java:893)
        at io.fabric8.kubernetes.client.dsl.base.BaseOperation.create(BaseOperation.java:372)
        ... 21 more
22/06/28 08:02:04 INFO ExecutorPodsAllocator: Going to request 3 executors from Kubernetes for ResourceProfile Id: 0, target: 3, known: 0, sharedSlotFromPendingPods: 2147483647.
22/06/28 08:02:04 INFO BasicExecutorFeatureStep: Decommissioning not enabled, skipping shutdown script
22/06/28 08:02:14 WARN ExecutorPodsSnapshotsStoreImpl: Exception when notifying snapshot subscriber.
dcoliversun commented 2 years ago

It looks like the connection to the APIServer has timed out, so check the APIServer logs.
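If the API server is slow rather than down, one common mitigation is to raise the Kubernetes client timeouts used by the driver when it creates executor pods. A minimal sketch (the `spark.kubernetes.driver.requestTimeout` and `spark.kubernetes.driver.connectionTimeout` settings exist in Spark 3.x with 10000 ms defaults; the application name, namespace, and 30-second values below are illustrative, not taken from this issue):

```yaml
# Hedged sketch: raise the driver-side fabric8 client timeouts so a
# slow API server does not fail executor pod creation after 10 s.
apiVersion: sparkoperator.k8s.io/v1beta2
kind: SparkApplication
metadata:
  name: load-data        # hypothetical name for illustration
  namespace: airflow
spec:
  sparkConf:
    # Driver-side Kubernetes client timeouts, in milliseconds.
    "spark.kubernetes.driver.connectionTimeout": "30000"
    "spark.kubernetes.driver.requestTimeout": "30000"
```

This only papers over latency; if the API server is genuinely overloaded, its own logs and metrics are still the place to look, as suggested above.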

allenhaozi commented 2 years ago

Thank you @dcoliversun. If I kill the driver pod manually, it works again. It is an intermittent event.

github-actions[bot] commented 1 week ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.