kubeflow / spark-operator

Kubernetes operator for managing the lifecycle of Apache Spark applications on Kubernetes.
Apache License 2.0
2.78k stars 1.38k forks source link

Executor : Failed to connect to spark-pi-1610000110552-driver-svc.namespace.svc:7078 #1124

Open SandishKumarHN opened 3 years ago

SandishKumarHN commented 3 years ago

Error:

21/01/07 06:15:30 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(root); groups with view permissions: Set(); users with modify permissions: Set(root); groups with modify permissions: Set() Exception in thread "main" java.lang.reflect.UndeclaredThrowableException at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1713) at org.apache.spark.deploy.SparkHadoopUtil.runAsSparkUser(SparkHadoopUtil.scala:64) at org.apache.spark.executor.CoarseGrainedExecutorBackend$.run(CoarseGrainedExecutorBackend.scala:188) at org.apache.spark.executor.CoarseGrainedExecutorBackend$.main(CoarseGrainedExecutorBackend.scala:285) at org.apache.spark.executor.CoarseGrainedExecutorBackend.main(CoarseGrainedExecutorBackend.scala) Caused by: org.apache.spark.SparkException: Exception thrown in awaitResult: at org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:226) at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:75) at org.apache.spark.rpc.RpcEnv.setupEndpointRefByURI(RpcEnv.scala:101) at org.apache.spark.executor.CoarseGrainedExecutorBackend$$anonfun$run$1.apply$mcV$sp(CoarseGrainedExecutorBackend.scala:201) at org.apache.spark.deploy.SparkHadoopUtil$$anon$2.run(SparkHadoopUtil.scala:65) at org.apache.spark.deploy.SparkHadoopUtil$$anon$2.run(SparkHadoopUtil.scala:64) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:422) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1698) ... 4 more Caused by: java.io.IOException: Failed to connect to spark-pi-1610000110552-driver-svc.optik-ae.svc:7078 at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:245) at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:187) at org.apache.spark.rpc.netty.NettyRpcEnv.createClient(NettyRpcEnv.scala:198) at org.apache.spark.rpc.netty.Outbox$$anon$1.call(Outbox.scala:194) at org.apache.spark.rpc.netty.Outbox$$anon$1.call(Outbox.scala:190) at java.util.concurrent.FutureTask.run(FutureTask.java:266) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) at java.lang.Thread.run(Thread.java:748) Caused by: java.net.UnknownHostException: spark-pi-1610000110552-driver-svc.optik-ae.svc at java.net.InetAddress.getAllByName0(InetAddress.java:1281) at java.net.InetAddress.getAllByName(InetAddress.java:1193) at java.net.InetAddress.getAllByName(InetAddress.java:1127) at java.net.InetAddress.getByName(InetAddress.java:1077) at io.netty.util.internal.SocketUtils$8.run(SocketUtils.java:146) at io.netty.util.internal.SocketUtils$8.run(SocketUtils.java:143) at java.security.AccessController.doPrivileged(Native Method) at io.netty.util.internal.SocketUtils.addressByName(SocketUtils.java:143) at io.netty.resolver.DefaultNameResolver.doResolve(DefaultNameResolver.java:43) at io.netty.resolver.SimpleNameResolver.resolve(SimpleNameResolver.java:63) at io.netty.resolver.SimpleNameResolver.resolve(SimpleNameResolver.java:55) at io.netty.resolver.InetSocketAddressResolver.doResolve(InetSocketAddressResolver.java:57) at io.netty.resolver.InetSocketAddressResolver.doResolve(InetSocketAddressResolver.java:32) at io.netty.resolver.AbstractAddressResolver.resolve(AbstractAddressResolver.java:108) at io.netty.bootstrap.Bootstrap.doResolveAndConnect0(Bootstrap.java:202) at io.netty.bootstrap.Bootstrap.access$000(Bootstrap.java:48) at io.netty.bootstrap.Bootstrap$1.operationComplete(Bootstrap.java:182) at io.netty.bootstrap.Bootstrap$1.operationComplete(Bootstrap.java:168) at io.netty.util.concurrent.DefaultPromise.notifyListener0(DefaultPromise.java:577) at io.netty.util.concurrent.DefaultPromise.notifyListenersNow(DefaultPromise.java:551) at io.netty.util.concurrent.DefaultPromise.notifyListeners(DefaultPromise.java:490) at io.netty.util.concurrent.DefaultPromise.setValue0(DefaultPromise.java:615) at io.netty.util.concurrent.DefaultPromise.setSuccess0(DefaultPromise.java:604) at io.netty.util.concurrent.DefaultPromise.trySuccess(DefaultPromise.java:104) at io.netty.channel.DefaultChannelPromise.trySuccess(DefaultChannelPromise.java:84) at io.netty.channel.AbstractChannel$AbstractUnsafe.safeSetSuccess(AbstractChannel.java:985) at io.netty.channel.AbstractChannel$AbstractUnsafe.register0(AbstractChannel.java:505) at io.netty.channel.AbstractChannel$AbstractUnsafe.access$200(AbstractChannel.java:416) at io.netty.channel.AbstractChannel$AbstractUnsafe$1.run(AbstractChannel.java:475) at io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:163) at io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:510) at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:518) at io.netty.util.concurrent.SingleThreadEventExecutor$6.run(SingleThreadEventExecutor.java:1044) at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74) at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)

Service description:

ubuntu$ kubectl describe svc spark-pi-ui-svc -n optik-ae Name: spark-pi-ui-svc Namespace: optik-ae Labels: sparkoperator.k8s.io/app-name=spark-pi sparkoperator.k8s.io/submission-id=49d67c2f-3166-4ce3-aa4a-3058baa2108e Annotations: Selector: spark-role=driver,sparkoperator.k8s.io/app-name=spark-pi Type: ClusterIP IP: 10.3.2.245 Port: spark-driver-ui-port 4040/TCP TargetPort: 4040/TCP Endpoints: 10.2.174.72:4040 Session Affinity: None Events:

ubuntu$ kubectl describe svc spark-pi-1610000110552-driver-svc -n optik-ae Name: spark-pi-1610000110552-driver-svc Namespace: optik-ae Labels: Annotations: Selector: spark-app-selector=spark-7554d84c993644f6b2838ac0c2fd5431,spark-role=driver,sparkoperator.k8s.io/app-name=spark-pi,sparkoperator.k8s.io/launched-by-spark-operator=true,sparkoperator.k8s.io/submission-id=49d67c2f-3166-4ce3-aa4a-3058baa2108e,version=2.4.5 Type: ClusterIP IP: None Port: driver-rpc-port 7078/TCP TargetPort: 7078/TCP Endpoints: 10.2.174.72:7078 Port: blockmanager 7079/TCP TargetPort: 7079/TCP Endpoints: 10.2.174.72:7079 Session Affinity: None Events:

As you see spark-pi-1610000110552-driver-svc service did not get assigned a clusterIP, is that the problem causing executor to throw Failed to connect to spark-pi-1610000110552-driver-svc.optik-ae.svc:7078

Can someone help me?

SandishKumarHN commented 3 years ago

we had an istio enabled on our k8s after adding this annotation everything started working "sidecar.istio.io/inject"] = "false"

rubenbarragan commented 3 years ago

Can you please elaborate more on your settings? What's the version of your operator? Is it a local cluster, a provider?

github-actions[bot] commented 2 weeks ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.