AceHack opened this issue 4 years ago
@AceHack It's a little different in Istio. Istio doesn't support headless Kubernetes services by default. @liyinan926 Should we expose how to create the driver service?
Istio does support headless services as of 1.4 https://github.com/istio/istio/issues/7495
Istio does not work with Job-like workloads in Kubernetes: https://github.com/istio/istio/issues/6324
The problem is that Spark would need to use something like envoy-preflight to wait for Envoy to come up, and then shut the Envoy sidecar down when the Spark application ends.
https://github.com/monzo/envoy-preflight
We are using envoy-preflight for all other jobs in Kubernetes.
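For anyone wiring this up, here is a minimal sketch of how envoy-preflight is typically used for a Job-style workload, following the pattern in the monzo/envoy-preflight README. The Job name, image, and spark-submit arguments below are placeholders; the non-placeholder pieces are the envoy-preflight wrapper itself and the ENVOY_ADMIN_API variable, which tells it to wait for Envoy and to ask Envoy to quit when the wrapped process exits.

apiVersion: batch/v1
kind: Job
metadata:
  name: spark-submit-job                 # placeholder name
spec:
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: main
          # placeholder image; it must bundle the envoy-preflight binary
          image: registry.example.com/spark-with-envoy-preflight:latest
          env:
            # Istio's Envoy admin API listens on localhost:15000 inside the pod
            - name: ENVOY_ADMIN_API
              value: http://127.0.0.1:15000
          # envoy-preflight blocks until Envoy is live, runs the command,
          # then POSTs to the admin API so the sidecar exits with the job
          command: ["envoy-preflight"]
          args:
            - /opt/spark/bin/spark-submit        # placeholder command
            - local:///opt/app/app.jar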
One workaround we tested is configuring the application to run outside the Istio service mesh.
Running outside the service mesh is not really a valid workaround; we want all the observability a service mesh brings for our Spark jobs.
Working with the upstream community, we have added configuration workarounds for Zookeeper along with Cassandra, Elasticsearch, Redis, and Apache NiFi. I am sure there are additional applications that are not compatible with sidecars. Please inform the community if you know of any.
https://www.cncf.io/blog/2020/10/26/service-mesh-is-still-hard/
We have yet to really dig into the enhancements in Spark 3, but we may need to accept that we are part of the "applications that are not compatible with sidecars."
Hi, what exactly is incompatible? We added an Istio waiter to the image and the executor starts, but then fails with:
20/11/19 07:55:58 INFO TransportClientFactory: Successfully created connection to spark.spark.svc/10.X.X.X:7078 after 99 ms (0 ms spent in bootstraps)
20/11/19 07:55:58 ERROR TransportResponseHandler: Still have 1 requests outstanding when connection from spark.spark.svc/10.X.X.X:7078 is closed
Exception in thread "main" java.lang.reflect.UndeclaredThrowableException
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1713)
    at org.apache.spark.deploy.SparkHadoopUtil.runAsSparkUser(SparkHadoopUtil.scala:64)
    at org.apache.spark.executor.CoarseGrainedExecutorBackend$.run(CoarseGrainedExecutorBackend.scala:188)
    at org.apache.spark.executor.CoarseGrainedExecutorBackend$.main(CoarseGrainedExecutorBackend.scala:281)
    at org.apache.spark.executor.CoarseGrainedExecutorBackend.main(CoarseGrainedExecutorBackend.scala)
Caused by: org.apache.spark.SparkException: Exception thrown in awaitResult:
    at org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:226)
    at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:75)
    at org.apache.spark.rpc.RpcEnv.setupEndpointRefByURI(RpcEnv.scala:101)
    at org.apache.spark.executor.CoarseGrainedExecutorBackend$$anonfun$run$1.apply$mcV$sp(CoarseGrainedExecutorBackend.scala:201)
    at org.apache.spark.deploy.SparkHadoopUtil$$anon$2.run(SparkHadoopUtil.scala:65)
    at org.apache.spark.deploy.SparkHadoopUtil$$anon$2.run(SparkHadoopUtil.scala:64)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:422)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1698)
    ... 4 more
Caused by: java.io.IOException: Connection from spark.spark.svc/10.X.X.X:7078 closed
    at org.apache.spark.network.client.TransportResponseHandler.channelInactive(TransportResponseHandler.java:146)
    at org.apache.spark.network.server.TransportChannelHandler.channelInactive(TransportChannelHandler.java:108)
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:245)
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:231)
    at io.netty.channel.AbstractChannelHandlerContext.fireChannelInactive(AbstractChannelHandlerContext.java:224)
    at io.netty.channel.ChannelInboundHandlerAdapter.channelInactive(ChannelInboundHandlerAdapter.java:75)
    at io.netty.handler.timeout.IdleStateHandler.channelInactive(IdleStateHandler.java:277)
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:245)
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:231)
    at io.netty.channel.AbstractChannelHandlerContext.fireChannelInactive(AbstractChannelHandlerContext.java:224)
    at io.netty.channel.ChannelInboundHandlerAdapter.channelInactive(ChannelInboundHandlerAdapter.java:75)
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:245)
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:231)
    at io.netty.channel.AbstractChannelHandlerContext.fireChannelInactive(AbstractChannelHandlerContext.java:224)
    at io.netty.channel.ChannelInboundHandlerAdapter.channelInactive(ChannelInboundHandlerAdapter.java:75)
    at org.apache.spark.network.util.TransportFrameDecoder.channelInactive(TransportFrameDecoder.java:182)
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:245)
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:231)
    at io.netty.channel.AbstractChannelHandlerContext.fireChannelInactive(AbstractChannelHandlerContext.java:224)
    at io.netty.channel.DefaultChannelPipeline$HeadContext.channelInactive(DefaultChannelPipeline.java:1354)
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:245)
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:231)
    at io.netty.channel.DefaultChannelPipeline.fireChannelInactive(DefaultChannelPipeline.java:917)
    at io.netty.channel.AbstractChannel$AbstractUnsafe$8.run(AbstractChannel.java:822)
    at io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:163)
    at io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:403)
    at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:463)
    at io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:858)
    at io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:138)
    at java.lang.Thread.run(Thread.java:748)
What could be causing this?
It's discussed here: https://github.com/istio/istio/issues/27900. So maybe Istio has a fix upstream but has not released it yet.
There's nothing for Istio to fix there - running Spark in vanilla standalone cluster mode through Istio just requires a small amount of Spark master configuration. I've tested Spark and Istio in that fashion and it works fine; workers talk to masters through Istio via headless services.
It seems like there are actually two potential smaller issues here:
1. Spark workers can't talk to Spark masters with Istio and spark-on-k8s-operator - if this is an issue people are seeing, it's a Spark configuration issue, not a bug on the Istio side; logs would be a big help.
2. Istio (really pod-level Envoy proxies) doesn't work with transient workloads like Jobs - this is correct, and there are a variety of workarounds for it (one is sketched below). It's unlikely Istio will come up with a fix for this, since Envoy proxies by design do not run to completion and exit with a return code of 0, and workloads being able to terminate their own proxy is a violation of the security model - not something I see Istio itself supporting in the long term (though k8s itself might, eventually).
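To make the startup half of those workarounds concrete: recent Istio releases let you ask the injected sidecar to hold the application container until Envoy is ready, via a per-pod proxy config override. A minimal sketch follows; the pod name, image, and command are placeholders, and note this only covers startup ordering, not sidecar shutdown:

apiVersion: v1
kind: Pod
metadata:
  name: spark-driver-example             # placeholder
  annotations:
    # ask the injected sidecar to delay the app container until Envoy is ready
    proxy.istio.io/config: |
      holdApplicationUntilProxyStarts: true
spec:
  containers:
    - name: spark
      image: registry.example.com/spark:latest   # placeholder
      command: ["/opt/spark/bin/spark-submit"]
      args: ["local:///opt/app/app.jar"]         # placeholder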
Issue 2 is the problem I'm seeing. Also, what is the master configuration you speak of?
As @bleggett said, issue 2 has a number of workarounds; @AceHack already mentioned envoy-preflight. For issue 1: in Spark apps created by the Spark operator, and in general for Spark on k8s, the driver listens on the pod IP. This is hardcoded in the Spark code:
https://github.com/apache/spark/blob/3a299aa6480ac22501512cd0310d31a441d7dfdc/resource-managers/kubernetes/core/src/main/scala/org/apache/spark/deploy/k8s/features/DriverServiceFeatureStep.scala#L34
Istio has depended on the primary container listening on 127.0.0.1. istio/istio#27900 is a fix for that; with it, Istio should be able to work when the driver pod listens on the pod IP.
Spark master and workers are a different deployment model; in a way they are independent of being deployed on Kubernetes or Mesos, since they do not use the platform to create workers/executors.
@AceHack @puneetloya For vanilla Spark, all you have to do is set the SPARK_MASTER_HOST environment variable to 127.0.0.1 - no Istio changes are required.
Example values.yaml I pass to the Bitnami Spark chart:

master:
  extraEnvVars:
    - name: SPARK_MASTER_HOST
      value: "127.0.0.1"
That's all you need for vanilla Spark and Istio to play nice. istio/istio#27900 isn't strictly necessary for issue 1, since Spark already supports binding to local addresses out of the box with the right config.
@bleggett I think this advice is outdated as of Istio 1.10, which changes the default destination Envoy forwards inbound packets to from lo to eth0.
Also on this topic, can we make this operator name the ports according to their protocols?
$ istioctl analyze -n flows
Info [IST0118] (Service sparkpi-py-0ae54679adebbb65-driver-svc.flows) Port name blockmanager (port: 7079, targetPort: 7079) doesn't follow the naming convention of Istio port.
Info [IST0118] (Service sparkpi-py-0ae54679adebbb65-driver-svc.flows) Port name driver-rpc-port (port: 7078, targetPort: 7078) doesn't follow the naming convention of Istio port.
Info [IST0118] (Service sparkpi-py-0ae54679adebbb65-driver-svc.flows) Port name spark-ui (port: 4040, targetPort: 4040) doesn't follow the naming convention of Istio port.
Info [IST0118] (Service sparkpi-py-ui-svc.flows) Port name spark-driver-ui-port (port: 4040, targetPort: 4040) doesn't follow the naming convention of Istio port.
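For reference, Istio's convention is <protocol>[-<suffix>] for port names, so a driver Service that satisfied the analyzer would look roughly like the sketch below. The names here are purely illustrative - as noted further down the thread, the Service is generated by Spark itself rather than by the operator:

apiVersion: v1
kind: Service
metadata:
  name: sparkpi-py-0ae54679adebbb65-driver-svc
  namespace: flows
spec:
  clusterIP: None
  ports:
    - name: tcp-driver-rpc        # "tcp-" prefix satisfies IST0118
      port: 7078
      targetPort: 7078
    - name: tcp-blockmanager
      port: 7079
      targetPort: 7079
    - name: http-spark-ui         # the UI speaks HTTP
      port: 4040
      targetPort: 4040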
@haf Does #1239 address this?
Info [IST0118] (Service sparkpi-py-0ae54679adebbb65-driver-svc.flows) Port name driver-rpc-port (port: 7078, targetPort: 7078) doesn't follow the naming convention of Istio port.
AFAIK, the naming of ports is not controlled by the Spark operator. It's a code change required in the Apache Spark distribution (this was valid at least up to Spark 2.4; I am not sure if it changed in Spark 3). The headless service is created by the driver once it is up, to make itself discoverable to executors.
@jkleckner Yes, but it would probably make sense to change the default name, too.
Asking here, since this seems to be the relevant thread. I'm currently trying to make the Spark operator work with the following stack:
Whatever I do, the executor fails with the following error:
21/08/24 16:43:47 INFO TransportClientFactory: Successfully created connection to spark-pi-bb55307b78fb84d6-driver-svc.spark.svc/192.168.5.156:7078 after 151 ms (0 ms spent in bootstraps)
21/08/24 16:43:47 ERROR TransportClient: Failed to send RPC RPC 7365963296669017718 to spark-pi-bb55307b78fb84d6-driver-svc.spark.svc/192.168.5.156:7078: java.nio.channels.ClosedChannelException
java.nio.channels.ClosedChannelException
at io.netty.channel.AbstractChannel$AbstractUnsafe.newClosedChannelException(AbstractChannel.java:957)
at io.netty.channel.AbstractChannel$AbstractUnsafe.write(AbstractChannel.java:865)
...
I'm not sure what I'm doing wrong: from various comments on this thread and others, this should work, no? The Istio issue with the driver listening on the pod IP rather than 127.0.0.1 has been fixed in the versions I'm using.
If anyone has an idea, I'd be very thankful!
Hi @MrTrustor, did you resolve your issue? I am running into the same thing, though with a slightly older Istio, so I am setting SPARK_MASTER_HOST.
No, I haven't. We had to run Spark without istio in the end.
Failed to send RPC RPC xx to xx: java.nio.channels.ClosedChannelException
FWIW, I was able to get executor->driver communication to work by overriding the env var SPARK_DRIVER_BIND_ADDRESS to 127.0.0.1 on the driver. The default is hardcoded to PodIP and was not working with our version of istio.
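For anyone driving this through the operator, here is a sketch of how that override can be expressed on a SparkApplication, assuming the driver spec's envVars field; the application name, image, and main file are placeholders:

apiVersion: "sparkoperator.k8s.io/v1beta2"
kind: SparkApplication
metadata:
  name: spark-pi                                      # placeholder
  namespace: spark
spec:
  type: Python
  mode: cluster
  image: registry.example.com/spark:latest            # placeholder
  mainApplicationFile: local:///opt/spark/examples/src/main/python/pi.py
  driver:
    cores: 1
    envVars:
      # bind the driver on localhost instead of the pod IP
      SPARK_DRIVER_BIND_ADDRESS: "127.0.0.1"
  executor:
    cores: 1
    instances: 2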
We were facing the same error as well (org.apache.spark.shuffle.FetchFailedException: Failed to send RPC RPC 5700455101843038265 to /xxx: java.nio.channels.ClosedChannelException), and we were able to get communication between the driver and executor pods up and running by pinning the block manager port:

spark.blockManager.port: "7080"

and by creating the following headless Service for the executors:
apiVersion: v1
kind: Service
metadata:
  name: spark-executor
spec:
  clusterIP: None
  ports:
    - name: blockmanager
      protocol: TCP
      port: 7080
  selector:
    spark-role: executor
Note that it doesn't have to be port 7080; it can be any port number. By default Spark assigns a random port.
Credits to @dayyeung
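If you apply this through the operator, the fixed block manager port can be supplied via the SparkApplication's sparkConf map - a minimal fragment, assuming the rest of the spec is already in place:

spec:
  sparkConf:
    spark.blockManager.port: "7080"     # must match the port exposed by the spark-executor Service above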
When Istio is set to auto-inject, the executors fail to talk to the driver and jobs never finish. Please add first-class support for Istio.