Dlougach opened this issue 1 year ago
Hi, I've developed a module for deploying the latest 3.4.0 server-client model with support for direct PySpark Session connections!
Perhaps you could try this first, although it is based on native Spark executor scheduling rather than the operator.
https://github.com/Wh1isper/sparglim#spark-connect-server-on-k8s
@Dlougach If you wish to use the operator, there is a way.
First, you need to make a wrapper class around SparkConnectServer. For example:
```scala
package my.connect.server

import org.apache.spark.sql.connect.service.SparkConnectServer

object MyCoolSparkConnectServer {
  def main(args: Array[String]): Unit = {
    // you can initialize Spark stuff here; the server will inherit the configs
    SparkConnectServer.main(args)
  }
}
```
This will circumvent the submission error you may have gotten if you tried to run SparkConnectServer directly. From my investigation, that looks to be an arbitrary decision within spark-submit that cluster mode is "not applicable" to Spark Connect, which is sort of true except when using this operator :)
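In case it helps, here is a minimal sketch of what submitting that wrapper through the operator could look like, using the Kubernetes Python client to create the SparkApplication object. The image name, jar path, namespace, and service account below are hypothetical placeholders, not anything this operator prescribes.

```python
# Minimal sketch: create a SparkApplication that runs the wrapper class above.
# All concrete names (image, jar path, namespace, service account) are placeholders.
from kubernetes import client, config

config.load_kube_config()  # or config.load_incluster_config() when running in-cluster

spark_app = {
    "apiVersion": "sparkoperator.k8s.io/v1beta2",
    "kind": "SparkApplication",
    "metadata": {"name": "spark-connect-server", "namespace": "spark"},
    "spec": {
        "type": "Scala",
        "mode": "cluster",
        "image": "my-registry/spark-connect:3.4.0",  # placeholder image with the wrapper jar baked in
        "mainClass": "my.connect.server.MyCoolSparkConnectServer",
        "mainApplicationFile": "local:///opt/spark/jars/my-connect-server.jar",  # placeholder jar path
        "sparkVersion": "3.4.0",
        "driver": {"cores": 1, "memory": "2g", "serviceAccount": "spark-operator-spark"},
        "executor": {"instances": 2, "cores": 1, "memory": "2g"},
    },
}

client.CustomObjectsApi().create_namespaced_custom_object(
    group="sparkoperator.k8s.io",
    version="v1beta2",
    namespace="spark",
    plural="sparkapplications",
    body=spark_app,
)
```

The same spec can of course be applied as plain YAML with kubectl; the relevant bits are mode: cluster and mainClass pointing at the wrapper object.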
After that the job should spin up, but to get it working in-cluster you need to add a service component (you could just use the driver pod name directly but that feels icky to me)
```yaml
apiVersion: v1
kind: Service
metadata:
  name: <service-name>
spec:
  type: ClusterIP
  selector:
    spark-role: driver
    sparkoperator.k8s.io/app-name: <name-of-spark-app>
  ports:
    - name: grpc
      port: 15002
      targetPort: 15002
      protocol: TCP
```
With this you can use sc://<service-name>.<spark-app-namespace>.svc.cluster.local:15002 as the connect URL.
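On the client side, connecting to that Service looks roughly like the sketch below, assuming pyspark is installed with the connect extras and the placeholder Service name and namespace match what you created above.

```python
# Minimal sketch: connect a PySpark client to the in-cluster Spark Connect server.
# "spark-connect" and "spark" are placeholders for the Service name and namespace.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .remote("sc://spark-connect.spark.svc.cluster.local:15002")
    .getOrCreate()
)

# Quick smoke test against the remote server
spark.range(10).selectExpr("sum(id) AS total").show()
```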
I also think that due to https://github.com/kubeflow/spark-operator/issues/1404, routing through Istio or other service meshes might not work yet. Since the container port on the driver isn't passed through to the pods, I don't think Istio treats the route as gRPC. When going through the Istio ingress it hangs, though I haven't tried patching our CRDs yet.
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
This issue has been automatically closed because it has not had recent activity. Please comment "/reopen" to reopen it.
/reopen
@sumanth-manchala: You can't reopen an issue/PR unless you authored it or you are a collaborator.
Now that Spark Connect is reaching GA with Spark 4.0.0, can we please have native support for Spark Connect? That should also cover using Spark in client mode via this operator.
/reopen
@ChenYi015: Reopened this issue.
Maybe the solution would be to create a new CRD (SparkConnectServer/Application)? I think the existing CRD covers 99% of what is needed; the main change would be replacing the spark-submit command with start-connect-server.sh plus a spark properties file generated from the CRD, for example like here. A rough sketch of that idea follows.
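To illustrate the idea (this is not an actual operator implementation), a controller could render the CRD's spark conf into a properties file and launch the stock Connect start script with it. The sketch below assumes a standard Spark distribution where sbin/start-connect-server.sh forwards extra arguments to spark-submit, which understands --properties-file; the spec fields are made up for the example.

```python
# Illustrative sketch only: turn a CRD-like spec into a properties file plus a
# start command for the Spark Connect server. Field names here are hypothetical.
import subprocess
from pathlib import Path

spec = {
    "sparkVersion": "4.0.0",
    "sparkConf": {
        "spark.master": "k8s://https://kubernetes.default.svc",
        "spark.kubernetes.namespace": "spark",
        "spark.kubernetes.container.image": "my-registry/spark:4.0.0",  # placeholder
        "spark.executor.instances": "2",
    },
}

def render_properties(conf: dict, path: Path) -> Path:
    # spark-defaults style properties file: one "key value" pair per line
    path.write_text("\n".join(f"{k} {v}" for k, v in sorted(conf.items())) + "\n")
    return path

props = render_properties(spec["sparkConf"], Path("/tmp/spark-connect.conf"))

# Assumes the start script passes extra args through to spark-submit.
subprocess.run(
    ["/opt/spark/sbin/start-connect-server.sh", "--properties-file", str(props)],
    check=True,
)
```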
This is a very important feature, and I do agree with @whitleykeith that there is no reason why this cannot be run in cluster mode. Also @whitleykeith, say I include that wrapper class and build a jar: what would the SparkApplication manifest file look like?
I wanted to jump in here regarding the discussion about running Spark Connect in cluster mode. While this feature is undoubtedly important, I believe there may be some underlying issues when using Kubernetes as the cluster manager for Spark Connect.
For example, SPARK-45769 reports data retrieval failures on executors when using Spark Connect, and SPARK-46032 points out issues related to lambda serialization. Additionally, I have personally encountered issues when using UDFs with Spark Connect in cluster mode.
These problems could be the reason why Spark Connect is currently limited to standalone mode, despite its potential.
I’ve also raised a similar question in the Spark user mailing list, but I have not yet received a clear answer. You can check out the thread here: Apache Spark User Mailing List.
Spark Connect is not limited to standalone mode; you can run it on Kubernetes by running the Spark Connect server in a pod and adding an additional Service resource for the server pod (as pointed out in https://github.com/kubeflow/spark-operator/issues/1801#issuecomment-2000494607). (We use custom-made Helm charts for that.)
You just cannot use spark-submit --deploy-mode cluster to do that.
I didn't see any discussion of why it was made that way, but spark-submit is usually used to submit jobs, not to provision long-running clusters, so maybe that drove the decision. I don't think the reason was bugs in Spark Connect, though, as there are not many differences, from the Spark Connect point of view, between running in standalone mode and running on Kubernetes.
As for whether there are any issues running Spark Connect on Kubernetes: we have been running Spark Connect on k8s in production for over a year now, and while there are many Spark Connect bugs and limitations, we haven't encountered any issues specifically related to Kubernetes as a cluster manager.
BTW I'm working on this issue
It would be nice to either support the Spark Connect server natively within the Spark operator or at least provide a tutorial on how to set it up (it effectively takes the place of the driver).