kubeflow / spark-operator

Kubernetes operator for managing the lifecycle of Apache Spark applications on Kubernetes.

Spark Connect support #1801

Open Dlougach opened 1 year ago

Dlougach commented 1 year ago

It would be nice to either support the Spark Connect server natively within the Spark operator or at least provide a tutorial on how to set it up (it supposedly replaces the driver).

Wh1isper commented 1 year ago

Hi, I've developed a module for deploying the new Spark 3.4.0 server-client model, with support for direct PySpark session connections!

Perhaps you could try this first, although it's based on native Spark executor scheduling rather than the operator.

https://github.com/Wh1isper/sparglim#spark-connect-server-on-k8s

whitleykeith commented 8 months ago

@Dlougach If you wish to use the operator, there is a way.

First, you need to make a wrapper class around SparkConnectServer. For example:

package my.connect.server

import org.apache.spark.sql.connect.service.SparkConnectServer

object MyCoolSparkConnectServer {
  def main(args: Array[String]): Unit = {
    // You can initialize Spark configuration here; the server will inherit it.
    SparkConnectServer.main(args)
  }
}

This circumvents the submission error you may have gotten if you tried to run SparkConnectServer directly. From my investigation, that looks to be an arbitrary decision within spark-submit that cluster mode is "not applicable" to Spark Connect, which is sort of true, except when using this operator :)
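
For reference, a rough sketch of the SparkApplication that would run the wrapper (the image, jar path, and names are placeholders for whatever you build, not a tested config):

apiVersion: sparkoperator.k8s.io/v1beta2
kind: SparkApplication
metadata:
  name: <name-of-spark-app>
spec:
  type: Scala
  mode: cluster
  image: <your-image-containing-the-wrapper-jar>
  mainClass: my.connect.server.MyCoolSparkConnectServer
  mainApplicationFile: local:///opt/spark/jars/<your-wrapper>.jar
  sparkVersion: "3.5.0"
  driver:
    cores: 1
    memory: 2g
    serviceAccount: spark
  executor:
    instances: 2
    cores: 1
    memory: 2g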

After that the job should spin up, but to get it working in-cluster you need to add a Service resource (you could just use the driver pod name directly, but that feels icky to me):

apiVersion: v1
kind: Service
metadata:
  name: <service-name>
spec:
  type: ClusterIP
  selector:
    spark-role: driver
    sparkoperator.k8s.io/app-name: <name-of-spark-app>
  ports:
    - name: grpc
      port: 15002
      targetPort: 15002
      protocol: TCP

With this you can use sc://<service-name>.<spark-app-namespace>.svc.cluster.local:15002.
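
For example, from a client pod in the cluster (a minimal sketch, assuming pyspark with the connect extras installed):

from pyspark.sql import SparkSession

# Connect to the Spark Connect server exposed by the Service above.
spark = (
    SparkSession.builder
    .remote("sc://<service-name>.<spark-app-namespace>.svc.cluster.local:15002")
    .getOrCreate()
)

# Queries run against the in-cluster server rather than a local JVM.
spark.range(10).show()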

I also think that due to https://github.com/kubeflow/spark-operator/issues/1404, routing through Istio or other service meshes might not work yet. Since the container port on the driver isn't passed to the pods, I think Istio doesn't treat the route as gRPC. When trying to go through the Istio ingress it hangs, though I haven't tried patching our CRDs yet.

github-actions[bot] commented 3 months ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

github-actions[bot] commented 2 months ago

This issue has been automatically closed because it has not had recent activity. Please comment "/reopen" to reopen it.

sumanth-manchala commented 2 months ago

/reopen

google-oss-prow[bot] commented 2 months ago

@sumanth-manchala: You can't reopen an issue/PR unless you authored it or you are a collaborator.

In response to [this](https://github.com/kubeflow/spark-operator/issues/1801#issuecomment-2354583172): >/reopen

sumanth-manchala commented 2 months ago

Now that Spark Connect is reaching GA with Spark 4.0.0, can we please have a native way to support Spark Connect? That should also solve running Spark in client mode via this operator.

ChenYi015 commented 2 months ago

/reopen

google-oss-prow[bot] commented 2 months ago

@ChenYi015: Reopened this issue.

In response to [this](https://github.com/kubeflow/spark-operator/issues/1801#issuecomment-2357330178): >/reopen

aagumin commented 1 month ago

Maybe the solution would be to create a new CRD (SparkConnectServer/Application)? I think the existing CRD covers 99% of what's needed; we would just replace the spark-submit command with start-connect-server.sh plus a spark-properties.conf generated from the CRD.

For example, like here:

https://github.com/aagumin/spark-connect-kubernetes/blob/main/charts/spark-connect/templates/stateful-set.yaml#L112
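
Purely as a hypothetical sketch (none of these fields exist in the operator today), such a CRD could look something like:

apiVersion: sparkoperator.k8s.io/v1alpha1
kind: SparkConnectServer   # hypothetical new kind
metadata:
  name: my-connect-server
spec:
  sparkVersion: "4.0.0"
  image: spark:4.0.0
  server:
    port: 15002
  # rendered into spark-properties.conf and passed to start-connect-server.sh
  sparkConf:
    spark.dynamicAllocation.enabled: "true"
  executor:
    instances: 2
    memory: 2g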

kant777 commented 1 month ago

This is a very important feature, and I agree with @whitleykeith that there is no reason this cannot run in cluster mode. Also @whitleykeith, say I include that wrapper class and build a jar; what would the SparkApplication manifest file look like?

yassan commented 1 month ago

I wanted to jump in here regarding the discussion about running Spark Connect in cluster mode. While this feature is undoubtedly important, I believe there may be some underlying issues when using Kubernetes as the cluster manager for Spark Connect.

For example, SPARK-45769 reports data retrieval failures on executors when using Spark Connect, and SPARK-46032 points out issues related to lambda serialization. Additionally, I have personally encountered issues when using UDFs with Spark Connect in cluster mode.

These problems could be the reason why Spark Connect is currently limited to standalone mode, despite its potential.

I've also raised a similar question on the Spark user mailing list, but I have not yet received a clear answer. You can check out the thread here: Apache Spark User Mailing List.

vakarisbk commented 1 month ago

Spark Connect is not limited to standalone mode; you can run it on Kubernetes by running the Spark Connect server in a pod and adding an additional Service resource for the server pod (as pointed out in https://github.com/kubeflow/spark-operator/issues/1801#issuecomment-2000494607). (We use custom-made Helm charts for that.)

You just cannot use spark-submit --deploy-mode cluster to do that.

I didn't see any discussion of why it was made that way, but spark-submit is usually used to submit jobs, not to provision long-running clusters, so maybe that drove the decision. I don't think the reason was bugs in Spark Connect, though, as there are not that many differences, from Spark Connect's point of view, between running in standalone mode and running on Kubernetes.

As for whether there are any issues running Spark Connect on Kubernetes: we have been running Spark Connect on k8s in production for over a year now. There are many Spark Connect bugs and limitations, but we haven't encountered any issues specifically related to Kubernetes as a cluster manager.

vakarisbk commented 2 weeks ago

BTW, I'm working on this issue.