kubeflow / spark-operator

Kubernetes operator for managing the lifecycle of Apache Spark applications on Kubernetes.

Add config/instructions for Beam Python #870

Open ibzib opened 4 years ago

ibzib commented 4 years ago

The GCP Flink operator has config and instructions for how to run a Beam Python job: https://github.com/GoogleCloudPlatform/flink-on-k8s-operator/blob/master/docs/beam_guide.md

We should be able to port these instructions to use the Spark operator.

ibzib commented 4 years ago

(I also intend to do this myself at some point, but I don't have permission to change Assignees.)

liyinan926 commented 4 years ago

Assigned this to you @ibzib.

Jeffwan commented 4 years ago

Yeah, this would be helpful. Looking forward to it! The Flink operator actually has some limitations and known issues right now, and even Spark supports only a few operations. It's worth adding this support.

adux6991 commented 3 years ago

Hi @ibzib, thanks for your team's great work. Could you please share some thoughts on how to run a Beam Python job on top of a SparkApplication? I am working on it but couldn't find a centralized "spark master" where I could submit the job via the Beam Python SDK:

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

options = PipelineOptions([
    "--runner=SparkRunner",
    "--spark_master_url=SPARK_MASTER_URL",
])

# PipelineOptions must be passed as a keyword argument; the first
# positional argument of beam.Pipeline is the runner.
with beam.Pipeline(options=options) as p:
    ...

ibzib commented 3 years ago

Hi @adux6991, cool to see interest in this. Feel free to reassign yourself to this issue - I am always willing to provide support but I don't think I'll have the time to actually implement this myself any time soon.

The usual process

You start a Spark cluster using the instructions here: https://spark.apache.org/docs/latest/spark-standalone.html#starting-a-cluster-manually Then copy the URL that is printed out when starting the master. It is usually spark://localhost:7077 by default.

Alternatively, if you set --spark_master_url=local[*], Beam will automatically start a transient Spark cluster, run the job on it, and shut it down when the job is finished.
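
Putting that together, a minimal sketch of the pipeline side looks like this (the master URL is a placeholder; swap in local[*] for the transient-cluster variant, and LOOPBACK is only intended for local testing):

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

# Placeholder master URL; use "local[*]" to let Beam manage a transient
# local Spark instead. LOOPBACK runs the SDK harness in the launching
# process, which is convenient for local testing.
options = PipelineOptions([
    "--runner=SparkRunner",
    "--spark_master_url=spark://localhost:7077",
    "--environment_type=LOOPBACK",
])

with beam.Pipeline(options=options) as p:
    (p
     | "Create" >> beam.Create(["hello", "beam", "on", "spark"])
     | "Upper" >> beam.Map(str.upper)
     | "Print" >> beam.Map(print))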

For spark-on-k8s-operator

To be honest, I don't know much about this project, so I don't know whether the Spark master URL is meant to be exposed to the user directly. A quick grep shows that the sparkapplication controller does know the master URL, though: https://github.com/GoogleCloudPlatform/spark-on-k8s-operator/blob/cf3446097f47047ab46a076ebdb1f71ee9428f84/pkg/controller/sparkapplication/submission.go#L201

Alternatively, if you can't find an exposed master URL, we can always go down a layer and construct a jar using --output_executable_path and then create a SparkApplication using that jar as its mainApplicationFile.
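
To sketch that route (flag names as I understand Beam's portable options; the jar path is a placeholder):

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

# Rather than submitting to a live master, ask Beam to package the
# pipeline into a self-contained executable jar at the given path.
options = PipelineOptions([
    "--runner=SparkRunner",
    "--output_executable_path=/tmp/beam-pipeline.jar",
])

with beam.Pipeline(options=options) as p:
    p | beam.Create([1, 2, 3]) | beam.Map(lambda x: x * x) | beam.Map(print)

The resulting jar would then be referenced as the SparkApplication's mainApplicationFile.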

luizm commented 3 years ago

It would be great. I'm starting to look into this now.

@adux6991 and @ibzib Is there any progress on this topic?

ibzib commented 3 years ago

I started working on this in my free time. I went with the --output_executable_path approach I described in my last update. I was able to submit the application successfully; however, the executor was not able to connect to the worker pool sidecar for some reason. Not sure why, since the equivalent works fine in the Flink example.
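
For anyone who wants to pick this up: translated from the Flink guide, the worker-pool pattern on the pipeline side looks roughly like the sketch below (localhost:50000 assumes the sidecar's default worker pool port; the jar path is a placeholder).

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

# EXTERNAL environment: the Python SDK harness is not started by the
# runner but provided by a worker pool sidecar running next to each
# Spark executor; localhost:50000 is the pool's default address in-pod.
options = PipelineOptions([
    "--runner=SparkRunner",
    "--output_executable_path=/tmp/beam-pipeline.jar",
    "--environment_type=EXTERNAL",
    "--environment_config=localhost:50000",
])

with beam.Pipeline(options=options) as p:
    p | beam.Create(["smoke", "test"]) | beam.Map(print)

The sidecar itself is the Beam Python SDK container started in --worker_pool mode, the same setup the Flink guide uses.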

github-actions[bot] commented 2 weeks ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.