ibzib opened 4 years ago
(I also intend to do this myself at some point, but I don't have permission to change Assignees.)
Assigned this to you @ibzib.
Yeah, this would be helpful. Looking forward to it! The Flink operator actually has some limitations and known issues right now, and even Spark supports only a few operations. It's worth adding this support.
Hi @ibzib, thanks for your team's great work. Could you please share some thoughts on how to run a Beam Python job on top of a SparkApplication? I am working on it but couldn't find a centralized "spark master" where I could submit a job via the Beam Python SDK:
```python
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

options = PipelineOptions([
    "--runner=SparkRunner",
    "--spark_master_url=SPARK_MASTER_URL",
])

with beam.Pipeline(options=options) as p:
    ...
```
Hi @adux6991, cool to see interest in this. Feel free to reassign yourself to this issue - I am always willing to provide support but I don't think I'll have the time to actually implement this myself any time soon.
You can start a Spark cluster using the instructions here: https://spark.apache.org/docs/latest/spark-standalone.html#starting-a-cluster-manually

Then copy the URL that is printed out when the master starts; it is `spark://localhost:7077` by default.
Alternatively, if you set `--spark_master_url=local[*]`, Beam will automatically start a transient Spark cluster, run the job on it, and shut it down when the job finishes.
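For concreteness, here is a minimal sketch of both variants, assuming the master is the default `spark://localhost:7077` printed by the standalone startup scripts (adjust host and port to your setup):

```python
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

# Point at the standalone master copied from the startup logs, or use
# local[*] to have Beam start a transient Spark cluster instead.
options = PipelineOptions([
    "--runner=SparkRunner",
    "--spark_master_url=spark://localhost:7077",  # or: local[*]
])

with beam.Pipeline(options=options) as p:
    p | beam.Create(["hello", "world"]) | beam.Map(print)
```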
To be honest, I don't know much about this project, so I don't know if the Spark master URL is meant to be exposed to the user directly. A quick grep shows that the sparkapplication implementation knows it, though: https://github.com/GoogleCloudPlatform/spark-on-k8s-operator/blob/cf3446097f47047ab46a076ebdb1f71ee9428f84/pkg/controller/sparkapplication/submission.go#L201
Alternatively, if you can't find an exposed master URL, we can always go down a layer and construct a jar using `--output_executable_path`, then create a SparkApplication using that jar as its `mainApplicationFile` (see the sketch below).
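A hedged sketch of that approach; the output path here is illustrative, not a convention:

```python
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

# Instead of submitting to a master, write a self-executing jar to disk.
# The path below is illustrative.
options = PipelineOptions([
    "--runner=SparkRunner",
    "--output_executable_path=/tmp/beam-spark-pipeline.jar",
])

with beam.Pipeline(options=options) as p:
    p | beam.Create(["hello", "world"]) | beam.Map(print)
```

The SparkApplication spec would then reference wherever that jar is uploaded as its `mainApplicationFile`.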
That would be great. I'm starting to look into this now.
@adux6991 and @ibzib Is there any progress on this topic?
I started working on this in my free time. I went with the `--output_executable_path` approach I described in my last update. I was able to submit the application successfully; however, the executor was not able to connect to the worker pool sidecar for some reason. Not sure why, since the equivalent works fine in the Flink example.
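For anyone retracing this, the executor reaches the worker pool sidecar through Beam's external environment. A minimal sketch of the options involved, assuming the sidecar runs in the same pod and listens on port 50000 as in the Flink guide:

```python
from apache_beam.options.pipeline_options import PipelineOptions

# With environment_type=EXTERNAL, the Spark executor asks the worker pool
# sidecar (rather than a Docker container) to run the Python SDK harness.
# localhost:50000 assumes the sidecar is a same-pod container on port 50000.
options = PipelineOptions([
    "--runner=SparkRunner",
    "--output_executable_path=/tmp/beam-spark-pipeline.jar",
    "--environment_type=EXTERNAL",
    "--environment_config=localhost:50000",
])
```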
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
The GCP Flink operator has config and instructions for how to run a Beam Python job: https://github.com/GoogleCloudPlatform/flink-on-k8s-operator/blob/master/docs/beam_guide.md
We should be able to port these instructions to use the Spark operator.