kubeflow / spark-operator

Kubernetes operator for managing the lifecycle of Apache Spark applications on Kubernetes.
Apache License 2.0

Spark properties related to deployment are not applied #1109

Open pgillet opened 3 years ago

pgillet commented 3 years ago

Some Spark properties relate to deployment and, with "native" Spark, are typically set through a configuration file or spark-submit command-line options. These properties are not applied if passed directly to .spec.sparkConf in the SparkApplication custom resource. Indeed, .spec.sparkConf is only intended for properties that affect Spark runtime control, such as spark.task.maxFailures.

Example: Setting spark.executor.instances in .spec.sparkConf will not affect the number of executors. Instead, the field .spec.executor.instances must be set in the SparkApplication YAML, as illustrated below.
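A minimal sketch of this mismatch follows; the application name, image, main class, and jar path are placeholders borrowed from the operator's SparkPi example and are only here for illustration:

```yaml
apiVersion: sparkoperator.k8s.io/v1beta2
kind: SparkApplication
metadata:
  name: spark-pi
spec:
  type: Scala
  mode: cluster
  image: gcr.io/spark-operator/spark:v3.1.1          # placeholder image
  mainClass: org.apache.spark.examples.SparkPi
  mainApplicationFile: local:///opt/spark/examples/jars/spark-examples_2.12-3.1.1.jar
  sparkVersion: "3.1.1"
  sparkConf:
    spark.task.maxFailures: "8"      # runtime-control property: applied
    spark.executor.instances: "5"    # deployment property: silently ignored here
  driver:
    cores: 1
    memory: "512m"
  executor:
    instances: 5                     # this field is what the operator actually honors
    cores: 1
    memory: "512m"
```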

It would be nice if we could set or override such properties in .spec.sparkConf. We could then easily "templatize" a SparkApplication and set runtime parameters with Spark semantics. In other words, we should be able to shift freely between native Spark and Spark Operator semantics.

The affected properties that I have identified so far:

I do not know whether spark.submit.pyFiles and spark.jars are also affected. If they are, it is a problem, because these properties are multi-valued: .spec.deps.pyFiles must be an array of strings, while the Spark property is a single string of comma-separated Python dependencies, so it is not easy to map the Spark semantics onto the Spark Operator logic (see the sketch below).
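As a hypothetical illustration of that mismatch (the file URIs are made up), the same Python dependencies would be expressed differently under each of the two semantics:

```yaml
spec:
  sparkConf:
    # Native Spark semantics: one flat, comma-separated string
    spark.submit.pyFiles: "s3a://my-bucket/dep1.py,s3a://my-bucket/dep2.zip"
  deps:
    # Spark Operator semantics: an array of strings
    pyFiles:
      - s3a://my-bucket/dep1.py
      - s3a://my-bucket/dep2.zip
```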

github-actions[bot] commented 1 week ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.