Uh oh, what are the major ones? :smile:
I would suggest we start with the ones that `spark-submit -h` lists, minus things like Kerberos keytabs, Maven repositories, etc.:
```
  --master MASTER_URL          k8s://https://host:port
  --deploy-mode DEPLOY_MODE    Whether to launch the driver program locally ("client") or
                               on one of the worker machines inside the cluster ("cluster")
                               (Default: client).
  --class CLASS_NAME           Your application's main class (for Java / Scala apps).
  --name NAME                  A name of your application.
  --jars JARS                  Comma-separated list of jars to include on the driver
                               and executor classpaths.
  --packages                   Comma-separated list of maven coordinates of jars to include
                               on the driver and executor classpaths. Will search the local
                               maven repo, then maven central and any additional remote
                               repositories given by --repositories. The format for the
                               coordinates should be groupId:artifactId:version.
  --exclude-packages           Comma-separated list of groupId:artifactId, to exclude while
                               resolving the dependencies provided in --packages to avoid
                               dependency conflicts.
  --py-files PY_FILES          Comma-separated list of .zip, .egg, or .py files to place
                               on the PYTHONPATH for Python apps.
  --files FILES                Comma-separated list of files to be placed in the working
                               directory of each executor. File paths of these files
                               in executors can be accessed via SparkFiles.get(fileName).
  --archives ARCHIVES          Comma-separated list of archives to be extracted into the
                               working directory of each executor.
  --conf, -c PROP=VALUE        Arbitrary Spark configuration property.
  --properties-file FILE       Path to a file from which to load extra properties. If not
                               specified, this will look for conf/spark-defaults.conf.
  --driver-memory MEM          Memory for driver (e.g. 1000M, 2G) (Default: 1024M).
  --driver-java-options        Extra Java options to pass to the driver.
  --driver-library-path        Extra library path entries to pass to the driver.
  --driver-class-path          Extra class path entries to pass to the driver. Note that
                               jars added with --jars are automatically included in the
                               classpath.
  --executor-memory MEM        Memory per executor (e.g. 1000M, 2G) (Default: 1G).
  --driver-cores NUM           Number of cores used by the driver, only in cluster mode
                               (Default: 1).
  --total-executor-cores NUM   Total cores for all executors.
  --executor-cores NUM         Number of cores used by each executor. (Default: 1 in
                               YARN and K8S modes, or all available cores on the worker
                               in standalone mode).
  --num-executors NUM          Number of executors to launch (Default: 2).
                               If dynamic allocation is enabled, the initial number of
                               executors will be at least NUM.
```
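One consideration for the CRD design: nearly all of these flags are shorthand for `spark.*` configuration properties, so a generic key/value mechanism would cover most of them. A minimal sketch of the equivalence (the jar name and values are illustrative, not from the spike):

```bash
# These two invocations are equivalent: the convenience flags map onto
# spark.* properties (spark.driver.memory, spark.executor.memory,
# spark.executor.instances), per the Spark configuration docs.
spark-submit --driver-memory 2G --executor-memory 4G --num-executors 3 app.jar

spark-submit \
  --conf spark.driver.memory=2G \
  --conf spark.executor.memory=4G \
  --conf spark.executor.instances=3 \
  app.jar
```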
I would suggest including the following information in the CRD. It doesn't need to be implemented for the first version; it's just for the record, and to keep in mind when designing the CRD.
I'm also happy to create issues out of this if you prefer ;)
Thanks for adding that detail, Sebastian! Yes, I think that would be better in a separate issue/issues so we can keep things manageable.
Just one comment that occurred to me while reading: do we want to make `--master MASTER_URL` configurable? Shouldn't that be an implementation detail, since we are the ones who know best where the master is?
Yes, you're right. In fact, in the spike I've ignored `--master` and read the API server host/port directly from the environment variables in the pod.
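For reference, Kubernetes injects the API server address into every pod as environment variables, so the master URL can be derived without any user input. A minimal sketch of the idea (not the spike's actual code):

```bash
# KUBERNETES_SERVICE_HOST and KUBERNETES_SERVICE_PORT are set by
# Kubernetes in every pod, so --master never needs to be user-facing.
MASTER_URL="k8s://https://${KUBERNETES_SERVICE_HOST}:${KUBERNETES_SERVICE_PORT}"
spark-submit --master "${MASTER_URL}" --deploy-mode cluster ...
```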
Update:

- `--master MASTER_URL`: done (read from env vars)
- `--deploy-mode DEPLOY_MODE`: done (leave as cluster mode for the time being)
- `--class CLASS_NAME`: done
- `--name NAME`: done
- `--jars JARS`: done (made available via PV/PVCs and `spark.{driver|executor}.extraClassPath`)
- `--packages`: done (for Python packages)
- `--conf, -c PROP=VALUE`: done
- `--driver-memory MEM`: done
- `--executor-memory MEM`: done
- `--driver-cores NUM`: done
- `--total-executor-cores NUM`: done
- `--executor-cores NUM`: done
- `--num-executors NUM`: done
- `--repositories`: done
- `--driver-java-options`: TODO
- `--driver-library-path`: TODO
- `--driver-class-path`: TODO
- `--properties-file FILE`: TODO
- `--exclude-packages`: TODO
- `--py-files PY_FILES`: TODO
- `--files FILES`: TODO
- `--archives ARCHIVES`: TODO
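One observation in support of closing: most of the remaining TODO flags have `spark.*` property equivalents (documented in the Spark configuration reference), so they are already expressible via the generic `--conf` support. A sketch with illustrative values; `--properties-file` is the exception, as it has no single property equivalent:

```bash
# Flag-to-property equivalents, usable today via --conf:
#   --driver-java-options -> spark.driver.extraJavaOptions
#   --driver-library-path -> spark.driver.extraLibraryPath
#   --driver-class-path   -> spark.driver.extraClassPath
#   --exclude-packages    -> spark.jars.excludes
#   --py-files            -> spark.submit.pyFiles
#   --files               -> spark.files
#   --archives            -> spark.archives
spark-submit \
  --conf spark.driver.extraJavaOptions="-XX:+UseG1GC" \
  --conf spark.files=config.txt \
  --conf spark.archives=data.tar.gz \
  app.jar
```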
I propose closing this issue and implementing the remaining args, if necessary, on a case-by-case basis.
Summary
As a user of spark-on-k8s, I want to be able to use CRDs that cover all major Spark configuration properties.
Detail
The initial spike (in the `spark-submit-wrapper` branch) parses the CRD to build a `spark-submit` command, which is then executed as a Job issued from the controller. This Job is executed by Spark (by passing the command through the `entrypoint.sh` script in the Spark Docker image used for spawning the driver and executors): it first creates the driver (living in its own pod), which then creates temporary executors. Once the job is completed, the driver and the Job that created it are still visible.
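To make that flow concrete, here is a sketch of the kind of command the wrapper could assemble from the CRD; the name, class, image, and jar path are hypothetical, not taken from the spike:

```bash
# Assembled by the controller from CRD fields and run as a Job via the
# Spark image's entrypoint.sh; the driver pod then requests executors.
/opt/spark/bin/spark-submit \
  --master "k8s://https://${KUBERNETES_SERVICE_HOST}:${KUBERNETES_SERVICE_PORT}" \
  --deploy-mode cluster \
  --name my-spark-app \
  --class org.example.MyApp \
  --conf spark.executor.instances=2 \
  --conf spark.kubernetes.container.image=spark:latest \
  local:///opt/app/my-app.jar
```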
Implementation
This issue covers the following:
Update 2022-05-16
Remaining options are marked in this comment with `TODO`.