stackabletech / spark-k8s-operator

Operator for Apache Spark-on-Kubernetes for Stackable Data Platform
https://stackable.tech

CRDs to cover all major spark configuration properties. #1

Closed: adwk67 closed this issue 5 months ago

adwk67 commented 2 years ago

Summary

As a user of Spark-on-Kubernetes, I want to be able to use CRDs that cover all major Spark configuration properties.

Detail

The initial spike (in the spark-submit-wrapper branch) parses the CRD to build a spark-submit command, which is then executed as a Job issued from the controller. Spark executes this Job by passing it through the entrypoint.sh script in the Spark Docker image used for spawning the driver and executors: it first creates the driver (living in its own pod), which in turn creates temporary executors. Once the job is completed, the driver and the Job that created it are still visible:

[screenshot: completed driver pod and the Job that created it still listed after the run]
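
For illustration, a minimal sketch of what such a controller-issued Job could look like, assuming a Spark image that ships entrypoint.sh and spark-submit under /opt/spark. The image name, application jar, service account, and all other concrete values are assumptions, not the spike's actual output; the --master placeholder is picked up again further down the thread:

  apiVersion: batch/v1
  kind: Job
  metadata:
    name: spark-pi-submit            # hypothetical name derived from the CRD
  spec:
    template:
      spec:
        serviceAccountName: spark    # needs RBAC to create driver/executor pods
        restartPolicy: Never
        containers:
          - name: spark-submit
            image: docker.stackable.tech/stackable/spark-k8s:3.2.1  # assumed image
            command: ["/bin/sh", "-c"]
            args:
              - >
                /opt/spark/bin/spark-submit
                --master k8s://https://<api-server>:<port>
                --deploy-mode cluster
                --name spark-pi
                --class org.apache.spark.examples.SparkPi
                local:///opt/spark/examples/jars/spark-examples.jar

Running spark-submit from a Job rather than from the controller process itself keeps the submission visible as a regular Kubernetes object, which is also why the Job lingers after completion as shown above.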

Implementation

This issue covers the following:

Update 2022-05-16

Remaining options are marked in this comment with TODO.

fhennig commented 2 years ago

Uh oh, what are the major ones? :smile:

adwk67 commented 2 years ago

I would suggest we start with the ones that spark-submit -h lists, minus things like the Kerberos keytab, Maven repositories, etc. (a rough sketch of a possible CRD mapping follows the list):

  --master MASTER_URL         k8s://https://host:port
  --deploy-mode DEPLOY_MODE   Whether to launch the driver program locally ("client") or
                              on one of the worker machines inside the cluster ("cluster")
                              (Default: client).
  --class CLASS_NAME          Your application's main class (for Java / Scala apps).
  --name NAME                 A name of your application.
  --jars JARS                 Comma-separated list of jars to include on the driver
                              and executor classpaths.
  --packages                  Comma-separated list of maven coordinates of jars to include
                              on the driver and executor classpaths. Will search the local
                              maven repo, then maven central and any additional remote
                              repositories given by --repositories. The format for the
                              coordinates should be groupId:artifactId:version.
  --exclude-packages          Comma-separated list of groupId:artifactId, to exclude while
                              resolving the dependencies provided in --packages to avoid
                              dependency conflicts.
  --py-files PY_FILES         Comma-separated list of .zip, .egg, or .py files to place
                              on the PYTHONPATH for Python apps.
  --files FILES               Comma-separated list of files to be placed in the working
                              directory of each executor. File paths of these files
                              in executors can be accessed via SparkFiles.get(fileName).
  --archives ARCHIVES         Comma-separated list of archives to be extracted into the
                              working directory of each executor.
  --conf, -c PROP=VALUE       Arbitrary Spark configuration property.
  --properties-file FILE      Path to a file from which to load extra properties. If not
                              specified, this will look for conf/spark-defaults.conf.
  --driver-memory MEM         Memory for driver (e.g. 1000M, 2G) (Default: 1024M).
  --driver-java-options       Extra Java options to pass to the driver.
  --driver-library-path       Extra library path entries to pass to the driver.
  --driver-class-path         Extra class path entries to pass to the driver. Note that
                              jars added with --jars are automatically included in the
                              classpath.
  --executor-memory MEM       Memory per executor (e.g. 1000M, 2G) (Default: 1G).
  --driver-cores NUM          Number of cores used by the driver, only in cluster mode
                              (Default: 1).
  --total-executor-cores NUM  Total cores for all executors.
  --executor-cores NUM        Number of cores used by each executor. (Default: 1 in
                              YARN and K8S modes, or all available cores on the worker
                              in standalone mode).
  --num-executors NUM         Number of executors to launch (Default: 2).
                              If dynamic allocation is enabled, the initial number of
                              executors will be at least NUM.
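
To make the mapping concrete, here is a rough, hypothetical sketch of how these options could surface as fields of a SparkApplication CRD; the group/version and every field name below are illustrative assumptions, not a final schema:

  apiVersion: spark.stackable.tech/v1alpha1        # assumed group/version
  kind: SparkApplication
  metadata:
    name: spark-pi
  spec:
    mode: cluster                                  # --deploy-mode
    mainClass: org.apache.spark.examples.SparkPi   # --class
    mainApplicationFile: local:///opt/spark/examples/jars/spark-examples.jar
    deps:
      jars: []                                     # --jars
      packages: []                                 # --packages
      excludePackages: []                          # --exclude-packages
      repositories: []                             # --repositories
    sparkConf:                                     # --conf / -c PROP=VALUE
      spark.kubernetes.container.image.pullPolicy: IfNotPresent
    driver:
      cores: 1                                     # --driver-cores
      memory: 1024m                                # --driver-memory
    executor:
      instances: 2                                 # --num-executors
      cores: 1                                     # --executor-cores
      memory: 1g                                   # --executor-memory

Splitting driver and executor settings into nested objects mirrors how spark-submit already separates its flags, and leaves room for Kubernetes-specific settings (resources, volumes) later.
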
sbernauer commented 2 years ago

I would suggest including the following information in the CRD. It doesn't need to be implemented for the first version; this is just for the record, to keep it in mind when designing the CRD.

I'm also happy to create issues out of this if you prefer ;)

adwk67 commented 2 years ago

> I'm also happy to create issues out of this if you prefer ;)

Thanks for adding that detail, Sebastian! Yes, I think that would be better as a separate issue (or issues) so we can keep things manageable.

soenkeliebau commented 2 years ago

Just one comment that occurred to me while reading: do we want to make '--master MASTER_URL' configurable? Shouldn't that be an implementation detail, since we are the ones who know best where the master is?

adwk67 commented 2 years ago

Yes, you're right. In fact, in the spike I ignored --master and read the API server host/port directly from the environment variables in the pod.
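
For the record, a minimal sketch of how that looks in the submit container, relying on the KUBERNETES_SERVICE_HOST and KUBERNETES_SERVICE_PORT variables that Kubernetes injects into every pod; the surrounding fields are the same assumptions as in the Job sketch above:

  # Fragment of the hypothetical submit container: the shell (not Kubernetes)
  # expands the injected API server variables at runtime.
  command: ["/bin/sh", "-c"]
  args:
    - >
      /opt/spark/bin/spark-submit
      --master "k8s://https://${KUBERNETES_SERVICE_HOST}:${KUBERNETES_SERVICE_PORT}"
      --deploy-mode cluster
      local:///opt/spark/examples/jars/spark-examples.jar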

adwk67 commented 2 years ago

Update:

  --master MASTER_URL         done: read from env vars
  --deploy-mode DEPLOY_MODE   done: leave as cluster mode for the time being
  --class CLASS_NAME          done
  --name NAME                 done
  --jars JARS                 done: made available via PV/PVCs and spark.{driver|executor}.extraClassPath (sketch below)
  --packages                  done (for python packages)
  --conf, -c PROP=VALUE       done
  --driver-memory MEM         done
  --executor-memory MEM       done
  --driver-cores NUM          done
  --total-executor-cores NUM  done
  --executor-cores NUM        done
  --num-executors NUM         done
  --repositories              done

  --driver-java-options       TODO
  --driver-library-path       TODO
  --driver-class-path         TODO
  --properties-file FILE      TODO
  --exclude-packages          TODO
  --py-files PY_FILES         TODO
  --files FILES               TODO
  --archives ARCHIVES         TODO
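
On the --jars entry above, a minimal sketch of the idea, assuming the jars are provisioned onto a PVC mounted at /dependencies in both the driver and executor pods; the mount path and wildcard are illustrative:

  # Java classpath wildcards pick up every jar in the mounted directory.
  sparkConf:
    spark.driver.extraClassPath: /dependencies/jars/*
    spark.executor.extraClassPath: /dependencies/jars/*
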
razvan commented 1 year ago

I propose to close this issue and implement the remaining args, if necessary, on a case-by-case basis.