kubeflow / arena

A CLI for Kubeflow.
Apache License 2.0
730 stars 177 forks

Failed to submit spark training job as driver service account is not configurable #1111

Closed ChenYi015 closed 1 month ago

ChenYi015 commented 1 month ago

I tried to submit a Spark training job, following "Submit a distributed spark job" in the Arena documentation.

arena submit sparkjob \
   --name=sparktest \
   --image=registry.aliyuncs.com/acs/spark-pi:ack-2.4.5-latest \
   --main-class=org.apache.spark.examples.SparkPi \
   --jar=local:///opt/spark/examples/jars/spark-examples_2.11-2.4.5.jar

It fails with an error message related to the ServiceAccount:

Message: Forbidden!Configured service account doesn't have access. Service account may have been revoked. pods \"sparktest-driver\" is forbidden: error looking up service account default/spark: serviceaccount \"spark\" not found.

And arena submit spark has no flag for configuring the ServiceAccount:

$ bin/arena submit spark --help
Submit a common spark application job.

Usage:
  arena submit sparkjob [flags]

Aliases:
  sparkjob, spark

Flags:
  -a, --annotation stringArray           the annotations, usage: "--annotation=key=value" or "--annotation key=value"
      --driver-cpu-request int           cpu request for driver pod (default 1)
      --driver-memory-request string     memory request for driver pod (min is 500m) (default "500m")
      --executor-cpu-request int         cpu request for executor pod (default 1)
      --executor-memory-request string   memory request for executor pod (min is 500m) (default "500m")
  -h, --help                             help for sparkjob
      --image string                     the docker image name of training job (default "registry.aliyuncs.com/acs/spark:v2.4.0")
      --jar string                       jar path in image (default "local:///opt/spark/examples/jars/spark-examples_2.11-2.4.0.jar")
  -l, --label stringArray                specify the label
      --main-class string                main class of your jar (default "org.apache.spark.examples.SparkPi")
      --name string                      override name
      --replicas int                     the executor's number to run the distributed training. (default 1)
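
Until the driver service account is configurable, one possible workaround is to create the account the error message asks for. The sketch below is not from the Arena documentation; it assumes the driver pod is hardwired to a service account named spark in the default namespace (as the error above suggests) and that binding the built-in edit ClusterRole is enough for the driver to create executor pods. The file name spark-rbac.yaml and the binding name spark-role are arbitrary choices.

```shell
# Write a manifest that creates the "spark" service account in the
# "default" namespace and lets it manage pods there.
# Assumption: the built-in "edit" ClusterRole grants sufficient permissions.
cat > spark-rbac.yaml <<'EOF'
apiVersion: v1
kind: ServiceAccount
metadata:
  name: spark
  namespace: default
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: spark-role
  namespace: default
subjects:
  - kind: ServiceAccount
    name: spark
    namespace: default
roleRef:
  kind: ClusterRole
  name: edit
  apiGroup: rbac.authorization.k8s.io
EOF
```

The manifest can then be applied with kubectl apply -f spark-rbac.yaml before resubmitting the job. If the job runs in a different namespace, the namespace fields would need to change accordingly.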

Syulin7 commented 1 month ago

@ChenYi015 Thanks for your great contribution!