kubeflow / spark-operator

Kubernetes operator for managing the lifecycle of Apache Spark applications on Kubernetes.
Apache License 2.0

Spark 2.3.1 Support #224

Closed ryancampbell closed 6 years ago

ryancampbell commented 6 years ago

Hello, we chatted in Slack as well. My team has been trying to switch to Spark 2.3.1 using this guide: https://github.com/GoogleCloudPlatform/spark-on-k8s-operator/blob/master/docs/user-guide.md#customizing-the-spark-operator

After completing the steps below, every driver pod immediately fails with "Error: Could not find or load main class "

Possibly we are missing a step? The steps we followed are listed below. Assistance would be appreciated, as 2.3.1 fixes a bug in dynamic partition overwrite mode.

Download the Spark source code: https://archive.apache.org/dist/spark/spark-2.3.1/spark-2.3.1.tgz

Compile Spark with Kubernetes support:

    ./build/mvn -Pkubernetes -DskipTests clean package

Build and push the Spark 2.3.1 Docker image:

    ./bin/docker-image-tool.sh -r gcr.io/uncoil-io/spark -t v2.3.1 build
    ./bin/docker-image-tool.sh -r gcr.io/uncoil-io/spark -t v2.3.1 push

Clone https://github.com/GoogleCloudPlatform/spark-on-k8s-gcp-examples

Copy the conf folder to dockerfiles/spark-gcs

Edit FROM in dockerfiles/spark-gcs/Dockerfile to:

    FROM gcr.io/uncoil-io/spark/spark:v2.3.1

Run:

    gcloud auth configure-docker

From spark-on-k8s-gcp-examples/dockerfiles/spark-gcs run:

    docker build . -t gcr.io/uncoil-io/spark:v2.3.1-gcs
    docker push gcr.io/uncoil-io/spark:v2.3.1-gcs

Clone https://github.com/GoogleCloudPlatform/spark-on-k8s-operator. Check out either master or the v1alpha-0.2-2.3.x tag, whichever works in the end.

Edit the second FROM in spark-on-k8s-operator/Dockerfile to be:

    FROM gcr.io/uncoil-io/spark:v2.3.1

Run in spark-on-k8s-operator:

    docker build . -t gcr.io/uncoil-io/spark-operator:v1alpha1-0.2-2.3.1
    docker push gcr.io/uncoil-io/spark-operator:v1alpha1-0.2-2.3.1

Edit spark-on-k8s-operator/manifest/spark-operator.yaml and set image to gcr.io/uncoil-io/spark-operator:v1alpha1-0.2-2.3.1

Delete the sparkoperator namespace:

    kubectl delete namespace sparkoperator

Delete any sparkapplications as well:

    kubectl delete sparkapplications --all
    kubectl delete scheduledsparkapplications --all

Wait a few minutes, then:

    kubectl apply -f spark-on-k8s-operator/manifest/

Watch for the operator pod to become ready:

    kubectl get pods -w --namespace sparkoperator

Edit app.template.yaml and uncoil-runner-yaml: set image to gcr.io/uncoil-io/spark:v2.3.1-gcs, and replace 2.3.0 with 2.3.1 in the dependencies and the version label.

Now see if it works!
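
One quick sanity check for a "Could not find or load main class" error is to confirm that the application jar really contains the main class. The sketch below assumes the jar has been copied to the local machine; `jar_has_class` is a hypothetical helper name (not part of Spark or the operator), and the path in the usage comment is illustrative.

```python
import zipfile


def jar_has_class(jar_path, main_class):
    """Return True if the jar at jar_path contains the compiled main_class.

    A jar is a zip archive; a class such as "uncoil.UncoilRunner" is stored
    as the entry "uncoil/UncoilRunner.class".
    """
    entry = main_class.replace(".", "/") + ".class"
    with zipfile.ZipFile(jar_path) as jar:
        return entry in jar.namelist()


# Example (the local path is an assumption):
# jar_has_class("uncoil-job.jar", "uncoil.UncoilRunner")
```

If this returns True for the local jar, the class exists, and the more likely explanation is that the jar never reached the driver's classpath.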

liyinan926 commented 6 years ago

> Edit spark-on-k8s-operator/Dockerfile second FROM to be: FROM gcr.io/uncoil-io/spark:v2.3.1

Should you be using FROM gcr.io/uncoil-io/spark:v2.3.1-gcs instead of FROM gcr.io/uncoil-io/spark:v2.3.1 in your spark-on-k8s-operator/Dockerfile?

What do you use spark:v2.3.1-gcs for? Is there any application dependency to be downloaded from GCS?

ryancampbell commented 6 years ago

@liyinan926 I'll try that again, although I believe I tried both.

I used "gcr.io/uncoil-io/spark:v2.3.1" instead of "gcr.io/uncoil-io/spark:v2.3.1-gcs" because the original Dockerfile used "gcr.io/ynli-k8s/spark:v2.3.0" and not "gcr.io/ynli-k8s/spark:v2.3.0-gcs"

liyinan926 commented 6 years ago

Can you post your SparkApplication spec here?

ryancampbell commented 6 years ago
apiVersion: "sparkoperator.k8s.io/v1alpha1"
kind: ScheduledSparkApplication
metadata:
  name: uncoil-runner
spec:
  schedule: "@every 5m"
  concurrencyPolicy: Forbid
  template:
    type: Scala
    mode: cluster
    image: gcr.io/uncoil-io/spark:v2.3.1-gcs
    mainClass: uncoil.UncoilRunner
    mainApplicationFile: gs://uncoil-artifacts/uncoil-job/uncoil-job.jar
    deps:
      jars:
        - http://central.maven.org/maven2/org/apache/commons/commons-pool2/2.5.0/commons-pool2-2.5.0.jar
        - http://central.maven.org/maven2/redis/clients/jedis/2.9.0/jedis-2.9.0.jar
        - http://dl.bintray.com/spark-packages/maven/RedisLabs/spark-redis/0.3.2/spark-redis-0.3.2.jar
        - http://central.maven.org/maven2/org/apache/kafka/kafka-clients/0.10.0.1/kafka-clients-0.10.0.1.jar
        - http://central.maven.org/maven2/org/apache/spark/spark-sql-kafka-0-10_2.11/2.3.1/spark-sql-kafka-0-10_2.11-2.3.1.jar
        - http://central.maven.org/maven2/org/apache/spark/spark-sql_2.11/2.3.1/spark-sql_2.11-2.3.1.jar
        - http://central.maven.org/maven2/com/typesafe/config/1.3.2/config-1.3.2.jar
        - http://central.maven.org/maven2/mysql/mysql-connector-java/8.0.11/mysql-connector-java-8.0.11.jar
    imagePullPolicy: Always
    hadoopConf:
      "fs.gs.project.id": "uncoil-io"
      "fs.gs.system.bucket": "uncoil-spark-production"
      "google.cloud.auth.service.account.enable": "true"
      "google.cloud.auth.service.account.json.keyfile": "/mnt/secrets/spark-sa.json"
    driver:
      cores: 1
      memory: 3g
      labels:
        version: 2.3.1
      serviceAccount: spark-sa
      secrets:
      - name: "spark-sa"
        path: "/mnt/secrets"
        secretType: GCPServiceAccount
      envVars:
        GCS_PROJECT_ID: uncoil-io
        SCALA_ENV: production
    executor:
      instances: 2
      cores: 2
      memory: 10g
      labels:
        versions: 2.3.1
      secrets:
      - name: "spark-sa"
        path: "/mnt/secrets"
        secretType: GCPServiceAccount
      envVars:
        GCS_PROJECT_ID: uncoil-io
        SCALA_ENV: production

liyinan926 commented 6 years ago

Can you run kubectl logs -c spark-init <driver pod name>? This will give you the init container logs. The main application file gs://uncoil-artifacts/uncoil-job/uncoil-job.jar needs to be downloaded by the init container from GCS first before the driver starts running.

ryancampbell commented 6 years ago
++ id -u
+ myuid=0
++ id -g
+ mygid=0
++ getent passwd 0
+ uidentry=root:x:0:0:root:/root:/bin/ash
+ '[' -z root:x:0:0:root:/root:/bin/ash ']'
+ SPARK_K8S_CMD=init
+ '[' -z init ']'
+ shift 1
+ SPARK_CLASSPATH=':/opt/spark/jars/*'
+ env
+ sort -t_ -k4 -n
+ sed 's/[^=]*=\(.*\)/\1/g'
+ grep SPARK_JAVA_OPT_
+ readarray -t SPARK_JAVA_OPTS
+ '[' -n '' ']'
+ '[' -n '' ']'
+ case "$SPARK_K8S_CMD" in
+ CMD=("$SPARK_HOME/bin/spark-class" "org.apache.spark.deploy.k8s.SparkPodInitContainer" "$@")
+ exec /sbin/tini -s -- /opt/spark/bin/spark-class org.apache.spark.deploy.k8s.SparkPodInitContainer /etc/spark-init/spark-init.properties
2018-07-19 18:12:27 INFO  SparkPodInitContainer:54 - Starting init-container to download Spark application dependencies.
2018-07-19 18:12:27 WARN  NativeCodeLoader:62 - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
2018-07-19 18:12:27 INFO  SecurityManager:54 - Changing view acls to: root
2018-07-19 18:12:27 INFO  SecurityManager:54 - Changing modify acls to: root
2018-07-19 18:12:27 INFO  SecurityManager:54 - Changing view acls groups to: 
2018-07-19 18:12:27 INFO  SecurityManager:54 - Changing modify acls groups to: 
2018-07-19 18:12:27 INFO  SecurityManager:54 - SecurityManager: authentication disabled; ui acls disabled; users  with view permissions: Set(root); groups with view permissions: Set(); users  with modify permissions: Set(root); groups with modify permissions: Set()
2018-07-19 18:12:28 INFO  SparkPodInitContainer:54 - Downloading remote jars: Some(http://central.maven.org/maven2/org/apache/commons/commons-pool2/2.5.0/commons-pool2-2.5.0.jar,http://central.maven.org/maven2/redis/clients/jedis/2.9.0/jedis-2.9.0.jar,http://dl.bintray.com/spark-packages/maven/RedisLabs/spark-redis/0.3.2/spark-redis-0.3.2.jar,http://central.maven.org/maven2/org/apache/kafka/kafka-clients/0.10.0.1/kafka-clients-0.10.0.1.jar,http://central.maven.org/maven2/org/apache/spark/spark-sql-kafka-0-10_2.11/2.3.1/spark-sql-kafka-0-10_2.11-2.3.1.jar,http://central.maven.org/maven2/org/apache/spark/spark-sql_2.11/2.3.1/spark-sql_2.11-2.3.1.jar,http://central.maven.org/maven2/com/typesafe/config/1.3.2/config-1.3.2.jar,http://central.maven.org/maven2/mysql/mysql-connector-java/8.0.11/mysql-connector-java-8.0.11.jar,gs://uncoil-artifacts/uncoil-job/uncoil-job-ryan.jar,gs://uncoil-artifacts/uncoil-job/uncoil-job-ryan.jar)
2018-07-19 18:12:28 INFO  SparkPodInitContainer:54 - Downloading remote files: None
2018-07-19 18:12:28 INFO  Utils:54 - Fetching http://central.maven.org/maven2/redis/clients/jedis/2.9.0/jedis-2.9.0.jar to /var/spark-data/spark-jars/fetchFileTemp8132912689783452936.tmp
2018-07-19 18:12:28 INFO  Utils:54 - Fetching http://central.maven.org/maven2/org/apache/kafka/kafka-clients/0.10.0.1/kafka-clients-0.10.0.1.jar to /var/spark-data/spark-jars/fetchFileTemp8108929517189477087.tmp
2018-07-19 18:12:28 INFO  Utils:54 - Fetching http://central.maven.org/maven2/org/apache/commons/commons-pool2/2.5.0/commons-pool2-2.5.0.jar to /var/spark-data/spark-jars/fetchFileTemp29997743187335354.tmp
2018-07-19 18:12:28 INFO  Utils:54 - Fetching http://central.maven.org/maven2/org/apache/spark/spark-sql-kafka-0-10_2.11/2.3.1/spark-sql-kafka-0-10_2.11-2.3.1.jar to /var/spark-data/spark-jars/fetchFileTemp918990109718368724.tmp
2018-07-19 18:12:28 INFO  Utils:54 - Fetching http://dl.bintray.com/spark-packages/maven/RedisLabs/spark-redis/0.3.2/spark-redis-0.3.2.jar to /var/spark-data/spark-jars/fetchFileTemp3021150405751723647.tmp
2018-07-19 18:12:28 INFO  Utils:54 - Fetching http://central.maven.org/maven2/org/apache/spark/spark-sql_2.11/2.3.1/spark-sql_2.11-2.3.1.jar to /var/spark-data/spark-jars/fetchFileTemp8195280120682746690.tmp
2018-07-19 18:12:28 INFO  Utils:54 - Fetching http://central.maven.org/maven2/com/typesafe/config/1.3.2/config-1.3.2.jar to /var/spark-data/spark-jars/fetchFileTemp120501099999836480.tmp
2018-07-19 18:12:28 INFO  Utils:54 - Fetching http://central.maven.org/maven2/mysql/mysql-connector-java/8.0.11/mysql-connector-java-8.0.11.jar to /var/spark-data/spark-jars/fetchFileTemp6532435163420374918.tmp
2018-07-19 18:12:28 INFO  GoogleHadoopFileSystemBase:637 - GHFS version: 1.9.2-hadoop2
2018-07-19 18:12:28 INFO  SparkPodInitContainer:54 - Finished downloading application dependencies.

FYI, I used a different jar path (uncoil-job-ryan.jar) when running this, but that file does exist.

liyinan926 commented 6 years ago

Interestingly, the logs didn't show any attempt to download gs://uncoil-artifacts/uncoil-job/uncoil-job.jar, although Downloading remote jars: Some(...) does include the jar.
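
The comparison described here (URIs listed in "Downloading remote jars: Some(...)" versus URIs that have a matching "Fetching ... to ..." line) can be done mechanically. This is a minimal sketch that assumes the init-container log has the same line shapes as the logs posted in this thread; `unfetched_dependencies` is a hypothetical helper and the sample log is shortened, made-up data.

```python
import re


def unfetched_dependencies(log_text):
    """Return announced URIs that have no matching 'Fetching' line in the log."""
    announced = []
    match = re.search(
        r"Downloading remote jars: Some\((.*?)\)\s*$", log_text, re.MULTILINE
    )
    if match:
        # Keep the announced order and drop duplicate entries.
        for uri in match.group(1).split(","):
            uri = uri.strip()
            if uri and uri not in announced:
                announced.append(uri)
    fetched = set(re.findall(r"Fetching (\S+) to ", log_text))
    return [uri for uri in announced if uri not in fetched]


# Shortened, illustrative log (not the real output from this issue).
SAMPLE_LOG = """\
INFO  SparkPodInitContainer:54 - Downloading remote jars: Some(http://example.com/a.jar,gs://example-bucket/app.jar,gs://example-bucket/app.jar)
INFO  Utils:54 - Fetching http://example.com/a.jar to /var/spark-data/spark-jars/fetchFileTemp1.tmp
INFO  SparkPodInitContainer:54 - Finished downloading application dependencies.
"""

print(unfetched_dependencies(SAMPLE_LOG))
```

Any URI this reports was announced by the init container but never fetched, which is the pattern seen for the gs:// jar in the 2.3.1 log.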

ryancampbell commented 6 years ago

Here is the log from the working 2.3.0 setup in production, for comparison:

++ id -u
+ myuid=0
++ id -g
+ mygid=0
++ getent passwd 0
+ uidentry=root:x:0:0:root:/root:/bin/ash
+ '[' -z root:x:0:0:root:/root:/bin/ash ']'
+ SPARK_K8S_CMD=init
+ '[' -z init ']'
+ shift 1
+ SPARK_CLASSPATH=':/opt/spark/jars/*'
+ sed 's/[^=]*=\(.*\)/\1/g'
+ grep SPARK_JAVA_OPT_
+ env
+ readarray -t SPARK_JAVA_OPTS
+ '[' -n '' ']'
+ '[' -n '' ']'
+ case "$SPARK_K8S_CMD" in
+ CMD=("$SPARK_HOME/bin/spark-class" "org.apache.spark.deploy.k8s.SparkPodInitContainer" "$@")
+ exec /sbin/tini -s -- /opt/spark/bin/spark-class org.apache.spark.deploy.k8s.SparkPodInitContainer /etc/spark-init/spark-init.properties
2018-07-19 18:16:31 INFO  SparkPodInitContainer:54 - Starting init-container to download Spark application dependencies.
2018-07-19 18:16:31 WARN  NativeCodeLoader:62 - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
2018-07-19 18:16:32 INFO  SecurityManager:54 - Changing view acls to: root
2018-07-19 18:16:32 INFO  SecurityManager:54 - Changing modify acls to: root
2018-07-19 18:16:32 INFO  SecurityManager:54 - Changing view acls groups to: 
2018-07-19 18:16:32 INFO  SecurityManager:54 - Changing modify acls groups to: 
2018-07-19 18:16:32 INFO  SecurityManager:54 - SecurityManager: authentication disabled; ui acls disabled; users  with view permissions: Set(root); groups with view permissions: Set(); users  with modify permissions: Set(root); groups with modify permissions: Set()
2018-07-19 18:16:32 INFO  SparkPodInitContainer:54 - Downloading remote jars: Some(http://central.maven.org/maven2/org/apache/commons/commons-pool2/2.5.0/commons-pool2-2.5.0.jar,http://central.maven.org/maven2/redis/clients/jedis/2.9.0/jedis-2.9.0.jar,http://dl.bintray.com/spark-packages/maven/RedisLabs/spark-redis/0.3.2/spark-redis-0.3.2.jar,http://central.maven.org/maven2/org/apache/kafka/kafka-clients/0.10.0.1/kafka-clients-0.10.0.1.jar,http://central.maven.org/maven2/org/apache/spark/spark-sql-kafka-0-10_2.11/2.3.0/spark-sql-kafka-0-10_2.11-2.3.0.jar,http://central.maven.org/maven2/org/apache/spark/spark-sql_2.11/2.3.0/spark-sql_2.11-2.3.0.jar,http://central.maven.org/maven2/com/typesafe/config/1.3.2/config-1.3.2.jar,http://central.maven.org/maven2/mysql/mysql-connector-java/8.0.11/mysql-connector-java-8.0.11.jar,gs://uncoil-artifacts/uncoil-job-production/uncoil-job-production.jar,gs://uncoil-artifacts/uncoil-job-production/uncoil-job-production.jar)
2018-07-19 18:16:32 INFO  SparkPodInitContainer:54 - Downloading remote files: None
2018-07-19 18:16:32 INFO  Utils:54 - Fetching http://central.maven.org/maven2/org/apache/kafka/kafka-clients/0.10.0.1/kafka-clients-0.10.0.1.jar to /var/spark-data/spark-jars/fetchFileTemp7252574674392297363.tmp
2018-07-19 18:16:32 INFO  Utils:54 - Fetching http://central.maven.org/maven2/redis/clients/jedis/2.9.0/jedis-2.9.0.jar to /var/spark-data/spark-jars/fetchFileTemp7516920988049583109.tmp
2018-07-19 18:16:32 INFO  Utils:54 - Fetching http://central.maven.org/maven2/org/apache/spark/spark-sql-kafka-0-10_2.11/2.3.0/spark-sql-kafka-0-10_2.11-2.3.0.jar to /var/spark-data/spark-jars/fetchFileTemp7809335549667476709.tmp
2018-07-19 18:16:32 INFO  Utils:54 - Fetching http://central.maven.org/maven2/org/apache/commons/commons-pool2/2.5.0/commons-pool2-2.5.0.jar to /var/spark-data/spark-jars/fetchFileTemp6980777093692990708.tmp
2018-07-19 18:16:32 INFO  Utils:54 - Fetching http://central.maven.org/maven2/org/apache/spark/spark-sql_2.11/2.3.0/spark-sql_2.11-2.3.0.jar to /var/spark-data/spark-jars/fetchFileTemp1972656710207288539.tmp
2018-07-19 18:16:32 INFO  Utils:54 - Fetching http://central.maven.org/maven2/com/typesafe/config/1.3.2/config-1.3.2.jar to /var/spark-data/spark-jars/fetchFileTemp6053943254686834708.tmp
2018-07-19 18:16:32 INFO  Utils:54 - Fetching http://central.maven.org/maven2/mysql/mysql-connector-java/8.0.11/mysql-connector-java-8.0.11.jar to /var/spark-data/spark-jars/fetchFileTemp2775901386905387678.tmp
2018-07-19 18:16:32 INFO  Utils:54 - Fetching http://dl.bintray.com/spark-packages/maven/RedisLabs/spark-redis/0.3.2/spark-redis-0.3.2.jar to /var/spark-data/spark-jars/fetchFileTemp6328779391289019710.tmp
2018-07-19 18:16:32 INFO  GoogleHadoopFileSystemBase:607 - GHFS version: 1.6.3-hadoop2
2018-07-19 18:16:33 WARN  GoogleHadoopFileSystemBase:1876 - No working directory configured, using default: 'gs://uncoil-artifacts/'
2018-07-19 18:16:33 WARN  GoogleHadoopFileSystemBase:1876 - No working directory configured, using default: 'gs://uncoil-artifacts/'
2018-07-19 18:16:33 INFO  Utils:54 - Fetching gs://uncoil-artifacts/uncoil-job-production/uncoil-job-production.jar to /var/spark-data/spark-jars/fetchFileTemp8154449709090930249.tmp
2018-07-19 18:16:33 INFO  Utils:54 - Fetching gs://uncoil-artifacts/uncoil-job-production/uncoil-job-production.jar to /var/spark-data/spark-jars/fetchFileTemp8629099277871937989.tmp
2018-07-19 18:16:33 WARN  GoogleCloudStorageReadChannel:493 - Channel for 'gs://uncoil-artifacts/uncoil-job-production/uncoil-job-production.jar' is not open.
2018-07-19 18:16:33 INFO  Utils:54 - /var/spark-data/spark-jars/fetchFileTemp8629099277871937989.tmp has been previously copied to /var/spark-data/spark-jars/uncoil-job-production.jar
2018-07-19 18:16:33 WARN  GoogleCloudStorageReadChannel:493 - Channel for 'gs://uncoil-artifacts/uncoil-job-production/uncoil-job-production.jar' is not open.
2018-07-19 18:16:33 INFO  SparkPodInitContainer:54 - Finished downloading application dependencies.

ryancampbell commented 6 years ago

I do see this GHFS version bump (failing 2.3.1 image first, working 2.3.0 image second):

2018-07-19 18:12:28 INFO GoogleHadoopFileSystemBase:637 - GHFS version: 1.9.2-hadoop2

2018-07-19 18:16:32 INFO GoogleHadoopFileSystemBase:607 - GHFS version: 1.6.3-hadoop2
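
Pulling the connector version out of each log makes this kind of difference easy to spot. A small sketch, assuming the "GHFS version: ..." line format shown above; `ghfs_version` is a hypothetical helper name.

```python
import re


def ghfs_version(log_text):
    """Return the GHFS version string reported in a log, or None if absent."""
    match = re.search(r"GHFS version: (\S+)", log_text)
    return match.group(1) if match else None


failing = "2018-07-19 18:12:28 INFO GoogleHadoopFileSystemBase:637 - GHFS version: 1.9.2-hadoop2"
working = "2018-07-19 18:16:32 INFO GoogleHadoopFileSystemBase:607 - GHFS version: 1.6.3-hadoop2"

if ghfs_version(failing) != ghfs_version(working):
    print("connector versions differ:", ghfs_version(failing), "vs", ghfs_version(working))
```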

liyinan926 commented 6 years ago

Interesting. The image with GCS support uses https://storage.googleapis.com/hadoop-lib/gcs/gcs-connector-latest-hadoop2.jar. This must have been updated from version 1.6.3 to 1.9.2.

ryancampbell commented 6 years ago

I switched to https://storage.googleapis.com/hadoop-lib/gcs/gcs-connector-1.6.3-hadoop2.jar and it works! So there must be breaking changes in the latest connector. Closing this; I can open a new ticket for that.
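
Pinning the connector to an explicit version instead of "latest" keeps image builds reproducible. The sketch below builds the download URL from the two URLs quoted in this thread; that other versions follow the same naming pattern is an assumption, and `connector_url` and `download_connector` are hypothetical helper names.

```python
import urllib.request

GCS_CONNECTOR_BASE = "https://storage.googleapis.com/hadoop-lib/gcs"


def connector_url(version):
    """Build the connector download URL for a version such as "1.6.3" or "latest"."""
    return f"{GCS_CONNECTOR_BASE}/gcs-connector-{version}-hadoop2.jar"


def download_connector(version, dest):
    """Download a pinned connector jar to dest (requires network access)."""
    if version == "latest":
        # "latest" can point at a different jar from one image build to the next.
        raise ValueError("pin an explicit version instead of 'latest'")
    urllib.request.urlretrieve(connector_url(version), dest)
    return dest


print(connector_url("1.6.3"))
```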

liyinan926 commented 6 years ago

Cool! I'm looking at the change list at https://github.com/GoogleCloudPlatform/bigdata-interop/blob/fe662298d6c0d892be0468c660d5ca76f8fc0fcc/gcs/CHANGES.txt and trying to figure out what could have broken this.

ryancampbell commented 6 years ago

Did the same and didn't see anything obvious, although I thought it was interesting that fs.gs.project.id is now optional.

liyinan926 commented 6 years ago

Yeah, that becomes optional. I think fs.gs.system.bucket has been deprecated also. I'm gonna dig more on this.