kubeflow / spark-operator

Kubernetes operator for managing the lifecycle of Apache Spark applications on Kubernetes.
Apache License 2.0

Spark 3.0 support #702

Closed. Jeffwan closed this issue 3 years ago.

Jeffwan commented 4 years ago

/question

Spark 3.0-preview is ready and the official release targets next Q1. https://spark.apache.org/news/spark-3.0.0-preview.html

Some of the features may not be supported in spark-operator yet. I'm trying to understand the development cycle. When should we start dev work in spark-operator?

gaocegege commented 4 years ago

Ref https://github.com/GoogleCloudPlatform/spark-on-k8s-operator/issues/693#issuecomment-553707420

liyinan926 commented 4 years ago

Some of the most notable features in Spark 3.0 relevant to Kubernetes are:

  1. Pod templating support: this doesn't matter much for users using the Spark Operator because they define pod templates in the SparkApplication spec.
  2. Kerberos support: ref #306.
  3. An option to use a hostPath volume for the scratch space (work dir).
  4. An option to keep the executor pods after a job is completed.

Both 3 and 4 are just a matter of adding a new optional field to SparkApplicationSpec and setting the corresponding Spark config option, which is straightforward.
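
For reference, both options map to Spark 3.0 configuration properties, so until dedicated fields land in SparkApplicationSpec they could be passed through sparkConf. A rough sketch, using the property names from the upstream Spark 3.0 docs (the eventual operator field names may differ, and the paths here are purely illustrative):

apiVersion: "sparkoperator.k8s.io/v1beta2"
kind: SparkApplication
metadata:
  name: spark-3-new-options   # hypothetical example
spec:
  sparkVersion: "3.0.0"
  sparkConf:
    # Item 4: keep executor pods around after the application finishes.
    "spark.kubernetes.executor.deleteOnTermination": "false"
    # Item 3: mount a hostPath volume named spark-local-dir-* so it is used as scratch space.
    "spark.kubernetes.executor.volumes.hostPath.spark-local-dir-1.mount.path": "/tmp/spark-local-dir"
    "spark.kubernetes.executor.volumes.hostPath.spark-local-dir-1.options.path": "/mnt/fast-disk"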

Jeffwan commented 4 years ago

Hi Yinan, I will take 3 and 4; I already have changes ready for them.

abhisrao commented 4 years ago

Hi @liyinan926, for 2, changes are available as part of this PR.

liyinan926 commented 4 years ago

Thanks @Jeffwan! @abhisrao can you create a PR to upstream your change?

abhisrao commented 4 years ago

Sorry for the delay in responding, @liyinan926. We had to make the changes on a fork branch as there is an internal discussion going on about the CLA. Will update when there is some progress on the CLA aspect.

AceHack commented 4 years ago

Can we get a 3.0 drop of the operator?

der-ali commented 4 years ago

Is it possible to use the operator with spark 3.0 ?

Jeffwan commented 4 years ago

Is it possible to use the operator with spark 3.0 ?

3.0 is not officially released yet; only the preview is available. It's in code freeze and should be ready soon. I think it would be great to cut a 3.0 release once it's out.

ringtail commented 4 years ago

@liyinan926 @Jeffwan @gaocegege PodTemplate support is extremely helpful. We run Spark on serverless Kubernetes (virtual kubelet). See https://issues.apache.org/jira/browse/SPARK-31173

gaocegege commented 4 years ago

@ringtail Yeah I think so. Thus I'd appreciate it if we could migrate to podTemplate.

Jeffwan commented 4 years ago

@ringtail @gaocegege What's the motivation for spark-operator users to use podTemplate? As I understand it, Spark users want podTemplate because using Spark conf to set volumes and envs is tedious with plain spark-submit.

Most spark-operator users are familiar with k8s. spark-operator already has a mutating webhook that injects the envs and PV information pods need, directly and in a native k8s way, so I'm not sure users still want a pod template. Can this meet the requirement? Could you help clarify the use case? If something is missing, then I think we should catch up on the support.
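
For illustration, the declarative route looks roughly like this in the v1beta2 API (treat the field names as a sketch and double-check them against the current CRD; the volume and env values are hypothetical):

apiVersion: "sparkoperator.k8s.io/v1beta2"
kind: SparkApplication
spec:
  volumes:
    - name: config-vol
      configMap:
        name: my-app-config
  driver:
    envVars:
      LOG_LEVEL: "INFO"
    volumeMounts:
      - name: config-vol
        mountPath: /opt/spark/conf/app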

ringtail commented 4 years ago

@ringtail @gaocegege What's the motivation for spark-operator users to use podTemplate? As I understand it, Spark users want podTemplate because using Spark conf to set volumes and envs is tedious with plain spark-submit.

Most spark-operator users are familiar with k8s. spark-operator already has a mutating webhook that injects the envs and PV information pods need, directly and in a native k8s way, so I'm not sure users still want a pod template. Can this meet the requirement? Could you help clarify the use case? If something is missing, then I think we should catch up on the support.

For us, pod creation performance is the main concern. https://issues.apache.org/jira/browse/SPARK-31173

Jeffwan commented 4 years ago

For us, pod creation performance is the main concern. https://issues.apache.org/jira/browse/SPARK-31173

More precisely, I think the webhook drags down the overall performance?

liyinan926 commented 4 years ago

Putting the performance penalty of using the webhook aside, I don't see a need to support pod templates as a first-class citizen in an operator that already has a declarative API for pod configuration. We can definitely have an option to internally convert certain fields in a SparkApplicationSpec into a pod template file and use that for Spark 3.0 instead of the webhook. However, we don't have a plan to add a field of type PodTemplateSpec.
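
As a sketch of that idea (nothing here is implemented; the file paths are hypothetical, and only the podTemplateFile property names come from the Spark 3.0 docs), the operator could render part of the spec into a pod template and hand it to spark-submit:

# Pod template the operator might generate from a SparkApplicationSpec
apiVersion: v1
kind: Pod
spec:
  containers:
    - name: spark-kubernetes-driver
      env:
        - name: LOG_LEVEL
          value: "INFO"

# ...referenced via the Spark 3.0 properties, e.g.:
# spark.kubernetes.driver.podTemplateFile=/tmp/driver-pod-template.yaml
# spark.kubernetes.executor.podTemplateFile=/tmp/executor-pod-template.yaml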

ringtail commented 4 years ago

Putting the performance penalty of using the webhook aside, I don't see a need to support pod templates as a first-class citizen in an operator that already has a declarative API for pod configuration. We can definitely have an option to internally convert certain fields in a SparkApplicationSpec into a pod template file and use that for Spark 3.0 instead of the webhook. However, we don't have a plan to add a field of type PodTemplateSpec.

Yes. I agree with this design of CRD spec.

jkleckner commented 4 years ago

Any insight into the cause of the performance penalty? Is it reproducible on GKE?

ringtail commented 4 years ago

Any insight into the cause of the performance penalty? Is it reproducible on GKE?

You can try to disable the webhook.
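
With the Helm chart that typically means turning the webhook value off; the exact key depends on the chart version (for example enableWebhook in the older incubator chart versus webhook.enable in newer charts), so treat this values.yaml override as a sketch:

# values.yaml override for the spark-operator chart (key name varies by chart version)
webhook:
  enable: false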

jiayue-zhang commented 4 years ago

Hi, is there a Dockerfile for a Spark 3.0 spark-operator image? I followed the one for 2.4.5 (https://github.com/GoogleCloudPlatform/spark-on-k8s-operator/blob/master/spark-docker/Dockerfile), replacing the base image with Spark 3.0.0, but got an error when submitting the Spark app on GKE.

cmontemuino commented 4 years ago

@jiayue-zhang it looks like there's no Spark 3.0 image (https://console.cloud.google.com/gcr/images/spark-operator/GLOBAL/spark?gcrImageListsize=30). I couldn't find the repo where these images are being built.

ringtail commented 4 years ago

Hi, is there a Dockerfile for a Spark 3.0 spark-operator image? I followed the one for 2.4.5 (https://github.com/GoogleCloudPlatform/spark-on-k8s-operator/blob/master/spark-docker/Dockerfile), replacing the base image with Spark 3.0.0, but got an error when submitting the Spark app on GKE.

Could you provide the error log?

jiayue-zhang commented 4 years ago

@cmontemuino That's right. But since spark-operator has developed features on top of Spark 3.0, I think the developers have the image and just haven't officially released it yet. So I'm asking for the Dockerfile so that I can build a 3.0 image on my side, because we want to try Spark 3.0 on GCP.

@ringtail Log 1:

...
org.apache.spark#spark-avro_2.11 added as a dependency
org.influxdb#influxdb-java added as a dependency
:: resolving dependencies :: org.apache.spark#spark-submit-parent-4df89fb7-c826-4ac2-af03-530802ffb429;1.0
confs: [default]
Exception in thread "main" java.io.FileNotFoundException: /opt/spark/.ivy2/cache/resolved-org.apache.spark-spark-submit-parent-4df89fb7-c826-4ac2-af03-530802ffb429-1.0.xml (No such file or directory)
    at java.io.FileOutputStream.open0(Native Method)
    at java.io.FileOutputStream.open(FileOutputStream.java:270)
    at java.io.FileOutputStream.<init>(FileOutputStream.java:213)
    at java.io.FileOutputStream.<init>(FileOutputStream.java:162)
    at org.apache.ivy.plugins.parser.xml.XmlModuleDescriptorWriter.write(XmlModuleDescriptorWriter.java:70)
    at org.apache.ivy.plugins.parser.xml.XmlModuleDescriptorWriter.write(XmlModuleDescriptorWriter.java:62)
    at org.apache.ivy.core.module.descriptor.DefaultModuleDescriptor.toIvyFile(DefaultModuleDescriptor.java:563)
    at org.apache.ivy.core.cache.DefaultResolutionCacheManager.saveResolvedModuleDescriptor(DefaultResolutionCacheManager.java:176)
    at org.apache.ivy.core.resolve.ResolveEngine.resolve(ResolveEngine.java:245)
    at org.apache.ivy.Ivy.resolve(Ivy.java:523)
    at org.apache.spark.deploy.SparkSubmitUtils$.resolveMavenCoordinates(SparkSubmit.scala:1387)
    at org.apache.spark.deploy.DependencyUtils$.resolveMavenDependencies(DependencyUtils.scala:54)
    at org.apache.spark.deploy.SparkSubmit.prepareSubmitEnvironment(SparkSubmit.scala:308)
    at org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:871)
    at org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:180)
    at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:203)
    at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:90)
    at org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:1007)
    at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:1016)
    at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)

It seems to be related to loading extra jars, so I commented out my spark.jars.packages in the yaml to see more logs. This time I got:

Exception in thread "main" java.lang.NoClassDefFoundError: com/google/common/base/Preconditions
    at org.apache.hadoop.conf.Configuration$DeprecationDelta.<init>(Configuration.java:328)
    at org.apache.hadoop.conf.Configuration$DeprecationDelta.<init>(Configuration.java:341)
    at org.apache.hadoop.conf.Configuration.<clinit>(Configuration.java:423)
    at org.apache.spark.deploy.SparkHadoopUtil$.newConfiguration(SparkHadoopUtil.scala:426)
    at org.apache.spark.deploy.SparkSubmit.$anonfun$prepareSubmitEnvironment$2(SparkSubmit.scala:342)
    at scala.Option.getOrElse(Option.scala:189)
    at org.apache.spark.deploy.SparkSubmit.prepareSubmitEnvironment(SparkSubmit.scala:342)
    at org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:871)
    at org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:180)
    at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:203)
    at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:90)
    at org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:1007)
    at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:1016)
    at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Caused by: java.lang.ClassNotFoundException: com.google.common.base.Preconditions
    at java.net.URLClassLoader.findClass(URLClassLoader.java:382)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:418)
    at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:352)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:351)
    ... 14 more

This led me to think that my Dockerfile might not be right. I use our own spark:v2.4.5-gcs image for 2.4.5, following https://github.com/GoogleCloudPlatform/spark-on-k8s-operator/blob/master/spark-docker/Dockerfile. Then I used a similar Dockerfile to build 3.0.0, changing only the ARG SPARK_IMAGE, but got the error. This is my Dockerfile for 3.0.0:

ARG SPARK_IMAGE=gcr.io/our-project/spark:v3.0.0-rc1
FROM ${SPARK_IMAGE}

# Setup dependencies for Google Cloud Storage access.
RUN rm $SPARK_HOME/jars/guava-14.0.1.jar
ADD https://repo1.maven.org/maven2/com/google/guava/guava/23.0/guava-23.0.jar $SPARK_HOME/jars
ADD https://storage.googleapis.com/hadoop-lib/gcs/gcs-connector-hadoop2-2.0.1.jar $SPARK_HOME/jars

ENTRYPOINT ["/opt/entrypoint.sh"]

The reason I'm using gcs-connector-hadoop2-2.0.1.jar rather than gcs-connector-latest-hadoop2.jar is that we need to use Hadoop 2.7, which Spark uses by default. There is more discussion in https://github.com/GoogleCloudDataproc/hadoop-connectors/issues/323, but I don't think the error here is related to that. My yaml file is the same as the one I used for Spark 2.4.5. Some highlights:

apiVersion: "sparkoperator.k8s.io/v1beta2"
kind: SparkApplication
spec:
    image: "gcr.io/our-repo/spark:v3.0.0-rc1-gcs"
    sparkVersion: "3.0.0"
    sparkConf:
        "spark.jars.packages": "org.apache.spark:spark-avro_2.11:2.4.5,org.influxdb:influxdb-java:2.7"

ringtail commented 4 years ago

I am not sure about the error, but if you want to create a Spark base image, you can refer to this repo: https://github.com/AliyunContainerService/spark/blob/alibabacloud-v2.4.5/resource-managers/kubernetes/docker/src/main/dockerfiles/spark/Dockerfile

cmontemuino commented 4 years ago

@jiayue-zhang, looking at the logs you've provided: when working with Spark 3, please bear in mind that Spark 3.0 is compiled against Scala 2.12. Including a package compiled against Scala 2.11 will fail (e.g., org.apache.spark#spark-avro_2.11).
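
Concretely, if you keep spark.jars.packages in the SparkApplication, the coordinates need Scala 2.12 builds that match your Spark version; something along these lines (the artifact versions are illustrative, so verify them on Maven Central):

spec:
  sparkConf:
    # Scala 2.12 builds of the extra packages for a Spark 3 image
    "spark.jars.packages": "org.apache.spark:spark-avro_2.12:3.0.0-preview2,org.influxdb:influxdb-java:2.7"
    # If the default Ivy cache under /opt/spark is not writable in your image,
    # pointing it at a writable directory can also avoid the FileNotFoundException above.
    "spark.jars.ivy": "/tmp/.ivy2"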

An alternative to what @ringtail is suggesting would be downloading the official preview2 package and using docker-image-tool.sh. If plain Spark 3 is fine for you, then that's all you need.

This is a minimal version of the script we currently use to build a raw Spark 3 image:

SPARK_VERSION=3.0.0-preview2
SPARK_PACKAGE=spark-${SPARK_VERSION}-bin-hadoop3.2.tgz
SPARK_SOURCE_DIR=${SPARK_PACKAGE%.tgz}  # strip the .tgz suffix to get the extracted directory name
SPARK_UID=${SPARK_UID:-0}               # the -u/SPARK_UID option is new in the image tooling that ships with preview2

# Download and unpack the official Spark distribution.
wget "https://archive.apache.org/dist/spark/spark-${SPARK_VERSION}/${SPARK_PACKAGE}"
tar xvzf "${SPARK_PACKAGE}"

cd "${SPARK_SOURCE_DIR}"
BUILD_OPTIONS=(-r "$YOUR_DOCKER_REPO" -t "$SOME_TAG" -u "$SPARK_UID")

# Build the Spark image with the bundled tool.
./bin/docker-image-tool.sh "${BUILD_OPTIONS[@]}" build

liyinan926 commented 4 years ago

We currently don't have an operator image based on Spark 3.0.0, since it's not officially out yet. As @cmontemuino suggested, you could build a vanilla Spark 3.0.0-preview2 image using docker-image-tool.sh, and then build a custom operator image based on that Spark image. When installing with the Helm chart, you can specify the operator image you want to use.
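
For example, once a custom operator image is pushed somewhere, the Helm install can point at it with an image override roughly like the following (the value names vary between chart versions, so check the chart's values.yaml):

# Hypothetical Helm values override selecting a custom-built operator image
image:
  repository: gcr.io/your-project/spark-operator
  tag: custom-3.0.0-preview2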

chickenPopcorn commented 4 years ago

It's officially here: https://spark.apache.org/releases/spark-release-3-0-0.html

liyinan926 commented 4 years ago

Will create a new release based on Spark 3.0 soon. Stay tuned. Thanks.

liyinan926 commented 4 years ago

I have just built and pushed the following images:

gcr.io/spark-operator/spark:v3.0.0
gcr.io/spark-operator/spark-py:v3.0.0
gcr.io/spark-operator/spark-operator:v1beta2-1.1.2-3.0.0

Jeffwan commented 4 years ago

@liyinan926

What's the support plan? Do we want to maintain 2.x and 3.x in different branches and cut releases separately, or move to 3.x?

liyinan926 commented 4 years ago

The short-term plan is to switch the master branch to be based on 3.x and maintain 2.x in a separate branch. The long-term plan is to move to 3.x completely. Also, although I haven't tested it yet, I believe a 3.x-based operator image should be able to launch both 2.4.x and 3.x jobs.

hynix commented 4 years ago

I have just built and pushed the following images:

gcr.io/spark-operator/spark:v3.0.0
gcr.io/spark-operator/spark-py:v3.0.0
gcr.io/spark-operator/spark-operator:v1beta2-1.1.2-3.0.0

@liyinan926, could you release Java 11-based images as well?

Thanks

AceHack commented 4 years ago

Yes, Java 11 please.

liyinan926 commented 4 years ago

The latest release now supports pretty much all of the new config options and enhancements introduced in Spark 3.0.0, except for pod template files, which don't make much sense in the context of the operator. The master branch has also been switched to be based on Spark 3.0.