Closed: Jeffwan closed this issue 3 years ago.
Some of the most notable features in Spark 3.0 relevant to Kubernetes are: a SparkApplication spec.hostPath volume for the scratch space (work dir). Both 3 and 4 are just a matter of adding a new optional field in SparkApplicationSpec and setting the corresponding Spark config option, which is straightforward.
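For illustration, a hostPath scratch-space volume might be declared roughly like this in a SparkApplicationSpec. This is a sketch only: the `spark-local-dir-` name-prefix convention is how the operator marks volumes as Spark scratch space in v1beta2, but field placement and the app name here are assumptions to be checked against the current API docs.

```yaml
apiVersion: "sparkoperator.k8s.io/v1beta2"
kind: SparkApplication
metadata:
  name: example-app            # hypothetical name
spec:
  volumes:
    - name: "spark-local-dir-1"   # the name prefix marks this as scratch space
      hostPath:
        path: "/tmp/spark-local"  # hypothetical host path
  executor:
    volumeMounts:
      - name: "spark-local-dir-1"
        mountPath: "/tmp/spark-local"
```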
Hi Yinan, I will take 3 and 4, and I already have changes ready for them.
Thanks @Jeffwan! @abhisrao can you create a PR to upstream your change?
Sorry for the delay in responding, @liyinan926. We had to make the changes on a fork branch as there is an internal discussion going on about the CLA. Will update when there is some progress on the CLA aspect.
Can we get a 3.0 drop of the operator?
Is it possible to use the operator with spark 3.0 ?
3.0 is not released officially and only a preview is available. It's in code freeze and should be ready soon. I think it would be great to cut a 3.0 release once it's out.
@liyinan926 @Jeffwan @gaocegege PodTemplate is extremely helpful. We run Spark on serverless Kubernetes (virtual kubelet). Link: https://issues.apache.org/jira/browse/SPARK-31173
@ringtail Yeah I think so. Thus I'd appreciate it if we could migrate to podTemplate.
@ringtail @gaocegege What's the motivation for spark-operator users to use podTemplate? As I understand it, Spark users want podTemplate because using Spark conf to set volumes and envs is tedious when using spark-submit. Most spark-operator users are familiar with k8s. The spark-operator already has a mutating webhook that injects the envs and PV information pods need, directly and in a native k8s way. I'm not sure users would still want to use a pod template. Can this meet the requirement? Could you help clarify the use case? If there's something missing, then I think we should catch up on the support.
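For context, here is a minimal sketch of the declarative pod config the operator's API already offers, which covers much of what a pod template would (field names follow the v1beta2 API; the env var and ConfigMap names are illustrative):

```yaml
spec:
  driver:
    env:
      - name: LOG_LEVEL        # hypothetical env var
        value: "INFO"
    volumeMounts:
      - name: config-vol
        mountPath: /opt/conf
  volumes:
    - name: config-vol
      configMap:
        name: my-app-config    # hypothetical ConfigMap
```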
For us, pod creation performance is the point of concern. https://issues.apache.org/jira/browse/SPARK-31173
More precisely, I think webhook performance drags down the overall performance?
Putting the performance penalty of using the webhook aside, I don't see a need to support pod templates as a first-class citizen in an operator that already has a declarative API for pod config. We can definitely have an option to internally convert certain fields in a SparkApplicationSpec into a pod template file and use that for Spark 3.0 instead of the webhook. However, we don't have a plan to add a field of type PodTemplateSpec.
Yes. I agree with this design of CRD spec.
Any insight into the cause of the performance penalty? Is it reproducible on GKE?
You can try disabling the webhook.
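For reference, the webhook can be turned off through the Helm chart. This is a sketch only: the exact value name varies across chart versions (older charts used enableWebhook, newer ones webhook.enable), so check your chart's values.yaml before relying on it.

```shell
# Sketch: install the operator with the mutating webhook disabled.
# The value name is an assumption -- verify it against your chart version.
helm install spark-operator incubator/sparkoperator \
  --set enableWebhook=false
```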
Hi, is there a Dockerfile for a Spark 3.0 spark-operator? I followed the one for 2.4.5 (https://github.com/GoogleCloudPlatform/spark-on-k8s-operator/blob/master/spark-docker/Dockerfile), replacing the base image with Spark 3.0.0, but got an error when submitting the Spark app on GKE.
@jiayue-zhang it looks like there's no spark 3.0 image: https://console.cloud.google.com/gcr/images/spark-operator/GLOBAL/spark?gcrImageListsize=30 I couldn't find the repo where these images are being built.
Could you provide the error log?
@cmontemuino That's right. But since spark-operator has developed features on top of Spark 3.0, I think the developers have the image and just haven't officially released it yet. So I'm asking for the Dockerfile so that I can build a 3.0 image on my side to work with, because we want to try Spark 3.0 in GCP.
@ringtail Log 1:
...
org.apache.spark#spark-avro_2.11 added as a dependency
org.influxdb#influxdb-java added as a dependency
:: resolving dependencies :: org.apache.spark#spark-submit-parent-4df89fb7-c826-4ac2-af03-530802ffb429;1.0
confs: [default]
Exception in thread "main" java.io.FileNotFoundException: /opt/spark/.ivy2/cache/resolved-org.apache.spark-spark-submit-parent-4df89fb7-c826-4ac2-af03-530802ffb429-1.0.xml (No such file or directory)
at java.io.FileOutputStream.open0(Native Method)
at java.io.FileOutputStream.open(FileOutputStream.java:270)
at java.io.FileOutputStream.<init>(FileOutputStream.java:213)
at java.io.FileOutputStream.<init>(FileOutputStream.java:162)
at org.apache.ivy.plugins.parser.xml.XmlModuleDescriptorWriter.write(XmlModuleDescriptorWriter.java:70)
at org.apache.ivy.plugins.parser.xml.XmlModuleDescriptorWriter.write(XmlModuleDescriptorWriter.java:62)
at org.apache.ivy.core.module.descriptor.DefaultModuleDescriptor.toIvyFile(DefaultModuleDescriptor.java:563)
at org.apache.ivy.core.cache.DefaultResolutionCacheManager.saveResolvedModuleDescriptor(DefaultResolutionCacheManager.java:176)
at org.apache.ivy.core.resolve.ResolveEngine.resolve(ResolveEngine.java:245)
at org.apache.ivy.Ivy.resolve(Ivy.java:523)
at org.apache.spark.deploy.SparkSubmitUtils$.resolveMavenCoordinates(SparkSubmit.scala:1387)
at org.apache.spark.deploy.DependencyUtils$.resolveMavenDependencies(DependencyUtils.scala:54)
at org.apache.spark.deploy.SparkSubmit.prepareSubmitEnvironment(SparkSubmit.scala:308)
at org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:871)
at org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:180)
at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:203)
at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:90)
at org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:1007)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:1016)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
It seems to be related to loading extra jars. So I commented out my spark.jars.packages in the yaml to see more logs. This time I got:
Exception in thread "main" java.lang.NoClassDefFoundError: com/google/common/base/Preconditions
at org.apache.hadoop.conf.Configuration$DeprecationDelta.<init>(Configuration.java:328)
at org.apache.hadoop.conf.Configuration$DeprecationDelta.<init>(Configuration.java:341)
at org.apache.hadoop.conf.Configuration.<clinit>(Configuration.java:423)
at org.apache.spark.deploy.SparkHadoopUtil$.newConfiguration(SparkHadoopUtil.scala:426)
at org.apache.spark.deploy.SparkSubmit.$anonfun$prepareSubmitEnvironment$2(SparkSubmit.scala:342)
at scala.Option.getOrElse(Option.scala:189)
at org.apache.spark.deploy.SparkSubmit.prepareSubmitEnvironment(SparkSubmit.scala:342)
at org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:871)
at org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:180)
at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:203)
at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:90)
at org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:1007)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:1016)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Caused by: java.lang.ClassNotFoundException: com.google.common.base.Preconditions
at java.net.URLClassLoader.findClass(URLClassLoader.java:382)
at java.lang.ClassLoader.loadClass(ClassLoader.java:418)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:352)
at java.lang.ClassLoader.loadClass(ClassLoader.java:351)
... 14 more
This led me to think that my Dockerfile might not be right. I use our own spark:v2.4.5-gcs image for 2.4.5, following https://github.com/GoogleCloudPlatform/spark-on-k8s-operator/blob/master/spark-docker/Dockerfile. Then I used a similar Dockerfile to build 3.0.0, changing only the ARG SPARK_IMAGE, but got the error. This is my Dockerfile for 3.0.0:
ARG SPARK_IMAGE=gcr.io/our-project/spark:v3.0.0-rc1
FROM ${SPARK_IMAGE}
# Setup dependencies for Google Cloud Storage access.
RUN rm $SPARK_HOME/jars/guava-14.0.1.jar
ADD https://repo1.maven.org/maven2/com/google/guava/guava/23.0/guava-23.0.jar $SPARK_HOME/jars
ADD https://storage.googleapis.com/hadoop-lib/gcs/gcs-connector-hadoop2-2.0.1.jar $SPARK_HOME/jars
ENTRYPOINT ["/opt/entrypoint.sh"]
The reason I'm using gcs-connector-hadoop2-2.0.1.jar and not gcs-connector-latest-hadoop2.jar is that we need to use Hadoop 2.7, which is what Spark uses by default. More discussion is in https://github.com/GoogleCloudDataproc/hadoop-connectors/issues/323, but I don't think the error here is related to that.
My yaml file is the same as the one I used for Spark 2.4.5. Some highlights are:
apiVersion: "sparkoperator.k8s.io/v1beta2"
kind: SparkApplication
spec:
  image: "gcr.io/our-repo/spark:v3.0.0-rc1-gcs"
  sparkVersion: "3.0.0"
  sparkConf:
    "spark.jars.packages": "org.apache.spark:spark-avro_2.11:2.4.5,org.influxdb:influxdb-java:2.7"
I am not sure about the error, but if you want to create a Spark base image, you can refer to this repo: https://github.com/AliyunContainerService/spark/blob/alibabacloud-v2.4.5/resource-managers/kubernetes/docker/src/main/dockerfiles/spark/Dockerfile
@jiayue-zhang, looking at the logs you've provided: when working with Spark 3, please bear in mind that Spark 3.0 is compiled against Scala 2.12. If you include a package compiled against 2.11, it should fail (e.g., org.apache.spark#spark-avro_2.11). An alternative to what @ringtail is suggesting would be downloading the official preview2 package and using the docker-image-tool. If plain Spark 3 is fine with you, then that's all you need.
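Concretely, given the Scala 2.12 point above, the package coordinates in the sparkConf would need the _2.12 suffix for Spark 3.0. A sketch (the artifact versions shown are illustrative, not verified against Maven Central):

```yaml
sparkConf:
  # _2.12 artifacts for a Scala 2.12 based Spark 3.0 runtime (versions illustrative)
  "spark.jars.packages": "org.apache.spark:spark-avro_2.12:3.0.0,org.influxdb:influxdb-java:2.7"
```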
This is a minimal script of what we currently do for building a raw Spark 3 image:
SPARK_VERSION=3.0.0-preview2
SPARK_PACKAGE=spark-${SPARK_VERSION}-bin-hadoop3.2.tgz
SPARK_SOURCE_DIR=${SPARK_PACKAGE%.tgz}
SPARK_UID=${SPARK_UID:-0} # SPARK_UID is new in the image shipped with preview2
wget "https://archive.apache.org/dist/spark/spark-${SPARK_VERSION}/${SPARK_PACKAGE}"
tar xvzf "${SPARK_PACKAGE}"
cd "${SPARK_SOURCE_DIR}"
BUILD_OPTIONS=(-r "$YOUR_DOCKER_REPO" -t "$SOME_TAG" -u "$SPARK_UID")
./bin/docker-image-tool.sh "${BUILD_OPTIONS[@]}" build
We currently don't have an operator image based on Spark 3.0.0 yet, since it's not officially out. As @cmontemuino suggested, you could build a vanilla Spark 3.0.0-preview2 image using docker-image-tool.sh, and then build a custom operator image based on that Spark image. When installing using the Helm chart, you can specify the operator image you want to use.
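A rough sketch of those steps. The repo names, tags, and Helm value names below are assumptions for illustration; the operator's Dockerfile accepting a SPARK_IMAGE build arg should be verified against the version you check out.

```shell
# 1. Build a vanilla Spark 3.0.0-preview2 image with docker-image-tool.sh
#    (see the script earlier in this thread).

# 2. Build the operator image on top of it; SPARK_IMAGE is assumed to be the
#    base-image build arg in the operator's Dockerfile.
docker build \
  --build-arg SPARK_IMAGE=your-repo/spark:3.0.0-preview2 \
  -t your-repo/spark-operator:3.0.0-preview2 .

# 3. Point the Helm chart at the custom image
#    (value names may differ by chart version -- check values.yaml).
helm install spark-operator incubator/sparkoperator \
  --set operatorImageName=your-repo/spark-operator \
  --set operatorVersion=3.0.0-preview2
```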
It's officially here: https://spark.apache.org/releases/spark-release-3-0-0.html
Will create a new release based on Spark 3.0 soon. Stay tuned. Thanks.
I have just built and pushed the following images:
gcr.io/spark-operator/spark:v3.0.0
gcr.io/spark-operator/spark-py:v3.0.0
gcr.io/spark-operator/spark-operator:v1beta2-1.1.2-3.0.0
@liyinan926 What's the support plan? Do we want to maintain 2.x and 3.x in different branches and cut releases separately, or move to 3.x?
The short-term plan is to switch the master branch to be based on 3.x and maintain 2.x in a separate branch. The long-term plan is to move to 3.x completely. Although I haven't tested it yet, I believe a 3.x-based operator image should be able to launch jobs on both 2.4.x and 3.x.
@liyinan926, could you release Java 11-based images as well? Thanks
Yes, Java 11 please.
We now support in the latest release pretty much all the new config options and enhancements introduced in Spark 3.0.0 except for pod template files, which don't make much sense in the context of the operator. The master branch has also been switched to be based on Spark 3.0.
/question
Spark 3.0-preview is ready and the official release targets next Q1. https://spark.apache.org/news/spark-3.0.0-preview.html
Some of the features may not be supported in spark-operator. Trying to understand the development cycle: when should we start dev work in spark-operator?