Closed yevsh closed 2 months ago
Please double-check the permissions of the files to ensure Spark can actually read them. Also, try it with the spark-pi example:

```yaml
image: gcr.io/spark-operator/spark:v3.1.1
imagePullPolicy: Always
mainApplicationFile: local:///opt/spark/examples/jars/spark-examples_2.12-3.1.1.jar
mainClass: org.apache.spark.examples.SparkPi
```

Hope it helps.
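For reference, a minimal SparkApplication manifest around the snippet above might look like the following (a sketch only; the name, namespace, and service account are placeholders you would adapt to your cluster):

```yaml
apiVersion: "sparkoperator.k8s.io/v1beta2"
kind: SparkApplication
metadata:
  name: spark-pi          # placeholder name
  namespace: default      # adjust to your namespace
spec:
  type: Scala
  mode: cluster
  image: gcr.io/spark-operator/spark:v3.1.1
  imagePullPolicy: Always
  mainClass: org.apache.spark.examples.SparkPi
  mainApplicationFile: local:///opt/spark/examples/jars/spark-examples_2.12-3.1.1.jar
  sparkVersion: "3.1.1"
  driver:
    cores: 1
    memory: 512m
    serviceAccount: spark  # assumes a service account with pod-create permissions
  executor:
    instances: 1
    cores: 1
    memory: 512m
```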
Permissions on the jars in `/opt/app/jars`? I explicitly set `-rwxr-xr-x` on all jars.

Is this the correct definition? Maybe it's not loading the Spring jars:

```yaml
spec:
  deps:
    jars:
      - local:///opt/app/jars/*
```
With the example image, it looks like the jars are loaded, but I'm getting some other error:

```
ERROR MetricsConfig: Error loading configuration file /etc/metrics/conf/metrics.properties
java.io.FileNotFoundException: /etc/metrics/conf/metrics.properties (No such file or directory)
```

but that's not relevant right now.
I think it might be something with the operator: I just tried an image with Spark 2.4.5 that works with operator 2.4.5, and it failed in the same way.

Confirmed: I went to a cluster with operator 2.4.5, removed this from the yaml:

```yaml
deps:
  jars:
    - local:///opt/app/jars/*
```

and got the same problem! The operator doesn't see/read/load the deps jars.
I doubt it has anything to do with the operator; this is just an orchestrator putting together the resources based on the spec you give it. It doesn't really care about your jars in any way... You'll need to make sure that the jars you specify in config are actually part of the image you're telling spark to use.
Also, you should be able to use older spark image with the newer operator, and vice-versa.
The best approach is to start with a working example (like spark-pi) and work towards your use case. If this used to work on 2.4.5, then use 2.4.5 images with the newer operator. Check out driver/executor options like `spark.jars.packages` and `spark.executor.extraClassPath` as well. There were quite a few changes introduced in Spark 3.x which can easily break things for you, and you might need to adapt the Spark configuration accordingly.
What the operator does ultimately is to spin up a spark instance for you and run spark-submit against it. It will use whatever images you tell it to use and things will work as long as you use supported spark configuration.
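As a sketch of those options (the Maven coordinate is illustrative; `guava-23.0.jar` and `/opt/app/jars` come from this thread), they can be set under `sparkConf` in the SparkApplication spec:

```yaml
spec:
  sparkConf:
    # Resolve dependencies from a Maven repository at submit time
    # (the coordinate below is only an example)
    "spark.jars.packages": "com.google.guava:guava:23.0"
    # Or put jars already baked into the image directly on the JVM classpath;
    # extraClassPath takes a plain classpath, so a directory glob is allowed
    "spark.driver.extraClassPath": "/opt/app/jars/*"
    "spark.executor.extraClassPath": "/opt/app/jars/*"
```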
The problem with the example is that it's not using:

```yaml
deps:
  jars:
    - local:///opt/app/jars/*
```

Again, I took the yaml with the 2.4.5 image and launched it in the cluster with operator 2.4.5, and it worked. I took the same yaml and launched it in the cluster with operator 3.1.1, and it failed.
So on both clusters you're using the exact same images? Please also check logs from both operators and describe both driver pods to ensure the same version is used.
Now I tried the image with Spark 3.1.1 in the cluster with operator 2.4.5, and it also works. I don't see anything special in the describes of either operator.
But are you saying it's OK to use operator 2.4.5 to spin up Spark 3.1.1?
I also went to the cluster with operator 2.4.5, edited its deployment, and replaced the image with v1beta2-1.3.3-3.1.1; that also did not work.
It would be great if you could provide a way to reproduce this with a generic application, so anyone can test and help troubleshoot.

Yes, it's fine to use operator 2.4.5 with Spark 3.1.1; however, you'll be missing some features introduced in the operator since then (e.g. support for k8s >= 1.22).

When you observe the application crash, can you please fetch the logs from the operator and provide them? Perhaps they'll reveal some clue. Ultimately, the only way someone can actually help is if this is reproducible using a generic application that anyone can run and troubleshoot.

Last but not least, there were some issues around metrics reported in some other tickets, even though in that case it only affected destruction of the application. You could try to enable/disable the metrics server in the operator and observe the behaviour.
I also went to the cluster with the 3.1.1 operator and changed the operator image to 2.4.5, and it worked. I switched back to the 3.1.1 image and it stopped working. The sparkapp pod log prints:

```
WARN DependencyUtils: Local jar /opt/app/jars/* does not exist
Error: Failed to load my.package.my.classApp: org/springframework/boot/CommandLineRunner
```
I still suspect it has something to do with `deps`. I am not using monitoring. k8s version: v1.21.9.
I don't see anything special in the operator log... do I need to look for anything?
So... running the operator with image v1beta2-1.3.3-3.1.1 and specifying a 2.4.5 image for the driver and executor pods works fine? That would absolve the operator of any suspicion...

Perhaps it's worth trying with an explicit jar while ensuring all dependencies are loaded (maybe via `spark.executor.extraClassPath` and `spark.driver.extraClassPath`). Or maybe try quoting the path (i.e. `- "local:///opt/app/jars/*"`).

Regarding the operator logs... I would compare the spark-submit stage in both the working and non-working scenarios to check for differences. Somehow, I suspect that spark-submit doesn't pass `local:///opt/app/jars/*` properly...
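As a sketch of that suggestion (jar names are placeholders; the paths follow this thread's layout), quoting the glob and/or listing jars explicitly would look like:

```yaml
spec:
  deps:
    jars:
      # quote the glob so YAML passes it through verbatim
      - "local:///opt/app/jars/*"
      # or list jars explicitly instead of relying on glob expansion, e.g.:
      # - "local:///opt/app/jars/guava-23.0.jar"
  sparkConf:
    # belt-and-braces: also put the directory on the JVM classpath
    "spark.driver.extraClassPath": "/opt/app/jars/*"
    "spark.executor.extraClassPath": "/opt/app/jars/*"
```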
No no, operator v1beta2-1.3.3-3.1.1 is not working with the 2.4.5 image for the driver.
I just replaced the v1beta2-1.3.3-3.1.1 operator with the 2.4.5 operator; that is what works.
So it must be the operator.
Why would I need to change anything in the yaml if it works on 2.4.5? That includes quoting anything, or specifying anything in `extraClassPath`.
IMHO it's related to `deps: jars:`.

It works with:

```yaml
"spark.driver.extraClassPath": "local:///opt/app/jars/*"
"spark.executor.extraClassPath": "local:///opt/app/jars/*"
```

So what does that mean? A bug with `deps: jars:`? Can someone check and fix?
What is the impact of specifying jars this way?
You'll want to cross-reference the following between the 2.4.5 operator and the newer one:

```
I0204 13:30:43.220952 10 submission.go:65] spark-submit arguments: [/opt/spark/bin/spark-submit --class my.package.my.classApp --master k8s://https://10.96.0.1:443 --deploy-mode cluster --conf spark.kubernetes.namespace=ns --conf spark.app.name=spark-driver --conf spark.kubernetes.driver.pod.name=spark-driver-driver --jars local:///opt/app/jars/* --conf spark.kubernetes.container.image=myAppImage:latest --conf spark.kubernetes.container.image.pullPolicy=Always --conf spark.kubernetes.submission.waitAppCompletion=false --conf spark.eventLog.enabled=true --conf spark.executor.extraClassPath=local:///opt/app/jars/guava-23.0.jar --conf spark.rdd.compress=true --conf spark.default.parallelism=5 --conf spark.driver.extraClassPath=local:///opt/app/jars/guava-23.0.jar --conf spark.eventLog.dir=file:/mnt/history --conf spark.kubernetes.driver.label.sparkoperator.k8s.io/app-name=spark-driver --conf spark.kubernetes.driver.label.sparkoperator.k8s.io/launched-by-spark-operator=true --conf spark.kubernetes.driver.label.sparkoperator.k8s.io/submission-id=fc6ba1dc-c7c4-4938-a6fa-03e3875652bd --conf spark.driver.cores=1 --conf spark.kubernetes.driver.limit.cores=1000m --conf spark.driver.memory=512m --conf spark.kubernetes.authenticate.driver.serviceAccountName=default --conf spark.driver.extraJavaOptions=-XX:+UseG1GC -XX:InitiatingHeapOccupancyPercent=35 --conf spark.kubernetes.driver.label.app=reference-store-publisher --conf spark.kubernetes.driver.label.version=3.1.1 --conf spark.kubernetes.driverEnv.MONITORING_ENABLED=false --conf spark.kubernetes.executor.label.sparkoperator.k8s.io/app-name=spark-driver --conf spark.kubernetes.executor.label.sparkoperator.k8s.io/launched-by-spark-operator=true --conf spark.kubernetes.executor.label.sparkoperator.k8s.io/submission-id=fc6ba1dc-c7c4-4938-a6fa-03e3875652bd --conf spark.executor.instances=1 --conf spark.executor.cores=1 --conf spark.executor.memory=512m --conf spark.kubernetes.executor.label.app=reference-store-publisher --conf spark.kubernetes.executor.label.version=2.4.5 --conf spark.executor.extraJavaOptions=-XX:+UseG1GC -XX:InitiatingHeapOccupancyPercent=35 local:///opt/app/my.jar]
```
This could simply be a change of behaviour in Spark 3.x; spark-submit runs from the operator image, and as such it runs with the same Spark version as the operator itself. You can also test this outside k8s and the spark-operator by spinning up a cluster and then trying spark-submit from both Spark 3.x and older 2.x to compare the behaviour.
```
spark-submit arguments: [/opt/spark/bin/spark-submit --class myAppImage --master k8s://https://172.30.0.1:443 --deploy-mode cluster --conf spark.kubernetes.namespace=ns --conf spark.app.name=spark-driver --conf spark.kubernetes.driver.pod.name=spark-driver-driver --jars local:///opt/app/jars/* --conf spark.kubernetes.container.image=myAppImage --conf spark.kubernetes.container.image.pullPolicy=Always --conf spark.kubernetes.submission.waitAppCompletion=false --conf spark.eventLog.enabled=true --conf spark.executor.extraClassPath=local:///opt/app/jars/guava-23.0.jar --conf spark.rdd.compress=true --conf spark.default.parallelism=5 --conf spark.driver.extraClassPath=local:///opt/app/jars/guava-23.0.jar --conf spark.eventLog.dir=file:/mnt/history --conf spark.kubernetes.driver.label.sparkoperator.k8s.io/app-name=spark-driver --conf spark.kubernetes.driver.label.sparkoperator.k8s.io/launched-by-spark-operator=true --conf spark.kubernetes.driver.label.sparkoperator.k8s.io/submission-id=a6b11022-93bf-4f5d-9be3-6995e05e6715 --conf spark.driver.cores=1 --conf spark.kubernetes.driver.limit.cores=1000m --conf spark.driver.memory=512m --conf spark.kubernetes.authenticate.driver.serviceAccountName=default --conf spark.kubernetes.driver.label.version=2.4.5 --conf spark.driver.extraJavaOptions=-XX:+UseG1GC -XX:InitiatingHeapOccupancyPercent=35 --conf spark.kubernetes.driverEnv.MONITORING_ENABLED=false --conf spark.kubernetes.executor.label.sparkoperator.k8s.io/app-name=spark-driver --conf spark.kubernetes.executor.label.sparkoperator.k8s.io/launched-by-spark-operator=true --conf spark.kubernetes.executor.label.sparkoperator.k8s.io/submission-id=a6b11022-93bf-4f5d-9be3-6995e05e6715 --conf spark.executor.instances=1 --conf spark.executor.cores=1 --conf spark.executor.memory=512m --conf spark.kubernetes.executor.label.version=2.4.5 --conf spark.executor.extraJavaOptions=-XX:+UseG1GC -XX:InitiatingHeapOccupancyPercent=35 local:///opt/app/my.jar]
```
The problem is that spark-submit 3.x handles `local://` values for `--jars` differently than 2.4.x. You can see the difference in the configmap created by spark-submit.

With 2.4.x, `spark-submit --jars local://my/directory/*` results in a configmap containing:

```
spark.jars=/my/directory/*
```

Whereas with 3.x, this results in:

```
spark.jars=local:///my/directory/*
```

Glob expansion does not happen in the 3.x case, resulting in the "jar not found" log message.
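One possible workaround (a sketch, assuming the jars live under `/opt/app/jars` in the image as in this thread) is to expand the glob yourself and pass an explicit, comma-separated list to `--jars`, since Spark 3.x no longer expands it at submit time:

```shell
#!/bin/sh
# Build an explicit, comma-separated --jars value instead of a local:// glob.
# /opt/app/jars is illustrative; adjust to wherever your image stores the jars.
JARS=$(for j in /opt/app/jars/*.jar; do printf 'local://%s,' "$j"; done)
JARS=${JARS%,}   # strip the trailing comma
echo "$JARS"
# then pass it along, e.g.:
#   spark-submit --jars "$JARS" ... local:///opt/app/my.jar
```

This only helps where you control the spark-submit invocation; in a SparkApplication spec you would bake the equivalent explicit list into `deps.jars`.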
Important to note: the version of Spark your application is using, i.e. the version of Spark in your application's container image, does not matter. It's the version of spark-submit that's the problem. That's why downgrading the operator to an image using Spark 2.4.x fixes the problem for you.

I've yet to find a good solution for this in Spark 3.x. Like you suggested, you can try `extraClassPath`, but the behavior is not the same as `--jars`. In my case, my application uses a version of Hadoop that's different from Spark's, and this collision results in a `NoSuchMethodError`. This isn't a problem when `--jars` is used.

Ultimately, I think I'll have to switch to an uberjar instead of jar directories, which is terrible for layer caching of the container image. It's unfortunate that Spark made this change.
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
This issue has been automatically closed because it has not had recent activity. Please comment "/reopen" to reopen it.
Upgraded to operator 3.1.1 (from 2.4.5); in the SparkApplication yaml I have this:

Now it fails with:

```
DependencyUtils: Local jar /opt/app/jars/* does not exist, skipping.
```

I verified the jars are inside the image at that location.

I changed the path to

```yaml
- local:///opt/app/jars
```

and I'm not getting this error any more, but now I get:

```
Error: Failed to load my.package.my.classApp: org/springframework/boot/CommandLineRunner
```

This class is provided: