kubeflow / spark-operator

Kubernetes operator for managing the lifecycle of Apache Spark applications on Kubernetes.
Apache License 2.0

After upgrading: getting error jars does not exist, skipping. #1473

Closed yevsh closed 2 months ago

yevsh commented 2 years ago

Upgraded the operator to 3.1.1 (from 2.4.5). In my SparkApplication YAML I have this:

spec:
  deps:
    jars:
      - local:///opt/app/jars/*

now it fails with:

DependencyUtils: Local jar /opt/app/jars/* does not exist, skipping.

I verified the jars are inside the image in that location.

I changed the path to

- local:///opt/app/jars

I no longer get this error, but now I get:

Error: Failed to load my.package.my.classApp: org/springframework/boot/CommandLineRunner

the class is specified here:

 mainApplicationFile: "local:///opt/app/jars/my.jar"
 mainClass: my.package.my.classApp
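
If the glob in `deps.jars` is what's not being expanded, one way to test that theory is to enumerate the jars explicitly instead of using `*`. A minimal sketch (the jar file names below are hypothetical placeholders, not ones from this thread):

```yaml
spec:
  deps:
    jars:
      # Hypothetical file names: list each jar under /opt/app/jars
      # explicitly rather than relying on glob expansion.
      - local:///opt/app/jars/spring-boot.jar
      - local:///opt/app/jars/spring-core.jar
  mainApplicationFile: "local:///opt/app/jars/my.jar"
  mainClass: my.package.my.classApp
```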
aneagoe commented 2 years ago

Please double-check the permissions of the files to ensure Spark can actually read them. Also, try the spark-pi example:

  image: gcr.io/spark-operator/spark:v3.1.1
  imagePullPolicy: Always
  mainApplicationFile: local:///opt/spark/examples/jars/spark-examples_2.12-3.1.1.jar
  mainClass: org.apache.spark.examples.SparkPi

Hope it helps.

yevsh commented 2 years ago

Permissions on the jars in /opt/app/jars?

I explicitly set -rwxr-xr-x on all jars.

Is this definition correct? Maybe it's not loading the Spring jars:

spec:
  deps:
    jars:
      - local:///opt/app/jars/*

With the example image it looks like the jars are loaded; I'm getting some other error:


 ERROR MetricsConfig: Error loading configuration file /etc/metrics/conf/metrics.properties                                                             
java.io.FileNotFoundException: /etc/metrics/conf/metrics.properties (No such file or directory)    

but it's not relevant right now

yevsh commented 2 years ago

I think it might be something with the operator: I just tried the image with Spark 2.4.5 that works with operator 2.4.5, and it failed the same way.

Confirmed: I went to the cluster with operator 2.4.5 and removed this from the YAML:

deps:
  jars:
    - local:///opt/app/jars/*

and got the same problem!!

The operator doesn't see/read/load the deps jars.

aneagoe commented 2 years ago

I doubt it has anything to do with the operator; this is just an orchestrator putting together the resources based on the spec you give it. It doesn't really care about your jars in any way... You'll need to make sure that the jars you specify in config are actually part of the image you're telling spark to use.
Also, you should be able to use an older Spark image with the newer operator, and vice versa. The best approach is to start with a working example (like spark-pi) and work towards your use case. If this used to work on 2.4.5, then use 2.4.5 images with the newer operator. Check out driver/executor options like spark.jars.packages and spark.executor.extraClassPath as well. There were quite a few changes introduced in Spark 3.x which can easily break things, and you might need to adapt the Spark configuration for that. What the operator ultimately does is spin up a Spark instance for you and run spark-submit against it. It will use whatever images you tell it to use, and things will work as long as you use supported Spark configuration.

yevsh commented 2 years ago

The problem with the example is that it doesn't use:

deps:
  jars:
    - local:///opt/app/jars/*

Again, I took the YAML with the 2.4.5 image and launched it in the cluster with operator 2.4.5, and it worked. I took the same YAML and launched it in the cluster with operator 3.1.1, and it failed.

aneagoe commented 2 years ago

So on both clusters you're using the exact same images? Please also check logs from both operators and describe both driver pods to ensure the same version is used.

yevsh commented 2 years ago

Now I tried the image with Spark 3.1.1 in the cluster with operator 2.4.5, and it also works. I don't see anything special in the describe output of either operator.

But you are saying it's OK to use operator 2.4.5 to spin up Spark 3.1.1?

I also went to the cluster with operator 2.4.5, edited the deployment, and replaced the image with v1beta2-1.3.3-3.1.1; that is also not working.

aneagoe commented 2 years ago

It would be great if you could provide a way to reproduce this with a generic application, so anyone can test and help troubleshoot. Yes, it's fine to use operator 2.4.5 with Spark 3.1.1; however, you'll be missing some features of the operator that were introduced since (e.g. support for k8s >= 1.22). When you observe the application crash, can you please fetch logs from the operator and provide them? Perhaps they'll reveal some clue. Ultimately, the only way someone can actually help is if this is reproducible using a generic application that anyone can run and troubleshoot. Last but not least, there were some issues around metrics reported in other tickets, even though in that case it only affected destruction of the application. You could try to enable/disable the metrics server in the operator and observe the behaviour.

yevsh commented 2 years ago

I also went to the cluster with the 3.1.1 operator and changed the operator image to 2.4.5 - and it worked. I switched back to the 3.1.1 image and it stopped working; the log of the sparkapp pod prints:

 WARN DependencyUtils: Local jar /opt/app/jars/* does not exist
Error: Failed to load my.package.my.classApp: org/springframework/boot/CommandLineRunner

I still suspect it has something to do with "deps". I am not using monitoring. k8s version: v1.21.9

I don't see anything special in the operator log... do I need to look for anything in particular?

aneagoe commented 2 years ago

So... running the operator with image v1beta2-1.3.3-3.1.1 and specifying the 2.4.5 image for the driver and worker pods works fine? That would absolve the operator of any suspicion... Perhaps it's worth trying with an explicit jar while ensuring all dependencies are loaded (maybe via spark.executor.extraClassPath and spark.driver.extraClassPath), or quoting the path (i.e. - "local:///opt/app/jars/*"). Regarding the operator logs, I would compare the spark-submit stage in both the working and non-working scenarios to check for differences. Also review the following:

Somehow, I suspect that the spark-submit doesn't pass local:///opt/app/jars/* properly...

yevsh commented 2 years ago

No, no: operator v1beta2-1.3.3-3.1.1 is not working with the 2.4.5 image for the driver.

I just replaced the v1beta2-1.3.3-3.1.1 operator with the 2.4.5 operator - that is what works.

So it must be the operator.

Why would I need to change anything in the YAML if it works on 2.4.5? That includes quoting the path, or specifying anything in extraClassPath.

log.log

IMHO it's related to deps: jars:

yevsh commented 2 years ago

works with:

 "spark.driver.extraClassPath": "local:///opt/app/jars/*"
 "spark.executor.extraClassPath": "local:///opt/app/jars/*"

So what does this mean?

A bug with deps: jars? Can someone check and fix it?

What is the impact of specifying jars this way?
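
For reference, those two keys go under `spec.sparkConf` in the SparkApplication manifest. A minimal sketch of where the workaround sits (the metadata name is a hypothetical placeholder; the paths are the ones from this thread):

```yaml
apiVersion: "sparkoperator.k8s.io/v1beta2"
kind: SparkApplication
metadata:
  name: my-app                      # hypothetical name
spec:
  mainApplicationFile: "local:///opt/app/jars/my.jar"
  mainClass: my.package.my.classApp
  sparkConf:
    # Workaround: put the jar directory on the classpath directly
    # instead of relying on deps.jars glob expansion.
    "spark.driver.extraClassPath": "local:///opt/app/jars/*"
    "spark.executor.extraClassPath": "local:///opt/app/jars/*"
```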

aneagoe commented 2 years ago

You'll want to cross-reference the following between 2.4.5 operator and the newer one:

I0204 13:30:43.220952      10 submission.go:65] spark-submit arguments: [/opt/spark/bin/spark-submit --class my.package.my.classApp --master k8s://https://10.96.0.1:443 --deploy-mode cluster --conf spark.kubernetes.namespace=ns --conf spark.app.name=spark-driver --conf spark.kubernetes.driver.pod.name=spark-driver-driver --jars local:///opt/app/jars/* --conf spark.kubernetes.container.image=myAppImage:latest --conf spark.kubernetes.container.image.pullPolicy=Always --conf spark.kubernetes.submission.waitAppCompletion=false --conf spark.eventLog.enabled=true --conf spark.executor.extraClassPath=local:///opt/app/jars/guava-23.0.jar --conf spark.rdd.compress=true --conf spark.default.parallelism=5 --conf spark.driver.extraClassPath=local:///opt/app/jars/guava-23.0.jar --conf spark.eventLog.dir=file:/mnt/history --conf spark.kubernetes.driver.label.sparkoperator.k8s.io/app-name=spark-driver --conf spark.kubernetes.driver.label.sparkoperator.k8s.io/launched-by-spark-operator=true --conf spark.kubernetes.driver.label.sparkoperator.k8s.io/submission-id=fc6ba1dc-c7c4-4938-a6fa-03e3875652bd --conf spark.driver.cores=1 --conf spark.kubernetes.driver.limit.cores=1000m --conf spark.driver.memory=512m --conf spark.kubernetes.authenticate.driver.serviceAccountName=default --conf spark.driver.extraJavaOptions=-XX:+UseG1GC -XX:InitiatingHeapOccupancyPercent=35 --conf spark.kubernetes.driver.label.app=reference-store-publisher --conf spark.kubernetes.driver.label.version=3.1.1 --conf spark.kubernetes.driverEnv.MONITORING_ENABLED=false --conf spark.kubernetes.executor.label.sparkoperator.k8s.io/app-name=spark-driver --conf spark.kubernetes.executor.label.sparkoperator.k8s.io/launched-by-spark-operator=true --conf spark.kubernetes.executor.label.sparkoperator.k8s.io/submission-id=fc6ba1dc-c7c4-4938-a6fa-03e3875652bd --conf spark.executor.instances=1 --conf spark.executor.cores=1 --conf spark.executor.memory=512m --conf spark.kubernetes.executor.label.app=reference-store-publisher 
--conf spark.kubernetes.executor.label.version=2.4.5 --conf spark.executor.extraJavaOptions=-XX:+UseG1GC -XX:InitiatingHeapOccupancyPercent=35 local:///opt/app/my.jar]

This could simply be a change of behaviour in Spark 3.x: spark-submit runs from the operator image, so it runs with the same Spark version as the operator itself. You can also test this outside k8s and spark-operator by spinning up a cluster and then trying spark-submit from both Spark 3.x and older 2.x to compare behaviour.

yevsh commented 2 years ago

spark-submit arguments: [/opt/spark/bin/spark-submit --class myAppImage --master k8s://https://172.30.0.1:443 --deploy-mode cluster --conf spark.kubernetes.namespace=ns --conf spark.app.name=spark-driver --conf spark.kubernetes.driver.pod.name=spark-driver-driver --jars local:///opt/app/jars/* --conf spark.kubernetes.container.image=myAppImage --conf spark.kubernetes.container.image.pullPolicy=Always --conf spark.kubernetes.submission.waitAppCompletion=false --conf spark.eventLog.enabled=true --conf spark.executor.extraClassPath=local:///opt/app/jars/guava-23.0.jar --conf spark.rdd.compress=true --conf spark.default.parallelism=5 --conf spark.driver.extraClassPath=local:///opt/app/jars/guava-23.0.jar --conf spark.eventLog.dir=file:/mnt/history --conf spark.kubernetes.driver.label.sparkoperator.k8s.io/app-name=spark-driver --conf spark.kubernetes.driver.label.sparkoperator.k8s.io/launched-by-spark-operator=true --conf spark.kubernetes.driver.label.sparkoperator.k8s.io/submission-id=a6b11022-93bf-4f5d-9be3-6995e05e6715 --conf spark.driver.cores=1 --conf spark.kubernetes.driver.limit.cores=1000m --conf spark.driver.memory=512m --conf spark.kubernetes.authenticate.driver.serviceAccountName=default --conf spark.kubernetes.driver.label.version=2.4.5 --conf spark.driver.extraJavaOptions=-XX:+UseG1GC -XX:InitiatingHeapOccupancyPercent=35 --conf spark.kubernetes.driverEnv.MONITORING_ENABLED=false --conf spark.kubernetes.executor.label.sparkoperator.k8s.io/app-name=spark-driver --conf spark.kubernetes.executor.label.sparkoperator.k8s.io/launched-by-spark-operator=true --conf spark.kubernetes.executor.label.sparkoperator.k8s.io/submission-id=a6b11022-93bf-4f5d-9be3-6995e05e6715 --conf spark.executor.instances=1 --conf spark.executor.cores=1 --conf spark.executor.memory=512m --conf spark.kubernetes.executor.label.version=2.4.5 --conf spark.executor.extraJavaOptions=-XX:+UseG1GC -XX:InitiatingHeapOccupancyPercent=35 local:///opt/app/my.jar]

cartermckinnon commented 2 years ago

The problem is that spark-submit 3.x handles local:// values for --jars differently than 2.4.x.

You can see the difference in the configmap created by spark-submit.

With 2.4.x, spark-submit --jars local://my/directory/* results in a configmap containing:

spark.jars=/my/directory/*

Whereas with 3.x, this results in:

spark.jars=local:///my/directory/*

Glob expansion does not happen in the 3.x case, resulting in the "jar not found" log message.
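
The failure mode is easy to demonstrate with an ordinary glob call outside Spark. A minimal Python sketch (an analogy only, not Spark's actual code path, which lives in its Scala DependencyUtils): a glob pattern matches files when it is a plain filesystem path, but matches nothing once a URI scheme prefix is left on the string.

```python
import glob
import os
import tempfile

# Create a directory containing a couple of jar-like files.
tmp = tempfile.mkdtemp()
for name in ("a.jar", "b.jar"):
    open(os.path.join(tmp, name), "w").close()

# Plain filesystem path: the glob expands to the actual files.
plain = glob.glob(os.path.join(tmp, "*"))
print(len(plain))  # 2

# With a "local://" scheme prefix left on the string (analogous to what
# spark-submit 3.x writes into spark.jars), it no longer names a real
# directory, so the glob matches nothing.
prefixed = glob.glob("local://" + os.path.join(tmp, "*"))
print(prefixed)  # []
```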

It's important to note that the version of Spark your application uses, i.e. the version of Spark in your application's container image, does not matter. It's the version of spark-submit that's the problem. That's why downgrading the operator to an image using Spark 2.4.x fixes the problem for you.

I've yet to find a good solution for this in Spark 3.x. Like you suggested, you can try extraClassPath, but the behavior is not the same as --jars. In my case, my application uses a version of Hadoop that's different from Spark's, and this collision results in a NoSuchMethodError. This isn't a problem when --jars is used.

Ultimately, I think I'll have to switch to an uberjar instead of jar directories, which is terrible for layer caching of the container image. It's unfortunate that Spark made this change.

github-actions[bot] commented 3 months ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

github-actions[bot] commented 2 months ago

This issue has been automatically closed because it has not had recent activity. Please comment "/reopen" to reopen it.