kubeflow / spark-operator

Kubernetes operator for managing the lifecycle of Apache Spark applications on Kubernetes.
Apache License 2.0

java.lang.ClassNotFoundException: Class org.apache.hadoop.fs.s3a.S3AFileSystem #1508

Open farshadsm opened 2 years ago

farshadsm commented 2 years ago

Hi, I get the following error when I submit my Spark job to the k8s cluster using the Spark Operator.

java.lang.RuntimeException: java.lang.ClassNotFoundException: Class org.apache.hadoop.fs.s3a.S3AFileSystem

I've put my yaml file configs below. I've made sure "hadoop-aws" and "aws-java-sdk" have compatible versions. I was able to run the job successfully for the "pi.py" script that ships at "/opt/spark/examples/src/main/python/pi.py" in the container image used with the spark operator. However, when Spark in my Python script tries to read a CSV file from an AWS S3 bucket, I get the error message shown above. I've tried many different versions of hadoop-aws, and none of them resolved the issue. Could you please help me out?

apiVersion: "sparkoperator.k8s.io/v1beta2"
kind: SparkApplication
metadata:
  name: test-spark-hadoop-aws-3.2.3
  namespace: default
spec:
  deps:
    repositories:
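
For context, a minimal SparkApplication of this general shape, reading from S3 via s3a, might look roughly like the sketch below. The image, script path, service account, and package versions here are illustrative (not the reporter's actual values) and must match the Hadoop version bundled in your Spark image:

apiVersion: "sparkoperator.k8s.io/v1beta2"
kind: SparkApplication
metadata:
  name: read-s3-example
  namespace: default
spec:
  type: Python
  pythonVersion: "3"
  mode: cluster
  image: "gcr.io/spark-operator/spark-py:v3.1.1-hadoop3"   # illustrative image tag
  mainApplicationFile: "local:///opt/spark/examples/src/main/python/pi.py"   # replace with your script
  sparkVersion: "3.1.1"
  deps:
    repositories:
      - https://repo.maven.apache.org/maven2/
    packages:
      # hadoop-aws must match the hadoop-* jars already present in the image
      - org.apache.hadoop:hadoop-aws:3.2.0
      - com.amazonaws:aws-java-sdk-bundle:1.11.375
  hadoopConf:
    fs.s3a.impl: org.apache.hadoop.fs.s3a.S3AFileSystem
    fs.s3a.aws.credentials.provider: com.amazonaws.auth.DefaultAWSCredentialsProviderChain
  driver:
    cores: 1
    memory: "1g"
    serviceAccount: spark   # assumes an existing "spark" service account
  executor:
    cores: 1
    instances: 1
    memory: "1g"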

farshadsm commented 2 years ago

Following on my previous comment, I should mention that the kubectl logs show the following download statistics:

---------------------------------------------------------------------
|                  |            modules            ||   artifacts   |
|       conf       | number| search|dwnlded|evicted|| number|dwnlded|
---------------------------------------------------------------------
|      default     |  253  |  251  |  251  |   2   ||  251  |  251  |
---------------------------------------------------------------------

It also shows the following lines:

:: problems summary ::
:::: ERRORS
SERVER ERROR: Bad Gateway url=https://dl.bintray.com/spark-packages/maven/org/apache/hadoop/hadoop-main/3.2.3/hadoop-main-3.2.3.jar

SERVER ERROR: Bad Gateway url=https://dl.bintray.com/spark-packages/maven/org/apache/hadoop/hadoop-project/3.2.3/hadoop-project-3.2.3.jar

SERVER ERROR: Bad Gateway url=https://dl.bintray.com/spark-packages/maven/com/amazonaws/aws-java-sdk-pom/1.11.901/aws-java-sdk-pom-1.11.901.jar

SERVER ERROR: Bad Gateway url=https://dl.bintray.com/spark-packages/maven/com/amazonaws/aws-java-sdk/1.11.901/aws-java-sdk-1.11.901-sources.jar

SERVER ERROR: Bad Gateway url=https://dl.bintray.com/spark-packages/maven/com/amazonaws/aws-java-sdk/1.11.901/aws-java-sdk-1.11.901-src.jar

SERVER ERROR: Bad Gateway url=https://dl.bintray.com/spark-packages/maven/com/amazonaws/aws-java-sdk/1.11.901/aws-java-sdk-1.11.901-javadoc.jar

SERVER ERROR: Bad Gateway url=https://dl.bintray.com/spark-packages/maven/org/apache/apache/13/apache-13.jar

SERVER ERROR: Bad Gateway url=https://dl.bintray.com/spark-packages/maven/org/apache/commons/commons-parent/28/commons-parent-28.jar

SERVER ERROR: Bad Gateway url=https://dl.bintray.com/spark-packages/maven/org/apache/apache/21/apache-21.jar

SERVER ERROR: Bad Gateway url=https://dl.bintray.com/spark-packages/maven/org/apache/httpcomponents/httpcomponents-parent/11/httpcomponents-parent-11.jar

SERVER ERROR: Bad Gateway url=https://dl.bintray.com/spark-packages/maven/org/apache/httpcomponents/httpcomponents-client/4.5.13/httpcomponents-client-4.5.13.jar

SERVER ERROR: Bad Gateway url=https://dl.bintray.com/spark-packages/maven/org/apache/httpcomponents/httpcomponents-core/4.4.13/httpcomponents-core-4.4.13.jar

SERVER ERROR: Bad Gateway url=https://dl.bintray.com/spark-packages/maven/org/apache/apache/18/apache-18.jar

SERVER ERROR: Bad Gateway url=https://dl.bintray.com/spark-packages/maven/org/apache/commons/commons-parent/42/commons-parent-42.jar

SERVER ERROR: Bad Gateway url=https://dl.bintray.com/spark-packages/maven/com/fasterxml/oss-parent/24/oss-parent-24.jar

SERVER ERROR: Bad Gateway url=https://dl.bintray.com/spark-packages/maven/com/fasterxml/jackson/jackson-parent/2.6.2/jackson-parent-2.6.2.jar

SERVER ERROR: Bad Gateway url=https://dl.bintray.com/spark-packages/maven/com/fasterxml/jackson/core/jackson-databind/2.6.7.3/jackson-databind-2.6.7.3-javadoc.jar

SERVER ERROR: Bad Gateway url=https://dl.bintray.com/spark-packages/maven/com/fasterxml/oss-parent/23/oss-parent-23.jar

SERVER ERROR: Bad Gateway url=https://dl.bintray.com/spark-packages/maven/com/fasterxml/jackson/jackson-parent/2.6.1/jackson-parent-2.6.1.jar

SERVER ERROR: Bad Gateway url=https://dl.bintray.com/spark-packages/maven/org/sonatype/oss/oss-parent/9/oss-parent-9.jar

SERVER ERROR: Bad Gateway url=https://dl.bintray.com/spark-packages/maven/io/netty/netty-parent/4.1.48.Final/netty-parent-4.1.48.Final.jar

SERVER ERROR: Bad Gateway url=https://dl.bintray.com/spark-packages/maven/com/amazonaws/aws-java-sdk-models/1.11.901/aws-java-sdk-models-1.11.901-javadoc.jar

SERVER ERROR: Bad Gateway url=https://dl.bintray.com/spark-packages/maven/com/amazonaws/aws-java-sdk-pom/1.11.22/aws-java-sdk-pom-1.11.22.jar

:: USE VERBOSE OR DEBUG MESSAGE LEVEL FOR MORE DETAILS
:: retrieving :: org.apache.spark#spark-submit-parent-0d835d60-447d-4a98-941b-e22aaa69903c
confs: [default]
251 artifacts copied, 0 already retrieved (374746kB/445ms)
22/04/09 21:34:55 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
22/04/09 21:34:56 WARN MetricsConfig: Cannot locate configuration: tried hadoop-metrics2-s3a-file-system.properties,hadoop-metrics2.properties

jkleckner commented 2 years ago

This might be related?

JFrog to Shut down JCenter and Bintray https://www.infoq.com/news/2021/02/jfrog-jcenter-bintray-closure/

farshadsm commented 2 years ago

Thanks for pointing me to the link about Bintray. What should I put into my yaml file so that nothing is downloaded from JCenter/Bintray? In my yaml file, I set "https://repo.maven.apache.org/maven2/" for ".spec.deps.repositories". I was hoping that with this setting, no reference to Bintray would be made.

Zhang-Aoqi commented 2 years ago

Have you solved this problem yet? I ran into a similar problem. I can find the class org.apache.hadoop.fs.s3a.S3AFileSystem now, but now I get java.lang.ClassNotFoundException: org.apache.hadoop.fs.StreamCapabilities.

hyungryuk commented 2 years ago

@Zhang-Aoqi Same issue here, and I finally figured it out! That error does not come from your Spark app; it happens in your spark-operator pod. In my case, my Spark app depends on Hadoop 3.2, but the spark-operator pod that I installed using Helm has Hadoop 2.7 jar files. Please check whether yours is fine :)

Zhang-Aoqi commented 2 years ago

@Zhang-Aoqi Same issue here, and I finally figured it out! That error does not come from your Spark app; it happens in your spark-operator pod. In my case, my Spark app depends on Hadoop 3.2, but the spark-operator pod that I installed using Helm has Hadoop 2.7 jar files.

Oh, I just solved this problem too. I swapped the Spark version running in the pod. It seems we are working on similar tasks; if you run into any problems, feel free to reach out.

hyungryuk commented 2 years ago

@Zhang-Aoqi Good to know! Thanks. So did you change your Spark version?

Zhang-Aoqi commented 2 years ago

@hyungryuk I used spark-3.0.0-bin-hadoop3.2 as the image in the pod, and added aws-java-sdk-bundle-1.11.375.jar and hadoop-aws-3.2.0.jar. I'm a newbie, so I hope that makes sense.

hyungryuk commented 2 years ago

@Zhang-Aoqi Alright! If this error happens again, try this image to build the spark-operator: gcr.io/spark-operator/spark:v3.1.1-hadoop3

A kind of similar issue here: https://github.com/GoogleCloudPlatform/spark-on-k8s-operator/issues/1334
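
For completeness, a minimal Dockerfile sketch for an application image built on that base could look like the following. The jar versions are assumptions chosen to match the Hadoop 3.2.0 jars that tag ships with (check ls /opt/spark/jars | grep hadoop first), and UID 185 is the non-root user the upstream Spark images conventionally use:

FROM gcr.io/spark-operator/spark:v3.1.1-hadoop3
USER root
# S3A connector matching the bundled Hadoop version, plus the AWS SDK bundle it was built against
ADD https://repo1.maven.org/maven2/org/apache/hadoop/hadoop-aws/3.2.0/hadoop-aws-3.2.0.jar /opt/spark/jars/
ADD https://repo1.maven.org/maven2/com/amazonaws/aws-java-sdk-bundle/1.11.375/aws-java-sdk-bundle-1.11.375.jar /opt/spark/jars/
# ADD from a URL leaves root-owned files with 600 permissions, so make the jars world-readable
RUN chmod 644 /opt/spark/jars/hadoop-aws-3.2.0.jar /opt/spark/jars/aws-java-sdk-bundle-1.11.375.jar
USER 185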

farshadsm commented 2 years ago

@hyungryuk Thanks for providing your comments. I'll check with our engineer who set up the Kubernetes cluster and installed the spark-operator pod, and I'll let everyone in this thread know the result. I hope your solution resolves my issue.

farshadsm commented 2 years ago

@Zhang-Aoqi Thanks for your comments. I'll work on the resolutions suggested by @hyungryuk and will let you know the outcome.

farshadsm commented 2 years ago

@hyungryuk I've noticed that we used the "spark-operator-chart-1.1.19" helm chart to install spark-operator on the k8s cluster. How can I find the version of the hadoop jar files that come with the operator installed by this helm chart?

hyungryuk commented 2 years ago

@farshadsm Just exec into your spark-operator pod and run the following command :)

ls /opt/spark/jars | grep hadoop
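
If you prefer not to open a shell, the same check can be run in one go; the pod name and namespace below are placeholders:

kubectl -n <operator-namespace> exec <spark-operator-pod> -- sh -c 'ls /opt/spark/jars | grep hadoop'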

Zhang-Aoqi commented 2 years ago

@Zhang-Aoqi Alright! If this error happens again, try this image to build the spark-operator: gcr.io/spark-operator/spark:v3.1.1-hadoop3

A kind of similar issue here: #1334

Ok, thank you

allenhaozi commented 2 years ago

I met the same error.

My solution is:

I built our own Spark image based on spark-3.2.1-bin-hadoop3.2.tgz,

then added these jars under $SPARK_HOME/jars/.

At present it is still in the testing stage, but it works; no problems have been found so far.

Hope it helps
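
A rough sketch of that workflow, using the docker-image-tool.sh script that ships in the Spark distribution (the jar versions, registry, and tag below are assumptions; match the versions to the hadoop-* jars already present under jars/):

tar -xzf spark-3.2.1-bin-hadoop3.2.tgz
cd spark-3.2.1-bin-hadoop3.2
# drop the S3A connector and the AWS SDK bundle into jars/ so they get baked into the image
curl -fSLO https://repo1.maven.org/maven2/org/apache/hadoop/hadoop-aws/3.3.1/hadoop-aws-3.3.1.jar
curl -fSLO https://repo1.maven.org/maven2/com/amazonaws/aws-java-sdk-bundle/1.11.901/aws-java-sdk-bundle-1.11.901.jar
mv hadoop-aws-3.3.1.jar aws-java-sdk-bundle-1.11.901.jar jars/
# build and push a PySpark image that includes everything under jars/
./bin/docker-image-tool.sh -r registry.example.com/spark -t v3.2.1-aws \
  -p kubernetes/dockerfiles/spark/bindings/python/Dockerfile build
./bin/docker-image-tool.sh -r registry.example.com/spark -t v3.2.1-aws push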

upMKuhn commented 2 years ago

Hi, would you please mind sharing the Docker image? I tried copying the jars as suggested, but I still encounter the issue :(

upMKuhn commented 2 years ago

I created a public image :) registry.gitlab.com/upmkuhn/spark-operator:v3-2-hadoop3-aws-2

allenhaozi commented 2 years ago

@upMKuhn allenhaozi/base-pyspark-3.2.1-py-v3.8:v0.1.0

Xxxxxyd commented 2 years ago

@upMKuhn allenhaozi/base-pyspark-3.2.1-py-v3.8:v0.1.0

It works in my tests. I also built an image myself, but it is quite heavy. Could you share your Dockerfile for reference? Thanks a lot.

Wh1isper commented 1 year ago

@Xxxxxyd https://github.com/Wh1isper/spark-build Here are my Dockerfile and docs. Try: wh1isper/spark-executor:3.4.1

I am developing sparglim as a tool to configure PySpark quickly and easily for PySpark-based apps. See: https://github.com/Wh1isper/sparglim

aimendenche commented 1 year ago

@Zhang-Aoqi Good to know! Thanks. So did you change your Spark version?

I'm having the same error but couldn't resolve it. Could you help, please?

ViktorGlushak commented 1 year ago

I had the same problem (org.apache.hadoop.fs.s3a.S3AFileSystem not found). It appeared when I tried:

      deps:
        files:
          - "s3a://k8s-3c172e28d7da2e-bucket/test.jar"

Even adding the jar files inside the application image ("image-registry/spark-base-image") did not work. But I fixed the problem when I added the necessary jars inside the SPARK-OPERATOR pod. You can rebuild your Docker image by adding the jars. I rebuilt it like this:

FROM ghcr.io/googlecloudplatform/spark-operator:v1beta2-1.3.7-3.1.1

ENV SPARK_HOME /opt/spark

RUN curl https://repo1.maven.org/maven2/org/apache/hadoop/hadoop-aws/2.7.4/hadoop-aws-2.7.4.jar -o ${SPARK_HOME}/jars/hadoop-aws-2.7.4.jar
RUN curl https://repo1.maven.org/maven2/com/amazonaws/aws-java-sdk/1.7.4/aws-java-sdk-1.7.4.jar -o ${SPARK_HOME}/jars/aws-java-sdk-1.7.4.jar

The spark-operator image has Hadoop 2.7 inside, so we need to use dependencies built for exactly that version (see https://mvnrepository.com/).

First, for testing, I went inside the spark-operator pod with this command:

kubectl exec -it spark-operator-fb8f779cb-gt657 -n spark-operator -- bash

Then, inside the spark-operator pod, I went to /opt/spark/jars and downloaded the jars there (for example, curl -O https://repo1.maven.org/maven2/org/apache/hadoop/hadoop-aws/2.7.4/hadoop-aws-2.7.4.jar).

Then I applied my manifest with deps.files and it worked.
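
To make the fix permanent rather than patching a running pod, the rebuilt operator image has to be pushed and the Helm release pointed at it. A sketch, assuming the chart exposes the usual image.repository/image.tag values (registry, tag, chart reference, and release name below are placeholders):

docker build -t registry.example.com/spark-operator:v1beta2-1.3.7-3.1.1-aws .
docker push registry.example.com/spark-operator:v1beta2-1.3.7-3.1.1-aws
helm upgrade spark-operator spark-operator/spark-operator -n spark-operator \
  --set image.repository=registry.example.com/spark-operator \
  --set image.tag=v1beta2-1.3.7-3.1.1-aws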

vikas-saxena02 commented 1 month ago

@ViktorGlushak is correct: you will need hadoop-aws and either hadoop-common or hadoop-client included as a dependency in your project to access the S3A Magic committer.
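
For a Maven-based project, that would look roughly like the snippet below; the versions are illustrative and should match the Hadoop version used by your Spark build and cluster:

<dependency>
  <groupId>org.apache.hadoop</groupId>
  <artifactId>hadoop-aws</artifactId>
  <version>3.3.4</version>
</dependency>
<dependency>
  <groupId>org.apache.hadoop</groupId>
  <artifactId>hadoop-common</artifactId>
  <version>3.3.4</version>
</dependency>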