kubeflow / spark-operator

Kubernetes operator for managing the lifecycle of Apache Spark applications on Kubernetes.

java.lang.ClassNotFoundException: org.apache.spark.sql.delta.files.DelayedCommitProtocol #1533

Closed · RodrigoBorges93 closed this 1 month ago

RodrigoBorges93 commented 2 years ago

Hi there!

I'm trying to save a Delta file from a CSV in PySpark. I have added the following packages:

Spark operator Image: gcr.io/spark-operator/spark-py:v3.1.1-hadoop3

I'm able to save the file as Parquet, but when I try to save it as Delta, the following error occurs:

```
WARN TaskSetManager: Lost task 0.0 in stage 5.0 (TID 7) (10.244.3.46 executor 1): java.lang.ClassNotFoundException: org.apache.spark.sql.delta.files.DelayedCommitProtocol
    at java.base/java.net.URLClassLoader.findClass(Unknown Source)
    at java.base/java.lang.ClassLoader.loadClass(Unknown Source)
    at java.base/java.lang.ClassLoader.loadClass(Unknown Source)
    at java.base/java.lang.Class.forName0(Native Method)
    at java.base/java.lang.Class.forName(Unknown Source)
    at org.apache.spark.serializer.JavaDeserializationStream$$anon$1.resolveClass(JavaSerializer.scala:68)
    at java.base/java.io.ObjectInputStream.readNonProxyDesc(Unknown Source)
    at java.base/java.io.ObjectInputStream.readClassDesc(Unknown Source)
    at java.base/java.io.ObjectInputStream.readOrdinaryObject(Unknown Source)
    at java.base/java.io.ObjectInputStream.readObject0(Unknown Source)
    at java.base/java.io.ObjectInputStream.readArray(Unknown Source)
    at java.base/java.io.ObjectInputStream.readObject0(Unknown Source)
    at java.base/java.io.ObjectInputStream.defaultReadFields(Unknown Source)
    at java.base/java.io.ObjectInputStream.readSerialData(Unknown Source)
    at java.base/java.io.ObjectInputStream.readOrdinaryObject(Unknown Source)
    at java.base/java.io.ObjectInputStream.readObject0(Unknown Source)
    at java.base/java.io.ObjectInputStream.defaultReadFields(Unknown Source)
    at java.base/java.io.ObjectInputStream.readSerialData(Unknown Source)
    at java.base/java.io.ObjectInputStream.readOrdinaryObject(Unknown Source)
    at java.base/java.io.ObjectInputStream.readObject0(Unknown Source)
    at java.base/java.io.ObjectInputStream.readObject(Unknown Source)
    at java.base/java.io.ObjectInputStream.readObject(Unknown Source)
    at org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:76)
    at org.apache.spark.serializer.JavaSerializerInstance.deserialize(JavaSerializer.scala:115)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:83)
    at org.apache.spark.scheduler.Task.run(Task.scala:131)
    at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:497)
    at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1439)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:500)
    at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
    at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
    at java.base/java.lang.Thread.run(Unknown Source)
```

Does anyone know how to fix this? Is this a package version problem?
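For reference, the write path being attempted looks roughly like the sketch below. This is a minimal illustration, not the exact job: the input and output paths are placeholders, and it assumes the Delta Lake jars are already available on both the driver and the executors (executors missing them is exactly what produces the `ClassNotFoundException` above).

```python
from pyspark.sql import SparkSession

# Minimal sketch of the CSV -> Delta write path. Paths are placeholders.
# Assumes the Delta Lake jars are on BOTH the driver and executor classpaths;
# the extensions/catalog settings are the standard Delta Lake session config.
spark = (
    SparkSession.builder
    .appName("csv-to-delta")
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    .getOrCreate()
)

df = spark.read.option("header", "true").csv("/data/input.csv")

# Writing Parquet works; the failure above happens only on the Delta writer.
df.write.format("delta").mode("overwrite").save("/data/output-delta")
```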

allenhaozi commented 2 years ago

Hi @RodrigoBorges93, try adding the following jars together and see if they work:
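(The jar list from this comment did not survive the copy.) As a general illustration only, and not necessarily the jars referred to above, dependencies can also be pulled at submit time as Maven coordinates instead of being baked into the image; the Delta version below is a placeholder and has to match the Spark version in use.

```python
from pyspark.sql import SparkSession

# Illustration only: resolve Delta as a Maven coordinate at startup.
# spark.jars.packages downloads the artifact on the driver and ships it to the
# executors. The version shown is a placeholder; it must match your Spark
# minor version.
spark = (
    SparkSession.builder
    .config("spark.jars.packages", "io.delta:delta-core_2.12:1.0.1")
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    .getOrCreate()
)
```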

RodrigoBorges93 commented 2 years ago

Hi @allenhaozi, thanks for your help. I added the jars you mentioned, and now we get this error:

```
py4j.protocol.Py4JJavaError: An error occurred while calling o70.sessionState.
: java.lang.IncompatibleClassChangeError: class org.apache.spark.sql.catalyst.TimeTravel can not implement org.apache.spark.sql.catalyst.plans.logical.LeafNode, because it is not an interface (org.apache.spark.sql.catalyst.plans.logical.LeafNode is in unnamed module of loader 'app')
    at java.base/java.lang.ClassLoader.defineClass1(Native Method)
    at java.base/java.lang.ClassLoader.defineClass(Unknown Source)
    at java.base/java.security.SecureClassLoader.defineClass(Unknown Source)
    at java.base/java.net.URLClassLoader.defineClass(Unknown Source)
    at java.base/java.net.URLClassLoader$1.run(Unknown Source)
    at java.base/java.net.URLClassLoader$1.run(Unknown Source)
    at java.base/java.security.AccessController.doPrivileged(Native Method)
    at java.base/java.net.URLClassLoader.findClass(Unknown Source)
    at java.base/java.lang.ClassLoader.loadClass(Unknown Source)
    at java.base/java.lang.ClassLoader.loadClass(Unknown Source)
    at io.delta.sql.parser.DeltaSqlParser.<init>(DeltaSqlParser.scala:71)
    at io.delta.sql.DeltaSparkSessionExtension.$anonfun$apply$1(DeltaSparkSessionExtension.scala:78)
    at org.apache.spark.sql.SparkSessionExtensions.$anonfun$buildParser$1(SparkSessionExtensions.scala:239)
    at scala.collection.IndexedSeqOptimized.foldLeft(IndexedSeqOptimized.scala:60)
    at scala.collection.IndexedSeqOptimized.foldLeft$(IndexedSeqOptimized.scala:68)
    at scala.collection.mutable.ArrayBuffer.foldLeft(ArrayBuffer.scala:49)
    at org.apache.spark.sql.SparkSessionExtensions.buildParser(SparkSessionExtensions.scala:238)
    at org.apache.spark.sql.internal.BaseSessionStateBuilder.sqlParser$lzycompute(BaseSessionStateBuilder.scala:124)
    at org.apache.spark.sql.internal.BaseSessionStateBuilder.sqlParser(BaseSessionStateBuilder.scala:123)
    at org.apache.spark.sql.internal.BaseSessionStateBuilder.build(BaseSessionStateBuilder.scala:341)
    at org.apache.spark.sql.SparkSession$.org$apache$spark$sql$SparkSession$$instantiateSessionState(SparkSession.scala:1142)
    at org.apache.spark.sql.SparkSession.$anonfun$sessionState$2(SparkSession.scala:156)
    at scala.Option.getOrElse(Option.scala:189)
    at org.apache.spark.sql.SparkSession.sessionState$lzycompute(SparkSession.scala:152)
    at org.apache.spark.sql.SparkSession.sessionState(SparkSession.scala:149)
    at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(Unknown Source)
    at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source)
    at java.base/java.lang.reflect.Method.invoke(Unknown Source)
    at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
    at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
    at py4j.Gateway.invoke(Gateway.java:282)
    at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
    at py4j.commands.CallCommand.execute(CallCommand.java:79)
    at py4j.GatewayConnection.run(GatewayConnection.java:238)
    at java.base/java.lang.Thread.run(Unknown Source)
```

Do you know if we have to do something else?
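An `IncompatibleClassChangeError` like this is the usual symptom of a Delta/Spark binary mismatch, e.g. a delta-core build targeting Spark 3.2 loaded into a Spark 3.1 runtime. Before swapping jars again, it can help to print what the running session actually has. A small diagnostic sketch (it goes through the private `_jvm` handle, so treat it as a debugging aid only):

```python
from pyspark.sql import SparkSession

# Diagnostic sketch: confirm which Spark version, Scala build, and package
# coordinates the running session actually picked up.
spark = SparkSession.builder.getOrCreate()
print("Spark version:", spark.version)
print("Scala build:  ", spark.sparkContext._jvm.scala.util.Properties.versionString())
print("Packages:     ", spark.conf.get("spark.jars.packages", "<not set>"))
```

Pinning a delta-core release that targets the running Spark minor version (for the Spark 3.1.x line that is the Delta 1.0.x series, per the Delta Lake release notes) is generally what resolves this class of error.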

RodrigoBorges93 commented 2 years ago

We are also getting Bintray server errors:

```
:: problems summary ::
:::: ERRORS
    SERVER ERROR: Bad Gateway url=https://dl.bintray.com/spark-packages/maven/org/apache/hadoop/hadoop-main/3.0.0/hadoop-main-3.0.0.jar

    SERVER ERROR: Bad Gateway url=https://dl.bintray.com/spark-packages/maven/org/apache/hadoop/hadoop-project/3.0.0/hadoop-project-3.0.0.jar

    SERVER ERROR: Bad Gateway url=https://dl.bintray.com/spark-packages/maven/org/apache/hadoop/hadoop-azure/3.0.0/hadoop-azure-3.0.0-sources.jar

    SERVER ERROR: Bad Gateway url=https://dl.bintray.com/spark-packages/maven/org/apache/hadoop/hadoop-azure/3.0.0/hadoop-azure-3.0.0-src.jar

    SERVER ERROR: Bad Gateway url=https://dl.bintray.com/spark-packages/maven/org/apache/hadoop/hadoop-azure/3.0.0/hadoop-azure-3.0.0-javadoc.jar

    SERVER ERROR: Bad Gateway url=https://dl.bintray.com/spark-packages/maven/org/apache/hadoop/hadoop-project-dist/3.0.0/hadoop-project-dist-3.0.0.jar

    SERVER ERROR: Bad Gateway url=https://dl.bintray.com/spark-packages/maven/org/apache/hadoop/hadoop-common/3.0.0/hadoop-common-3.0.0-sources.jar

    SERVER ERROR: Bad Gateway url=https://dl.bintray.com/spark-packages/maven/org/apache/hadoop/hadoop-common/3.0.0/hadoop-common-3.0.0-src.jar

    SERVER ERROR: Bad Gateway url=https://dl.bintray.com/spark-packages/maven/org/apache/hadoop/hadoop-common/3.0.0/hadoop-common-3.0.0-javadoc.jar

    SERVER ERROR: Bad Gateway url=https://dl.bintray.com/spark-packages/maven/org/apache/hadoop/hadoop-annotations/3.0.0/hadoop-annotations-3.0.0-sources.jar

    SERVER ERROR: Bad Gateway url=https://dl.bintray.com/spark-packages/maven/org/apache/hadoop/hadoop-annotations/3.0.0/hadoop-annotations-3.0.0-src.jar

    SERVER ERROR: Bad Gateway url=https://dl.bintray.com/spark-packages/maven/org/apache/hadoop/hadoop-annotations/3.0.0/hadoop-annotations-3.0.0-javadoc.jar

    SERVER ERROR: Bad Gateway url=https://dl.bintray.com/spark-packages/maven/org/sonatype/oss/oss-parent/7/oss-parent-7.jar

    SERVER ERROR: Bad Gateway url=https://dl.bintray.com/spark-packages/maven/com/google/guava/guava-parent/11.0.2/guava-parent-11.0.2.jar

    SERVER ERROR: Bad Gateway url=https://dl.bintray.com/spark-packages/maven/org/apache/apache/4/apache-4.jar

    SERVER ERROR: Bad Gateway url=https://dl.bintray.com/spark-packages/maven/org/apache/commons/commons-parent/11/commons-parent-11.jar

    SERVER ERROR: Bad Gateway url=https://dl.bintray.com/spark-packages/maven/org/apache/apache/9/apache-9.jar

    SERVER ERROR: Bad Gateway url=https://dl.bintray.com/spark-packages/maven/org/apache/commons/commons-parent/24/commons-parent-24.jar

    SERVER ERROR: Bad Gateway url=https://dl.bintray.com/spark-packages/maven/org/apache/apache/13/apache-13.jar

    SERVER ERROR: Bad Gateway url=https://dl.bintray.com/spark-packages/maven/org/apache/httpcomponents/project/7/project-7.jar

    SERVER ERROR: Bad Gateway url=https://dl.bintray.com/spark-packages/maven/org/apache/httpcomponents/httpcomponents-client/4.5.2/httpcomponents-client-4.5.2.jar

    SERVER ERROR: Bad Gateway url=https://dl.bintray.com/spark-packages/maven/org/apache/httpcomponents/httpcomponents-core/4.4.4/httpcomponents-core-4.4.4.jar

    SERVER ERROR: Bad Gateway url=https://dl.bintray.com/spark-packages/maven/org/apache/commons/commons-parent/28/commons-parent-28.jar

    SERVER ERROR: Bad Gateway url=https://dl.bintray.com/spark-packages/maven/org/apache/commons/commons-parent/25/commons-parent-25.jar

    SERVER ERROR: Bad Gateway url=https://dl.bintray.com/spark-packages/maven/org/apache/commons/commons-parent/23/commons-parent-23.jar

    SERVER ERROR: Bad Gateway url=https://dl.bintray.com/spark-packages/maven/org/apache/apache/16/apache-16.jar

    SERVER ERROR: Bad Gateway url=https://dl.bintray.com/spark-packages/maven/org/apache/commons/commons-parent/39/commons-parent-39.jar

    SERVER ERROR: Bad Gateway url=https://dl.bintray.com/spark-packages/maven/net/java/jvnet-parent/3/jvnet-parent-3.jar

    SERVER ERROR: Bad Gateway url=https://dl.bintray.com/spark-packages/maven/org/eclipse/jetty/jetty-parent/25/jetty-parent-25.jar

    SERVER ERROR: Bad Gateway url=https://dl.bintray.com/spark-packages/maven/org/eclipse/jetty/jetty-project/9.3.19.v20170502/jetty-project-9.3.19.v20170502.jar

    SERVER ERROR: Bad Gateway url=https://dl.bintray.com/spark-packages/maven/net/java/jvnet-parent/4/jvnet-parent-4.jar

    SERVER ERROR: Bad Gateway url=https://dl.bintray.com/spark-packages/maven/com/sun/jersey/jersey-project/1.19/jersey-project-1.19.jar

    SERVER ERROR: Bad Gateway url=https://dl.bintray.com/spark-packages/maven/com/sun/xml/bind/jaxb-impl/2.2.3-1/jaxb-impl-2.2.3-1-javadoc.jar

    SERVER ERROR: Bad Gateway url=https://dl.bintray.com/spark-packages/maven/org/apache/apache/7/apache-7.jar

    SERVER ERROR: Bad Gateway url=https://dl.bintray.com/spark-packages/maven/org/apache/commons/commons-parent/17/commons-parent-17.jar

    SERVER ERROR: Bad Gateway url=https://dl.bintray.com/spark-packages/maven/org/apache/apache/18/apache-18.jar

    SERVER ERROR: Bad Gateway url=https://dl.bintray.com/spark-packages/maven/org/apache/commons/commons-parent/41/commons-parent-41.jar

    SERVER ERROR: Bad Gateway url=https://dl.bintray.com/spark-packages/maven/org/apache/commons/commons-parent/37/commons-parent-37.jar

    SERVER ERROR: Bad Gateway url=https://dl.bintray.com/spark-packages/maven/org/slf4j/slf4j-parent/1.7.25/slf4j-parent-1.7.25.jar

    SERVER ERROR: Bad Gateway url=https://dl.bintray.com/spark-packages/maven/org/apache/apache/10/apache-10.jar

    SERVER ERROR: Bad Gateway url=https://dl.bintray.com/spark-packages/maven/org/apache/avro/avro-toplevel/1.7.7/avro-toplevel-1.7.7.jar

    SERVER ERROR: Bad Gateway url=https://dl.bintray.com/spark-packages/maven/org/apache/avro/avro-parent/1.7.7/avro-parent-1.7.7.jar

    SERVER ERROR: Bad Gateway url=https://dl.bintray.com/spark-packages/maven/org/codehaus/codehaus-parent/1/codehaus-parent-1.jar

    SERVER ERROR: Bad Gateway url=https://dl.bintray.com/spark-packages/maven/com/thoughtworks/paranamer/paranamer-parent/2.3/paranamer-parent-2.3.jar

    SERVER ERROR: Bad Gateway url=https://dl.bintray.com/spark-packages/maven/com/google/google/1/google-1.jar

    SERVER ERROR: Bad Gateway url=https://dl.bintray.com/spark-packages/maven/org/apache/hadoop/hadoop-auth/3.0.0/hadoop-auth-3.0.0-javadoc.jar

    SERVER ERROR: Bad Gateway url=https://dl.bintray.com/spark-packages/maven/net/minidev/minidev-parent/2.3/minidev-parent-2.3.jar

    SERVER ERROR: Bad Gateway url=https://dl.bintray.com/spark-packages/maven/org/ow2/ow2/1.3/ow2-1.3.jar

    SERVER ERROR: Bad Gateway url=https://dl.bintray.com/spark-packages/maven/org/ow2/asm/asm-parent/5.0.4/asm-parent-5.0.4.jar

    SERVER ERROR: Bad Gateway url=https://dl.bintray.com/spark-packages/maven/jline/jline/0.9.94/jline-0.9.94-javadoc.jar

    SERVER ERROR: Bad Gateway url=https://dl.bintray.com/spark-packages/maven/org/sonatype/oss/oss-parent/9/oss-parent-9.jar

    SERVER ERROR: Bad Gateway url=https://dl.bintray.com/spark-packages/maven/org/apache/curator/apache-curator/2.12.0/apache-curator-2.12.0.jar

    SERVER ERROR: Bad Gateway url=https://dl.bintray.com/spark-packages/maven/org/apache/curator/curator-framework/2.12.0/curator-framework-2.12.0-javadoc.jar

    SERVER ERROR: Bad Gateway url=https://dl.bintray.com/spark-packages/maven/org/apache/curator/curator-client/2.12.0/curator-client-2.12.0-javadoc.jar

    SERVER ERROR: Bad Gateway url=https://dl.bintray.com/spark-packages/maven/org/apache/kerby/kerby-all/1.0.1/kerby-all-1.0.1.jar

    SERVER ERROR: Bad Gateway url=https://dl.bintray.com/spark-packages/maven/org/apache/kerby/kerby-kerb/1.0.1/kerby-kerb-1.0.1.jar

    SERVER ERROR: Bad Gateway url=https://dl.bintray.com/spark-packages/maven/org/apache/kerby/kerby-common/1.0.1/kerby-common-1.0.1.jar

    SERVER ERROR: Bad Gateway url=https://dl.bintray.com/spark-packages/maven/org/apache/kerby/kerby-provider/1.0.1/kerby-provider-1.0.1.jar

    SERVER ERROR: Bad Gateway url=https://dl.bintray.com/spark-packages/maven/org/sonatype/oss/oss-parent/6/oss-parent-6.jar

    SERVER ERROR: Bad Gateway url=https://dl.bintray.com/spark-packages/maven/org/apache/curator/curator-recipes/2.12.0/curator-recipes-2.12.0-javadoc.jar

    SERVER ERROR: Bad Gateway url=https://dl.bintray.com/spark-packages/maven/org/apache/apache/17/apache-17.jar

    SERVER ERROR: Bad Gateway url=https://dl.bintray.com/spark-packages/maven/org/apache/htrace/htrace/4.1.0-incubating/htrace-4.1.0-incubating.jar

    SERVER ERROR: Bad Gateway url=https://dl.bintray.com/spark-packages/maven/com/fasterxml/oss-parent/25/oss-parent-25.jar

    SERVER ERROR: Bad Gateway url=https://dl.bintray.com/spark-packages/maven/com/fasterxml/jackson/jackson-parent/2.7/jackson-parent-2.7.jar

    SERVER ERROR: Bad Gateway url=https://dl.bintray.com/spark-packages/maven/com/fasterxml/oss-parent/24/oss-parent-24.jar

    SERVER ERROR: Bad Gateway url=https://dl.bintray.com/spark-packages/maven/javax/servlet/jsp/jsp-api/2.1/jsp-api-2.1-javadoc.jar

    SERVER ERROR: Bad Gateway url=https://dl.bintray.com/spark-packages/maven/com/microsoft/azure/azure-bom/0.8.0/azure-bom-0.8.0.jar

    SERVER ERROR: Bad Gateway url=https://dl.bintray.com/spark-packages/maven/com/microsoft/azure/azure/0.8.0/azure-0.8.0.jar

    SERVER ERROR: Bad Gateway url=https://dl.bintray.com/spark-packages/maven/org/antlr/antlr4-master/4.8/antlr4-master-4.8.jar

    SERVER ERROR: Bad Gateway url=https://dl.bintray.com/spark-packages/maven/org/antlr/antlr-master/3.5.2/antlr-master-3.5.2.jar

    SERVER ERROR: Bad Gateway url=https://dl.bintray.com/spark-packages/maven/org/glassfish/json/1.0.4/json-1.0.4.jar

```
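Bad Gateway responses from dl.bintray.com are expected at this point: Bintray was shut down in 2021 and the spark-packages repository moved to repos.spark-packages.org, so any resolver still pointed at the old host will fail regardless of the Delta issue. A hedged sketch of redirecting package resolution (the repository URLs are the publicly documented ones; swap in an internal mirror if you have one, and the Delta coordinate is a placeholder):

```python
from pyspark.sql import SparkSession

# Sketch: resolve --packages / spark.jars.packages against the current
# spark-packages mirror and Maven Central instead of the retired Bintray
# endpoint.
spark = (
    SparkSession.builder
    .config("spark.jars.repositories",
            "https://repos.spark-packages.org,https://repo1.maven.org/maven2")
    .config("spark.jars.packages", "io.delta:delta-core_2.12:1.0.1")
    .getOrCreate()
)
```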

allenhaozi commented 2 years ago

It looks like a version compatibility issue.

I built our own Spark image based on spark-3.2.1-bin-hadoop3.2.tgz, with the following jars added:
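(The jar list did not survive the copy here either.) If you go the custom-image route, a quick smoke test inside the image confirms the Delta jars actually made it onto the classpath before running the real job on the cluster. A minimal sketch, with the output path as a placeholder:

```python
from pyspark.sql import SparkSession

# Smoke test for a custom image: run locally inside the container, then write
# and read back a tiny Delta table. If the Delta classes are missing, this
# fails immediately with the same ClassNotFoundException as above.
spark = (
    SparkSession.builder
    .master("local[*]")
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    .getOrCreate()
)

spark.range(5).write.format("delta").mode("overwrite").save("/tmp/delta-smoke-test")
spark.read.format("delta").load("/tmp/delta-smoke-test").show()
```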

RodrigoBorges93 commented 2 years ago

I'll try to build this image as well.

Thanks!

praveenvanam1 commented 2 years ago

@RodrigoBorges93, did it work for you? I am also seeing the same issue.

I have Spark version 3.1.3 and Scala version 2.12.10.

allenhaozi commented 2 years ago

I'll put one on Docker Hub; you can try it if you want: allenhaozi/deltalake-1.2.1-py-3.8:v0.1.0

github-actions[bot] commented 2 months ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

github-actions[bot] commented 1 month ago

This issue has been automatically closed because it has not had recent activity. Please comment "/reopen" to reopen it.