nightscape / spark-excel

A Spark plugin for reading and writing Excel files
Apache License 2.0

Dependency issues with Spark's built-in commons-compress #93

Closed: jwooden1 closed this issue 5 years ago

jwooden1 commented 6 years ago

I can use the library when I run Spark on my local Windows machine and read Excel files on the same machine. However, when I upload the files to WASB on Azure and use an HDInsight cluster for running Spark jobs (in either local or cluster mode), I get the following error:

java.lang.IllegalArgumentException: InputStream of class class org.apache.commons.compress.archivers.zip.ZipArchiveInputStream is not implementing InputStreamStatistics.
  at org.apache.poi.openxml4j.util.ZipArchiveThresholdInputStream.<init>(ZipArchiveThresholdInputStream.java:63)
  at org.apache.poi.openxml4j.opc.internal.ZipHelper.openZipStream(ZipHelper.java:180)
  at org.apache.poi.openxml4j.opc.ZipPackage.<init>(ZipPackage.java:104)
  at org.apache.poi.openxml4j.opc.OPCPackage.open(OPCPackage.java:298)
  at org.apache.poi.xssf.usermodel.XSSFWorkbookFactory.createWorkbook(XSSFWorkbookFactory.java:129)
  at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
  at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
  at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
  at java.lang.reflect.Method.invoke(Method.java:498)
  at org.apache.poi.ss.usermodel.WorkbookFactory.createWorkbook(WorkbookFactory.java:314)
  at org.apache.poi.ss.usermodel.WorkbookFactory.createXSSFWorkbook(WorkbookFactory.java:296)
  at org.apache.poi.ss.usermodel.WorkbookFactory.create(WorkbookFactory.java:214)
  at org.apache.poi.ss.usermodel.WorkbookFactory.create(WorkbookFactory.java:180)
  at com.crealytics.spark.excel.ExcelRelation$$anonfun$openWorkbook$2$$anonfun$apply$4.apply(ExcelRelation.scala:66)
  at com.crealytics.spark.excel.ExcelRelation$$anonfun$openWorkbook$2$$anonfun$apply$4.apply(ExcelRelation.scala:66)
  at scala.Option.fold(Option.scala:158)
  at com.crealytics.spark.excel.ExcelRelation$$anonfun$openWorkbook$2.apply(ExcelRelation.scala:66)
  at com.crealytics.spark.excel.ExcelRelation$$anonfun$openWorkbook$2.apply(ExcelRelation.scala:66)
  at scala.Option.getOrElse(Option.scala:121)
  at com.crealytics.spark.excel.ExcelRelation.openWorkbook(ExcelRelation.scala:64)
  at com.crealytics.spark.excel.ExcelRelation.excerpt$lzycompute(ExcelRelation.scala:71)
  at com.crealytics.spark.excel.ExcelRelation.excerpt(ExcelRelation.scala:70)
  at com.crealytics.spark.excel.ExcelRelation$$anonfun$inferSchema$1.apply(ExcelRelation.scala:264)
  at com.crealytics.spark.excel.ExcelRelation$$anonfun$inferSchema$1.apply(ExcelRelation.scala:263)
  at scala.Option.getOrElse(Option.scala:121)
  at com.crealytics.spark.excel.ExcelRelation.inferSchema(ExcelRelation.scala:263)
  at com.crealytics.spark.excel.ExcelRelation.<init>(ExcelRelation.scala:91)
  at com.crealytics.spark.excel.DefaultSource.createRelation(DefaultSource.scala:39)
  at com.crealytics.spark.excel.DefaultSource.createRelation(DefaultSource.scala:14)
  at com.crealytics.spark.excel.DefaultSource.createRelation(DefaultSource.scala:8)
  at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:309)
  at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:178)
  at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:156)
  ... 53 elided

nightscape commented 6 years ago

I had the same problem a few days ago, but haven't found a proper solution. The problem is that Spark comes bundled with a rather outdated version of commons-compress, and POI needs a newer one. In principle it should be possible to override the JARs bundled with Spark with user-provided ones, but I haven't yet managed to do so successfully. In case you find a solution, please post it here 👍 In the meantime, you could try older versions of spark-excel; maybe the pre-0.10 versions work with the older version of commons-compress.
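For anyone who wants to experiment with the override route: Spark has (experimental) userClassPathFirst settings that make user-supplied JARs take precedence over Spark's bundled ones. A rough sketch of what that attempt could look like, with placeholder paths and a placeholder commons-compress version (which version POI actually needs depends on your POI release), and with the caveat that this is exactly the part nobody in this thread had gotten working at this point:

    spark-submit \
      --jars /path/to/commons-compress-1.18.jar \
      --conf spark.driver.userClassPathFirst=true \
      --conf spark.executor.userClassPathFirst=true \
      your-app.jar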

jornfranke commented 6 years ago

I had the same issue (not with spark-excel, but with another piece of software). You need to shade the dependency on commons-compress so that your Spark application uses the new version of commons-compress. You can do this in Java with the Maven Shade plugin, or in Scala with SBT's assembly plugin (https://github.com/sbt/sbt-assembly). Then you can define a rule in your build.sbt to shade commons-compress (https://github.com/sbt/sbt-assembly#shading).

If you want to use it from R and Python, then maybe @nightscape needs to shade it directly in the spark-excel module that is published on Maven.

The other way ("override the JARs bundled with Spark") is not possible in this case, because commons-compress is a core part of Spark. However, shading it is not so bad here. I also recommend creating a JIRA issue with the Spark project to update commons-compress (the old version is vulnerable to several attacks).
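A minimal sketch of such a rule in build.sbt, following the sbt-assembly shading docs linked above (the shadeio prefix is an arbitrary choice; it happens to be the one spark-excel itself later used):

    assemblyShadeRules in assembly := Seq(
      // Rewrite commons-compress classes into a private package inside the fat JAR
      ShadeRule.rename("org.apache.commons.compress.**" -> "shadeio.commons.compress.@1").inAll
    )

With this, the fat JAR carries its own renamed copy of commons-compress, so it no longer collides with the outdated version Spark puts on the classpath.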

nightscape commented 6 years ago

I just released 0.10.1 and 0.11.0-beta2 which shade commons-compress and should hopefully fix this problem. Can you give it a try and tell me if it worked?

hbenzineb commented 6 years ago

Hi @nightscape, I'm using 0.11.0-beta2 and I still have the same error. When I add a dependency on commons-compress, I get this message:

diagnostics: User class threw exception: java.lang.IllegalArgumentException: InputStream of class class org.apache.commons.compress.archivers.zip.ZipArchiveInputStream is not implementing InputStreamStatistics.

When I don't use the dependency, I get this:

diagnostics: User class threw exception: java.lang.NoClassDefFoundError: org/apache/commons/compress/utils/InputStreamStatistics

As a reminder, I am trying to write the contents of several DataFrames to several sheets of the same Excel file.
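For reference, writing several DataFrames to sheets of one file is done in 0.11+ by giving each write its own dataAddress and using append mode. A sketch along the lines of the 0.11 README, with placeholder paths and sheet names:

    dfA.write
      .format("com.crealytics.spark.excel")
      .option("dataAddress", "'SheetA'!A1") // target sheet and top-left cell
      .option("useHeader", "true")
      .mode("append") // append adds a sheet to the existing file instead of overwriting it
      .save("/tmp/multi_sheet.xlsx")

    dfB.write
      .format("com.crealytics.spark.excel")
      .option("dataAddress", "'SheetB'!A1")
      .option("useHeader", "true")
      .mode("append")
      .save("/tmp/multi_sheet.xlsx")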

jornfranke commented 6 years ago

@nightscape I think you are not including commons-compress itself in the resulting JAR of the spark-excel module. In that case the shading rules will not apply. See fat JAR: https://github.com/sbt/sbt-assembly.

nightscape commented 6 years ago

Just trying another approach. Can someone check 0.11.0-beta3?

hbenzineb commented 6 years ago

@nightscape : it's OK :) Thanks

nightscape commented 6 years ago

Ok, then I'll backport this to 0.10 and release 0.11 from the beta version.

nightscape commented 6 years ago

Fixed in 0.10.2 and 0.11.0-beta3.

jwooden1 commented 6 years ago

The fix is working in 0.10.2, but not in 0.11.0-beta3, where I get this error:

scala.MatchError: Map(treatemptyvaluesasnulls -> false, path -> /unique.xlsx, useheader -> true, endcolumn -> 8, inferschema -> true, startcolumn -> 0, sheetname -> input) (of class org.apache.spark.sql.catalyst.util.CaseInsensitiveMap)
  at com.crealytics.spark.excel.DataLocator$.apply(DataLocator.scala:52)
  at com.crealytics.spark.excel.DefaultSource.createRelation(DefaultSource.scala:29)
  at com.crealytics.spark.excel.DefaultSource.createRelation(DefaultSource.scala:18)
  at com.crealytics.spark.excel.DefaultSource.createRelation(DefaultSource.scala:12)
  at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:309)
  at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:178)
  at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:156)
  ... 53 elided

Looking at the code, it looks to me like this is due to making dataAddress a mandatory field? What is it anyway? Also, I think it creates a side effect: if I pass null when reading, there is no error on read, but it does not read the specified sheet -- it looks like it just reads the first sheet.
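For context on dataAddress: from 0.11 on it replaces the separate sheetname/startcolumn/endcolumn options with a single Excel-style range. A hedged sketch of the 0.11 equivalent of the options shown in the MatchError above (range syntax as in the 0.11 README; the row bound is a placeholder, since the old options only constrained columns):

    val df = spark.read
      .format("com.crealytics.spark.excel")
      .option("dataAddress", "'input'!A1:I1000") // sheet "input"; columns A..I cover startColumn 0 .. endColumn 8
      .option("useHeader", "true")
      .option("inferSchema", "true")
      .option("treatEmptyValuesAsNulls", "false")
      .load("/unique.xlsx")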

abhishek-bhatt3 commented 5 years ago

> The fix is working in 0.10.2, but not in 0.11.0-beta3, where I get this error: scala.MatchError: Map(…) (of class org.apache.spark.sql.catalyst.util.CaseInsensitiveMap) […]

I am facing the same error in 0.11.0. Any update on this?

jagadeesh427 commented 5 years ago

Exception in thread "main" scala.MatchError: Map(treatemptyvaluesasnulls -> true, location -> hdfs://nameservice1/flatfiles/raw/500a_map_e.xlsx, useheader -> true, inferschema -> true, addcolorcolumns -> false, sheetname -> _500a_map_e) (of class org.apache.spark.sql.catalyst.util.CaseInsensitiveMap)

I am facing the above issue.

Dependencies used:

<dependencies>
  <dependency>
    <groupId>com.crealytics</groupId>
    <artifactId>spark-excel_2.10</artifactId>
    <version>0.8.3</version>
  </dependency>
</dependencies>

Can anyone help?

jagadeesh427 commented 5 years ago

Solved the issue: I used

--packages com.crealytics:spark-excel_2.11:0.10.2

and it worked fine.
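For completeness, the whole flow with that coordinate looks roughly like this (option names as in the 0.10.x README):

    spark-shell --packages com.crealytics:spark-excel_2.11:0.10.2

and then, inside the shell:

    val df = spark.read
      .format("com.crealytics.spark.excel")
      .option("useHeader", "true")
      .option("inferSchema", "true")
      .load("hdfs://nameservice1/flatfiles/raw/500a_map_e.xlsx")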

nightscape commented 5 years ago

I can reproduce this locally now. The problem seems to be that, despite shading org.apache.commons.compress, this line is calling the constructor of the unshaded ZipArchiveInputStream. Trying to find out what's happening...

nightscape commented 5 years ago

Not understanding it... The exception says the following:

java.lang.IllegalArgumentException: InputStream of class class org.apache.commons.compress.archivers.zip.ZipArchiveInputStream is not implementing InputStreamStatistics.
  org.apache.poi.openxml4j.util.ZipArchiveThresholdInputStream.<init>(ZipArchiveThresholdInputStream.java:63)
  org.apache.poi.openxml4j.opc.internal.ZipHelper.openZipStream(ZipHelper.java:180)
  org.apache.poi.openxml4j.opc.ZipPackage.<init>(ZipPackage.java:104)
  org.apache.poi.openxml4j.opc.OPCPackage.open(OPCPackage.java:298)
  org.apache.poi.xssf.usermodel.XSSFWorkbookFactory.createWorkbook(XSSFWorkbookFactory.java:129)
  sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
  sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
  sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
  java.lang.reflect.Method.invoke(Method.java:498)
  org.apache.poi.ss.usermodel.WorkbookFactory.createWorkbook(WorkbookFactory.java:314)
  org.apache.poi.ss.usermodel.WorkbookFactory.createXSSFWorkbook(WorkbookFactory.java:296)
  org.apache.poi.ss.usermodel.WorkbookFactory.create(WorkbookFactory.java:214)
  org.apache.poi.ss.usermodel.WorkbookFactory.create(WorkbookFactory.java:180)
  com.crealytics.spark.excel.DefaultWorkbookReader.$anonfun$openWorkbook$1(WorkbookReader.scala:42)

On the other hand, when I download and unzip the spark-excel JAR and run

javap -verbose com/crealytics/spark-excel_2.12/0.11.2/org/apache/poi/openxml4j/opc/internal/ZipHelper.class

it clearly shows that the above method is using the shaded classes:

  public static org.apache.poi.openxml4j.util.ZipArchiveThresholdInputStream openZipStream(java.io.InputStream) throws java.io.IOException;
    descriptor: (Ljava/io/InputStream;)Lorg/apache/poi/openxml4j/util/ZipArchiveThresholdInputStream;
    flags: ACC_PUBLIC, ACC_STATIC
    Code:
      stack=5, locals=2, args_size=1
         0: aload_0
         1: invokestatic  #108                // Method org/apache/poi/poifs/filesystem/FileMagic.prepareToCheckMagic:(Ljava/io/InputStream;)Ljava/io/InputStream;
         4: astore_1
         5: aload_1
         6: invokestatic  #139                // Method verifyZipHeader:(Ljava/io/InputStream;)V
         9: new           #141                // class org/apache/poi/openxml4j/util/ZipArchiveThresholdInputStream
        12: dup
        13: new           #143                // class shadeio/commons/compress/archivers/zip/ZipArchiveInputStream
        16: dup
        17: aload_1
        18: invokespecial #145                // Method shadeio/commons/compress/archivers/zip/ZipArchiveInputStream."<init>":(Ljava/io/InputStream;)V
        21: invokespecial #146                // Method org/apache/poi/openxml4j/util/ZipArchiveThresholdInputStream."<init>":(Ljava/io/InputStream;)V
        24: areturn
jornfranke commented 5 years ago

Maybe some of your dependencies have POI as a dependency, and then that dependency does not use the shaded commons-compress.

nightscape commented 5 years ago

@jornfranke That was exactly the problem. spark-excel itself still adds POI as a dependency (see https://github.com/hammerlab/sbt-parent/issues/32). I'm now bundling and shading all dependencies that require commons-compress.

I just released 0.12.0 with this fix (and Scala 2.12 compatibility), it should appear on Maven Central in the next few hours. Please go ahead and try it. I'll close this issue until there are reports of the problem occurring again.

jlscott3 commented 5 years ago

Confirmed 0.12.0 working in AWS Glue now - thanks for the quick response!

ecv-stan commented 5 years ago

@jlscott3 hi, do you mind sharing how you got this to work in Glue? Did you just add spark-excel_2.12-0.12.0.jar to the Jar lib path in the Glue job? Did you need to set anything else? I tried spark-excel_2.12-0.12.0.jar, spark-excel_2.11-0.12.0.jar, and spark-excel_2.11-0.11.1.jar, but all throw errors... Thanks in advance.

Update:

Finally I got it working in AWS Glue.

Below are the JARs I used:

- ooxml-schemas-1.4.jar
- poi-4.0.0.jar
- spark-excel_2.11-0.12.0.jar
- xmlbeans-3.1.0.jar

Hope it helps.

nightscape commented 5 years ago

It turns out something went wrong while publishing spark-excel_2.12-0.12.0.jar, so that version actually still had this problem. If anyone wants to try Scala 2.12, it should work with spark-excel 0.12.1.

tochandrashekhar commented 4 years ago

> Finally I got it working in AWS Glue. Below are the JARs I used: ooxml-schemas-1.4.jar, poi-4.0.0.jar, spark-excel_2.11-0.12.0.jar, xmlbeans-3.1.0.jar. […]

Do we need to import anything in the Spark code? Can you please provide some sample code?
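For what it's worth, no imports are needed when going through the DataFrame API; the data source is resolved by its format name. A minimal read sketch, assuming the JARs above are attached to the job, with a placeholder S3 path (note that option names changed over time: useHeader in 0.12 and earlier became header from 0.13 on):

    val df = spark.read
      .format("com.crealytics.spark.excel")
      .option("useHeader", "true") // "header" in 0.13+
      .option("inferSchema", "true")
      .load("s3://your-bucket/your-file.xlsx") // placeholder path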

xvinosh commented 4 years ago

Did anyone find a solution to this problem? I am facing the same problem with the latest version of spark-excel (0.13.5):

scala> val file = new File("/Users/vinodsharma/Documents/Spark-Excel/People.xlsx")
file: java.io.File = /Users/vinodsharma/Documents/Spark-Excel/People.xlsx

scala> val fIP = new FileInputStream(file)
fIP: java.io.FileInputStream = java.io.FileInputStream@236ec69

scala> val wb = new XSSFWorkbook(fIP)
java.lang.IllegalArgumentException: InputStream of class class org.apache.commons.compress.archivers.zip.ZipArchiveInputStream is not implementing InputStreamStatistics.
  at org.apache.poi.openxml4j.util.ZipArchiveThresholdInputStream.<init>(ZipArchiveThresholdInputStream.java:65)
  at org.apache.poi.openxml4j.opc.internal.ZipHelper.openZipStream(ZipHelper.java:178)
  at org.apache.poi.openxml4j.opc.ZipPackage.<init>(ZipPackage.java:104)
  at org.apache.poi.openxml4j.opc.OPCPackage.open(OPCPackage.java:307)
  at org.apache.poi.ooxml.util.PackageHelper.open(PackageHelper.java:47)
  at org.apache.poi.xssf.usermodel.XSSFWorkbook.<init>(XSSFWorkbook.java:309)
  ... 51 elided

How do I go about changing the classpath for the commons-compress JAR? In my case, the version is org.apache.commons#commons-compress;1.20.

nightscape commented 4 years ago

You might have to manually exclude commons-compress from the dependencies due to this problem which I don't yet know how to fix: https://github.com/hammerlab/sbt-parent/issues/32
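A sketch of what such an exclusion could look like in sbt, using the version mentioned in this thread (Maven's <exclusions> element works analogously):

    libraryDependencies += ("com.crealytics" %% "spark-excel" % "0.13.5")
      .exclude("org.apache.commons", "commons-compress")

This keeps spark-excel's transitive commons-compress off the classpath, so that only one copy (Spark's, or one you supply yourself) is loaded.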

xvinosh commented 4 years ago

@nightscape: In my case, I tried all the versions from 0.12.1 to 0.13.5; none worked. I then downloaded the latest version of commons-compress (1.20) manually. When launching spark-shell with the --packages option, it showed the JAR as downloaded, but it actually was not (I could not find it anywhere in the local Maven repo directory where it claimed to have put it). So I explicitly put the JAR on the driver's classpath, as below:

$ spark-shell --driver-class-path /home/xvinosh/.m2/repository/org/apache/commons/commons-compress/1.20/commons-compress-1.20.jar

This worked. Hope it helps others.

sjahongir commented 3 years ago

@nightscape hi, I tried the 0.9.0 version with Spark 2.3.1 (local and cluster mode). It worked, but when I use a large Excel file, Spark cannot process it.

Then I tried higher versions of your library, from 0.10 onward, and got:

Exception in thread "main" java.lang.IllegalArgumentException: InputStream of class class org.apache.commons.compress.archivers.zip.ZipFile$1 is not implementing InputStreamStatistics.
  at org.apache.poi.openxml4j.util.ZipArchiveThresholdInputStream.<init>(ZipArchiveThresholdInputStream.java:63)
  at org.apache.poi.openxml4j.util.ZipSecureFile.getInputStream(ZipSecureFile.java:147)
  at org.apache.poi.openxml4j.util.ZipSecureFile.getInputStream(ZipSecureFile.java:34)
  at org.apache.poi.openxml4j.util.ZipFileZipEntrySource.getInputStream(ZipFileZipEntrySource.java:66)
  at org.apache.poi.openxml4j.opc.ZipPackage.getPartsImpl(ZipPackage.java:258)
  at org.apache.poi.openxml4j.opc.OPCPackage.getParts(OPCPackage.java:725)
  at org.apache.poi.openxml4j.opc.OPCPackage.open(OPCPackage.java:238)
  at etl.io.XlsxReader.open(XlsxReader.scala:135)
  at etl.io.XlsxReader.<init>(XlsxReader.scala:153)
  at etl.connectors.excel.ExcelConnector.readXlsx(ExcelConnector.scala:194)
  at etl.connectors.excel.ExcelConnector.read(ExcelConnector.scala:119)
  at etl.io.DatasetReader$.read(DatasetReader.scala:47)
  at etl.DatasetResolver$.byModel(DatasetResolver.scala:58)
  at etl.App$.processTask(App.scala:105)
  at etl.App$.main(App.scala:65)
  at etl.App.main(App.scala)
  at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
  at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
  at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
  at java.lang.reflect.Method.invoke(Method.java:498)
  at org.apache.spark.deploy.JavaMainApplication.start(SparkApplication.scala:52)
  at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:894)
  at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:198)
  at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:228)
  at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:137)
  at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)

nightscape commented 3 years ago

@sjahongir can you try the recommendation from @xvinosh?

SwapnaRavi21 commented 3 years ago

@nightscape I still see issues with spark-excel on Scala 2.12. Using 0.13.4 I get:

java.lang.IllegalArgumentException: InputStream of class class org.apache.commons.compress.archivers.zip.ZipArchiveInputStream is not implementing InputStreamStatistics.
  at org.apache.poi.openxml4j.util.ZipArchiveThresholdInputStream.<init>(ZipArchiveThresholdInputStream.java:65)

Using 0.12.0 or 0.12.1 I get useHeader errors as well as the one above. Nothing is working out. I tried adding commons-compress-1.20.jar along with the other JARs in my spark-submit; no use.

We are currently migrating to Scala 2.12. Could you please suggest a spark-excel version for it that avoids these issues?

nightscape commented 3 years ago

Hi @SwapnaRavi21, I would recommend always using the latest version available for your Spark and Scala version. @quanghgx and I will try to figure out a way to build against multiple versions of Spark. Unfortunately, I'm under quite some deadline pressure at the moment and will probably only get to this in the second week of November. If you have experience with SBT, we'd be happy about any contributions!

SwapnaRavi21 commented 3 years ago

@nightscape yes, we are on the latest Scala, 2.12. But this fix is available only for 2.11 and not for 2.12, right? Sure, thanks. Meanwhile, is there any alternative to this dependency that we can use on 2.12 until the fix is provided in this version?

neontty commented 1 month ago

Currently seeing this behavior in Databricks on multiple runtime versions (14.3 LTS, 15.4 LTS); Scala 2.12, Spark 3.5.0.

Version: com.crealytics:spark-excel_2.12:3.5.0_0.20.3

Caused by: java.lang.NoSuchMethodError: org.apache.commons.compress.archivers.zip.ZipArchiveOutputStream.putArchiveEntry(Lorg/apache/commons/compress/archivers/zip/ZipArchiveEntry;)V
    at org.apache.poi.openxml4j.opc.internal.ZipContentTypeManager.saveImpl(ZipContentTypeManager.java:65)
    at org.apache.poi.openxml4j.opc.internal.ContentTypeManager.save(ContentTypeManager.java:450)
    at org.apache.poi.openxml4j.opc.ZipPackage.saveImpl(ZipPackage.java:608)
    at org.apache.poi.openxml4j.opc.OPCPackage.save(OPCPackage.java:1532)
    at org.apache.poi.ooxml.POIXMLDocument.write(POIXMLDocument.java:227)
    at com.crealytics.spark.excel.v2.ExcelGenerator.close(ExcelGenerator.scala:177)
    at com.crealytics.spark.excel.v2.ExcelOutputWriter.close(ExcelOutputWriter.scala:34)
    at org.apache.spark.sql.execution.datasources.FileFormatDataWriter.releaseCurrentWriter(FileFormatDataWriter.scala:71)
    at org.apache.spark.sql.execution.datasources.FileFormatDataWriter.releaseResources(FileFormatDataWriter.scala:82)
    at org.apache.spark.sql.execution.datasources.FileFormatDataWriter.$anonfun$commit$2(FileFormatDataWriter.scala:141)
    at org.apache.spark.sql.execution.datasources.FileFormatDataWriter.enrichWriteError(FileFormatDataWriter.scala:97)
    at org.apache.spark.sql.execution.datasources.FileFormatDataWriter.commit(FileFormatDataWriter.scala:140)
    at org.apache.spark.sql.execution.datasources.FileFormatWriter$.$anonfun$executeTask$1(FileFormatWriter.scala:560)
    at org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1560)
    at org.apache.spark.sql.execution.datasources.FileFormatWriter$.executeTask(FileFormatWriter.scala:566)
    at org.apache.spark.sql.execution.datasources.WriteFilesExec.$anonfun$doExecuteWrite$1(WriteFiles.scala:125)
    at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2(RDD.scala:938)
    at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2$adapted(RDD.scala:938)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:60)
    at org.apache.spark.rdd.RDD.$anonfun$computeOrReadCheckpoint$1(RDD.scala:413)
    at com.databricks.spark.util.ExecutorFrameProfiler$.record(ExecutorFrameProfiler.scala:110)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:410)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:377)
    at org.apache.spark.scheduler.ResultTask.$anonfun$runTask$3(ResultTask.scala:82)
    at com.databricks.spark.util.ExecutorFrameProfiler$.record(ExecutorFrameProfiler.scala:110)
    at org.apache.spark.scheduler.ResultTask.$anonfun$runTask$1(ResultTask.scala:82)
    at com.databricks.spark.util.ExecutorFrameProfiler$.record(ExecutorFrameProfiler.scala:110)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:62)
    at org.apache.spark.TaskContext.runTaskWithListeners(TaskContext.scala:211)
    at org.apache.spark.scheduler.Task.doRunTask(Task.scala:199)
    at org.apache.spark.scheduler.Task.$anonfun$run$5(Task.scala:161)
    at com.databricks.unity.UCSEphemeralState$Handle.runWith(UCSEphemeralState.scala:51)
    at com.databricks.unity.HandleImpl.runWith(UCSHandle.scala:104)
    at com.databricks.unity.HandleImpl.$anonfun$runWithAndClose$1(UCSHandle.scala:109)
    at scala.util.Using$.resource(Using.scala:269)
    at com.databricks.unity.HandleImpl.runWithAndClose(UCSHandle.scala:108)
    at org.apache.spark.scheduler.Task.$anonfun$run$1(Task.scala:155)
    at com.databricks.spark.util.ExecutorFrameProfiler$.record(ExecutorFrameProfiler.scala:110)
    at org.apache.spark.scheduler.Task.run(Task.scala:102)
    at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$10(Executor.scala:1036)
    at org.apache.spark.util.SparkErrorUtils.tryWithSafeFinally(SparkErrorUtils.scala:64)
    at org.apache.spark.util.SparkErrorUtils.tryWithSafeFinally$(SparkErrorUtils.scala:61)
    at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:110)
    at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:1039)
    at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
    at com.databricks.spark.util.ExecutorFrameProfiler$.record(ExecutorFrameProfiler.scala:110)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:926)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    ... 1 more

Excluding org.apache.commons:commons-compress when building our Spark application JAR did not help. Adding an explicit dependency on commons-compress did not help either.

Are there any recommendations for workarounds?

pjfanning commented 1 month ago

@neontty It looks like Spark defaults to an out-of-date, CVE-ridden version of commons-compress.

https://mvnrepository.com/artifact/org.apache.spark/spark-core_2.13/3.5.3

POI uses a newer version of commons-compress and must rely on methods that were added or changed recently.

Can you try upgrading the commons-compress JAR that Spark uses? It may be best to ask on the Spark mailing lists or forums if you don't know how to do this.
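When debugging this kind of clash, it also helps to check which commons-compress actually wins at runtime. A small diagnostic you can paste into a notebook or spark-shell (plain JVM reflection, nothing library-specific; either line can print null if the class comes without that metadata):

    // Where was ZipArchiveInputStream loaded from, and which version does its JAR manifest claim?
    val cls = Class.forName("org.apache.commons.compress.archivers.zip.ZipArchiveInputStream")
    println(cls.getProtectionDomain.getCodeSource.getLocation)
    println(cls.getPackage.getImplementationVersion)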

neontty commented 2 weeks ago

Hi @pjfanning, thanks for the quick response. I'm just looking into this a bit more and trying to understand why the shading rule at build.sc:67 isn't enough.

Is it because of this discussion regarding shading in the Mill build system? https://github.com/com-lihaoyi/mill/issues/3815

nightscape commented 2 weeks ago

@neontty thanks for commenting over at Mill 👍 If you and/or your colleagues could pick that issue up, that would be great. With the bounty on top, you could do a nice celebration with your colleagues 🍻 😄