Error while reading mounted xlsx: Could not initialize class shadeio.poi.xssf.model.SharedStringsTable

ghost commented 3 years ago

I am using Azure Databricks and I am trying to read an Excel file (xlsx) from a Storage account (ADLS Gen2). Because I get an 'Anonymous access' error when I connect to the file using the wasbs path, I mounted the storage and tried to read the Excel file from there. This is my code:

```
df = spark.read \
    .format("csv") \
    .option("header", "true") \
    .option("delimiter", ";") \
    .load("/mnt/mountPoint/Budget.csv")

df = spark.read \
    .format("com.crealytics.spark.excel") \
    .option("header", "true") \
    .option("sheetName", "Sheet1") \
    .load("/mnt/mountPoint/Budget.xls")

df = spark.read \
    .format("com.crealytics.spark.excel") \
    .option("header", "true") \
    .option("sheetName", "Sheet1") \
    .load("/mnt/mountPoint/Budget.xlsx")
```

The first command succeeds and I get the headers from the file. A df.show() will show me the content. The second command (using the xls) succeeds as well and I get the schema and content. The third command fails with this error: java.lang.NoClassDefFoundError: Could not initialize class shadeio.poi.xssf.model.SharedStringsTable

I am using Databricks runtime 8.3 with Apache Spark 3.1.1 and Scala 2.12. What I have tried so far (all with the same error):

- Different versions of the crealytics library. I tried 14.0, 13.7 and 13.6, all of them for Scala 2.12.
- The above code is in Python; I also tried it in Scala.
- I copied the content of the file (just the cells with data) to a new file and stored it as xlsx and xls.
- Different sheet names. The file has just one sheet, named 'Sheet1'.
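One way to rule out the file itself, in the spirit of the re-saved copy in the list above, is to check that the .xlsx is a readable zip archive containing the parts a workbook normally has. This is only a minimal sketch: `xlsx_parts` and `looks_like_xlsx` are hypothetical helper names, and the usage path assumes the mount is also visible on the driver's local filesystem under `/dbfs`.

```python
import zipfile


def xlsx_parts(path_or_file):
    """Return the names of the parts inside an .xlsx file (a zip archive)."""
    with zipfile.ZipFile(path_or_file) as archive:
        # testzip() returns the first corrupt member, or None if all are fine
        bad = archive.testzip()
        if bad is not None:
            raise ValueError(f"corrupt archive member: {bad}")
        return archive.namelist()


def looks_like_xlsx(path_or_file):
    """True if the archive opens cleanly and has the basic workbook parts."""
    try:
        names = set(xlsx_parts(path_or_file))
    except (zipfile.BadZipFile, ValueError, OSError):
        return False
    return {"[Content_Types].xml", "xl/workbook.xml"} <= names


# Usage (path is an assumption about how the mount is exposed):
# looks_like_xlsx("/dbfs/mnt/mountPoint/Budget.xlsx")
```

If this returns `True` and the read still fails, the file is probably not the cause: a `NoClassDefFoundError` is raised by class loading on the JVM side, which points at the installed library rather than the workbook.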

This is the full stack trace. Any help is very much appreciated!

```
---------------------------------------------------------------------------
Py4JJavaError                             Traceback (most recent call last)
in
     11     .load("/mnt/mountPoint/Budget.xls")
     12
---> 13 df = spark.read \
     14     .format("com.crealytics.spark.excel") \
     15     .option("header", "true") \

/databricks/spark/python/pyspark/sql/readwriter.py in load(self, path, format, schema, **options)
    202         self.options(**options)
    203         if isinstance(path, str):
--> 204             return self._df(self._jreader.load(path))
    205         elif path is not None:
    206             if type(path) != list:

/databricks/spark/python/lib/py4j-0.10.9-src.zip/py4j/java_gateway.py in __call__(self, *args)
   1302
   1303         answer = self.gateway_client.send_command(command)
-> 1304         return_value = get_return_value(
   1305             answer, self.gateway_client, self.target_id, self.name)
   1306

/databricks/spark/python/pyspark/sql/utils.py in deco(*a, **kw)
    115     def deco(*a, **kw):
    116         try:
--> 117             return f(*a, **kw)
    118         except py4j.protocol.Py4JJavaError as e:
    119             converted = convert_exception(e.java_exception)

/databricks/spark/python/lib/py4j-0.10.9-src.zip/py4j/protocol.py in get_return_value(answer, gateway_client, target_id, name)
    324             value = OUTPUT_CONVERTER[type](answer[2:], gateway_client)
    325             if answer[1] == REFERENCE_TYPE:
--> 326                 raise Py4JJavaError(
    327                     "An error occurred while calling {0}{1}{2}.\n".
    328                     format(target_id, ".", name), value)

Py4JJavaError: An error occurred while calling o714.load.
: java.lang.NoClassDefFoundError: Could not initialize class shadeio.poi.xssf.model.SharedStringsTable
    at shadeio.poi.ooxml.POIXMLFactory.createDocumentPart(POIXMLFactory.java:61)
    at shadeio.poi.ooxml.POIXMLDocumentPart.read(POIXMLDocumentPart.java:684)
    at shadeio.poi.ooxml.POIXMLDocument.load(POIXMLDocument.java:180)
    at shadeio.poi.xssf.usermodel.XSSFWorkbook.<init>(XSSFWorkbook.java:288)
    at shadeio.poi.xssf.usermodel.XSSFWorkbookFactory.createWorkbook(XSSFWorkbookFactory.java:97)
    at shadeio.poi.xssf.usermodel.XSSFWorkbookFactory.createWorkbook(XSSFWorkbookFactory.java:147)
    at shadeio.poi.ss.usermodel.WorkbookFactory.create(WorkbookFactory.java:256)
    at shadeio.poi.ss.usermodel.WorkbookFactory.create(WorkbookFactory.java:221)
    at com.crealytics.spark.excel.DefaultWorkbookReader.$anonfun$openWorkbook$1(WorkbookReader.scala:49)
    at scala.Option.fold(Option.scala:251)
    at com.crealytics.spark.excel.DefaultWorkbookReader.openWorkbook(WorkbookReader.scala:49)
    at com.crealytics.spark.excel.WorkbookReader.withWorkbook(WorkbookReader.scala:14)
    at com.crealytics.spark.excel.WorkbookReader.withWorkbook$(WorkbookReader.scala:13)
    at com.crealytics.spark.excel.DefaultWorkbookReader.withWorkbook(WorkbookReader.scala:45)
    at com.crealytics.spark.excel.ExcelRelation.excerpt$lzycompute(ExcelRelation.scala:31)
    at com.crealytics.spark.excel.ExcelRelation.excerpt(ExcelRelation.scala:31)
    at com.crealytics.spark.excel.ExcelRelation.headerColumns$lzycompute(ExcelRelation.scala:102)
    at com.crealytics.spark.excel.ExcelRelation.headerColumns(ExcelRelation.scala:101)
    at com.crealytics.spark.excel.ExcelRelation.$anonfun$inferSchema$1(ExcelRelation.scala:163)
    at scala.Option.getOrElse(Option.scala:189)
    at com.crealytics.spark.excel.ExcelRelation.inferSchema(ExcelRelation.scala:162)
    at com.crealytics.spark.excel.ExcelRelation.<init>(ExcelRelation.scala:35)
    at com.crealytics.spark.excel.DefaultSource.createRelation(DefaultSource.scala:35)
    at com.crealytics.spark.excel.DefaultSource.createRelation(DefaultSource.scala:13)
    at com.crealytics.spark.excel.DefaultSource.createRelation(DefaultSource.scala:8)
    at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:390)
    at org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:432)
    at org.apache.spark.sql.DataFrameReader.$anonfun$load$3(DataFrameReader.scala:399)
    at scala.Option.getOrElse(Option.scala:189)
    at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:399)
    at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:286)
    at sun.reflect.GeneratedMethodAccessor274.invoke(Unknown Source)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
    at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:380)
    at py4j.Gateway.invoke(Gateway.java:295)
    at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
    at py4j.commands.CallCommand.execute(CallCommand.java:79)
    at py4j.GatewayConnection.run(GatewayConnection.java:251)
    at java.lang.Thread.run(Thread.java:748)
```

udossa commented 2 years ago

Hi guys, any update on this error? I have the same issue

quanghgx commented 2 years ago

Hi @thijsnijhuis and @udossa

Could you please try again with the format from: "com.crealytics.spark.excel" -> "excel"?
```
.format("excel")
```
Also, please take a look at the list of dependencies that spark-excel needs in order to work. This wiki might have some useful ideas.

Credit to #133 (Apache commons dependency issue) by @jakeatmsft and to @fwani's solution.

ghost commented 2 years ago

@quanghgx, thanks for your reply. I have changed it, but now I simply get this error: java.lang.ClassNotFoundException: Failed to find data source: excel. Please find packages at http://spark.apache.org/third-party-projects.html

I will need to take a look at the wiki link later on. Thanks!
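Since the short name `excel` and the full name `com.crealytics.spark.excel` were both mentioned in this thread, a small wrapper can try them in turn while debugging. This is just a sketch: `load_excel` is a hypothetical helper, and it only works around the "Failed to find data source" case. It will not fix the original `NoClassDefFoundError`, which comes from the library's classes, not from the format name.

```python
def load_excel(spark, path, formats=("excel", "com.crealytics.spark.excel"), **options):
    """Try each data source name in order; return the first DataFrame that loads."""
    errors = []
    for name in formats:
        reader = spark.read.format(name)
        for key, value in options.items():
            reader = reader.option(key, value)
        try:
            return reader.load(path)
        except Exception as exc:  # JVM errors reach Python as exceptions via Py4J
            errors.append(f"{name}: {exc}")
    # Every candidate failed: report all errors together
    raise RuntimeError("no Excel data source could load the file:\n" + "\n".join(errors))


# Usage in a notebook where `spark` already exists:
# df = load_excel(spark, "/mnt/mountPoint/Budget.xlsx", header="true")
```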

fwani commented 2 years ago

@thijsnijhuis I think you should first add the Excel dependency, com.crealytics:spark-excel_2.12, with a specific version (because the error is java.lang.ClassNotFoundException: Failed to find data source: excel). See https://github.com/crealytics/spark-excel#linking
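For reference, here is a rough sketch of what adding that dependency can look like when you create the Spark session yourself. The helper names are hypothetical and the version `0.14.0` is only an illustration; pick the release that matches your Spark and Scala versions. On Databricks the session already exists when the notebook starts, so `spark.jars.packages` set this way would not take effect there, and the library has to be attached to the cluster instead.

```python
def spark_excel_coordinate(version, scala_version="2.12"):
    """Build the Maven coordinate: group:artifact_scalaVersion:version."""
    return f"com.crealytics:spark-excel_{scala_version}:{version}"


def build_session(version):
    """Create a session that resolves spark-excel from Maven at startup."""
    # Imported lazily so the helper above works without pyspark installed
    from pyspark.sql import SparkSession

    return (
        SparkSession.builder
        .config("spark.jars.packages", spark_excel_coordinate(version))
        .getOrCreate()
    )


# Illustrative version only:
# spark = build_session("0.14.0")
```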

abhisrphoenix commented 2 years ago

Please try changing the library installation to Maven; that resolved my issue.
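Installing from Maven is normally done in the cluster's Libraries tab, but the same thing can be scripted. A minimal sketch of the request body for the Databricks Libraries API (`POST /api/2.0/libraries/install`), assuming a hypothetical cluster ID and an illustrative library version; `maven_install_payload` is a made-up helper name, and the HTTP call and authentication are left out.

```python
import json


def maven_install_payload(cluster_id, coordinates):
    """Request body that installs one Maven library on a cluster."""
    return {
        "cluster_id": cluster_id,
        "libraries": [{"maven": {"coordinates": coordinates}}],
    }


payload = maven_install_payload(
    "1234-567890-abcde123",  # hypothetical cluster ID
    "com.crealytics:spark-excel_2.12:0.14.0",  # illustrative version
)
print(json.dumps(payload, indent=2))
```

A Maven install lets the cluster resolve the library's transitive dependencies, which is presumably why it helped here compared with attaching a single JAR.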

nightscape / spark-excel

Error while reading mounted xlsx: Could not initialize class shadeio.poi.xssf.model.SharedStringsTable #438