If you store a PDF test.pdf with the extension XLSX -> text.xlsx and read the file with a schema provided, you get an empty dataframe. Even if there is no workbook and no data to read.
Expected Behavior
If there is no workbook/sheet the reader should throw an exception/error.
Py4JJavaError: An error occurred while calling o575.load.
: java.io.IOException: Can't open workbook - unsupported file type: PDF
at org.apache.poi.ss.usermodel.WorkbookFactory.create(WorkbookFactory.java:228)
at org.apache.poi.ss.usermodel.WorkbookFactory.create(WorkbookFactory.java:185)
at com.crealytics.spark.excel.v2.ExcelHelper.getWorkbook(ExcelHelper.scala:120)
at com.crealytics.spark.excel.v2.ExcelHelper.getSheetData(ExcelHelper.scala:137)
at com.crealytics.spark.excel.v2.ExcelHelper.parseSheetData(ExcelHelper.scala:160)
at com.crealytics.spark.excel.v2.ExcelTable.infer(ExcelTable.scala:77)
at com.crealytics.spark.excel.v2.ExcelTable.inferSchema(ExcelTable.scala:48)
at org.apache.spark.sql.execution.datasources.v2.FileTable.$anonfun$dataSchema$4(FileTable.scala:71)
at scala.Option.orElse(Option.scala:447)
at org.apache.spark.sql.execution.datasources.v2.FileTable.dataSchema$lzycompute(FileTable.scala:71)
at org.apache.spark.sql.execution.datasources.v2.FileTable.dataSchema(FileTable.scala:65)
at org.apache.spark.sql.execution.datasources.v2.FileTable.schema$lzycompute(FileTable.scala:83)
at org.apache.spark.sql.execution.datasources.v2.FileTable.schema(FileTable.scala:81)
at org.apache.spark.sql.connector.catalog.Table.columns(Table.java:68)
at org.apache.spark.sql.execution.datasources.v2.FileTable.columns(FileTable.scala:37)
at org.apache.spark.sql.execution.datasources.v2.FileDataSourceV2.inferSchema(FileDataSourceV2.scala:95)
at org.apache.spark.sql.execution.datasources.v2.FileDataSourceV2.inferSchema$(FileDataSourceV2.scala:92)
at com.crealytics.spark.excel.v2.ExcelDataSource.inferSchema(ExcelDataSource.scala:27)
at org.apache.spark.sql.execution.datasources.v2.DataSourceV2Utils$.getTableFromProvider(DataSourceV2Utils.scala:99)
at org.apache.spark.sql.execution.datasources.v2.DataSourceV2Utils$.loadV2Source(DataSourceV2Utils.scala:237)
at org.apache.spark.sql.DataFrameReader.$anonfun$load$1(DataFrameReader.scala:344)
at scala.Option.flatMap(Option.scala:271)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:342)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:245)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:397)
at py4j.Gateway.invoke(Gateway.java:306)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.ClientServerConnection.waitForCommands(ClientServerConnection.java:199)
at py4j.ClientServerConnection.run(ClientServerConnection.java:119)
at java.lang.Thread.run(Thread.java:750)
Steps To Reproduce
Store any PDF with the file extension .xlsx and then run following code:
Is there an existing issue for this?
Current Behavior
If you store a PDF test.pdf with the extension XLSX -> text.xlsx and read the file with a schema provided, you get an empty dataframe. Even if there is no workbook and no data to read.
Expected Behavior
If there is no workbook/sheet the reader should throw an exception/error.
Exactly as it happens if you do not add a schema:
Py4JJavaError: An error occurred while calling o575.load. : java.io.IOException: Can't open workbook - unsupported file type: PDF at org.apache.poi.ss.usermodel.WorkbookFactory.create(WorkbookFactory.java:228) at org.apache.poi.ss.usermodel.WorkbookFactory.create(WorkbookFactory.java:185) at com.crealytics.spark.excel.v2.ExcelHelper.getWorkbook(ExcelHelper.scala:120) at com.crealytics.spark.excel.v2.ExcelHelper.getSheetData(ExcelHelper.scala:137) at com.crealytics.spark.excel.v2.ExcelHelper.parseSheetData(ExcelHelper.scala:160) at com.crealytics.spark.excel.v2.ExcelTable.infer(ExcelTable.scala:77) at com.crealytics.spark.excel.v2.ExcelTable.inferSchema(ExcelTable.scala:48) at org.apache.spark.sql.execution.datasources.v2.FileTable.$anonfun$dataSchema$4(FileTable.scala:71) at scala.Option.orElse(Option.scala:447) at org.apache.spark.sql.execution.datasources.v2.FileTable.dataSchema$lzycompute(FileTable.scala:71) at org.apache.spark.sql.execution.datasources.v2.FileTable.dataSchema(FileTable.scala:65) at org.apache.spark.sql.execution.datasources.v2.FileTable.schema$lzycompute(FileTable.scala:83) at org.apache.spark.sql.execution.datasources.v2.FileTable.schema(FileTable.scala:81) at org.apache.spark.sql.connector.catalog.Table.columns(Table.java:68) at org.apache.spark.sql.execution.datasources.v2.FileTable.columns(FileTable.scala:37) at org.apache.spark.sql.execution.datasources.v2.FileDataSourceV2.inferSchema(FileDataSourceV2.scala:95) at org.apache.spark.sql.execution.datasources.v2.FileDataSourceV2.inferSchema$(FileDataSourceV2.scala:92) at com.crealytics.spark.excel.v2.ExcelDataSource.inferSchema(ExcelDataSource.scala:27) at org.apache.spark.sql.execution.datasources.v2.DataSourceV2Utils$.getTableFromProvider(DataSourceV2Utils.scala:99) at org.apache.spark.sql.execution.datasources.v2.DataSourceV2Utils$.loadV2Source(DataSourceV2Utils.scala:237) at org.apache.spark.sql.DataFrameReader.$anonfun$load$1(DataFrameReader.scala:344) at scala.Option.flatMap(Option.scala:271) at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:342) at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:245) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:498) at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244) at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:397) at py4j.Gateway.invoke(Gateway.java:306) at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132) at py4j.commands.CallCommand.execute(CallCommand.java:79) at py4j.ClientServerConnection.waitForCommands(ClientServerConnection.java:199) at py4j.ClientServerConnection.run(ClientServerConnection.java:119) at java.lang.Thread.run(Thread.java:750)
Steps To Reproduce
Store any PDF with the file extension .xlsx and then run following code:
Environment
Anything else?
No response