[BUG] Cannot read files into dataframe in Databricks 11.3 LTS Runtime 3.3.0 Spark

james-miles-ccy commented 1 year ago

Is there an existing issue for this?

[X] I have searched the existing issues

Current Behavior

When running the v2 Excel PySpark code below on the Databricks 11.3 LTS runtime:

    df = spark.read.format("excel") \
        .option("header", True) \
        .option("inferSchema", True) \
        .load(fr"{folderpath}//.xlsx")
    display(df)

I receive the following error upon attempting to display or use the resulting dataframe:

org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 1.0 failed 4 times, most recent failure: Lost task 0.3 in stage 1.0 (TID 101) (10.94.235.131 executor 1): java.lang.AbstractMethodError: org.apache.spark.sql.execution.datasources.v2.FilePartitionReaderFactory.options()Lorg/apache/spark/sql/catalyst/FileSourceOptions;

Expected Behavior

The resulting Dataframe should display correctly.

Steps To Reproduce

Set the folderpath variable to a location containing Excel files, then run the Python code below on the Databricks 11.3 LTS runtime:

    df = spark.read.format("excel") \
        .option("header", True) \
        .option("inferSchema", True) \
        .load(fr"{folderpath}//.xlsx")
    display(df)

Environment

- Spark version: 3.3.0
- Spark-Excel version: 0.18.5
- OS: Windows 10
- Cluster environment

Anything else?

No response

nightscape commented 1 year ago

Hey @james-miles-ccy, the Spark-Excel version should consist of the Spark version and the version of Spark-Excel itself. You were only specifying the version of Spark-Excel. Can you check you were using 3.3.1_0.18.5?
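To make the naming scheme concrete, here is a small sketch that splits such a combined version string into its Spark part and its Spark-Excel part. The helper names are made up for illustration and are not part of spark-excel, and comparing only the major.minor line is an assumption based on this thread, where a 3.3.1 artifact is used on Spark 3.3.0.

```python
from typing import Tuple


def split_spark_excel_version(artifact_version: str) -> Tuple[str, str]:
    """Split '<spark version>_<spark-excel version>' into its two parts.

    Hypothetical helper for illustration only; not part of spark-excel.
    """
    spark_part, sep, excel_part = artifact_version.partition("_")
    if not sep or not spark_part or not excel_part:
        raise ValueError(
            f"{artifact_version!r} has no Spark version prefix; "
            "expected something like '3.3.1_0.18.5'"
        )
    return spark_part, excel_part


def same_minor_line(spark_version: str, artifact_version: str) -> bool:
    """True if the artifact targets the same major.minor Spark line as the cluster.

    Assumption: matching major.minor is what matters here.
    """
    spark_part, _ = split_spark_excel_version(artifact_version)
    return spark_part.split(".")[:2] == spark_version.split(".")[:2]
```

With this, `split_spark_excel_version("0.18.5")` raises a ValueError, which is the situation described above: the version given in the issue names only Spark-Excel itself.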

james-miles-ccy commented 1 year ago

Yes, I am using 3.3.1_0.18.5.

nightscape commented 1 year ago

Can you check the same thing with a local or other non-Databricks Spark 3.3.0? We already had the case once where Databricks used a slightly different and not fully API-compatible version of Spark in their Runtime than the officially published one.

james-miles-ccy commented 1 year ago

I have installed PySpark and spark-excel locally. The v1 format works fine and generates dataframes on Spark 3.3.1, but using a path for multiple files (i.e. the v2 format) causes cells to hang and never complete. I am using the same spark-excel version as stated above.

nightscape commented 1 year ago

Is it the same error/issue as on Databricks?

james-miles-ccy commented 1 year ago

No. On Databricks you get the error listed in my original comment, whereas locally the execution never completes.

FYI, this is only an issue for v2; v1 works both on Databricks and locally.
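For anyone comparing the two readers, a minimal sketch of switching between them follows. It assumes the data source names from the spark-excel README ("com.crealytics.spark.excel" for v1, "excel" for v2); the helper `excel_reader` is hypothetical, and the options simply mirror the snippet in the original comment.

```python
V1_FORMAT = "com.crealytics.spark.excel"  # v1 data source name (assumed from the README)
V2_FORMAT = "excel"  # v2 data source name, as used in this issue


def excel_reader(spark, use_v2: bool):
    """Return a DataFrameReader configured like the snippet in this issue.

    Hypothetical helper; `spark` is an existing SparkSession.
    """
    fmt = V2_FORMAT if use_v2 else V1_FORMAT
    return (
        spark.read.format(fmt)
        .option("header", True)
        .option("inferSchema", True)
    )
```

Something like `excel_reader(spark, use_v2=False).load(path_to_one_file)` would then go through v1, where `path_to_one_file` is a placeholder for a single workbook path.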

snehawankhade commented 1 year ago

I am facing the same issue with v2 (Spark version: 3.3.0, spark-excel: 3.3.1_0.18.5). v1 works, but not completely: input_file_name() returns an empty string.

nightscape commented 1 year ago

input_file_name is only supported in v2. Unfortunately, I didn't have time to look into the original issue.

dazfuller commented 1 year ago

Hey @nightscape. This got mentioned in our implementation as well.

I think I've traced the issue to Databricks using a patched Spark runtime in the 11.x runtimes (and the 12.0 beta runtime), which includes a change from Spark's master branch that isn't in the 3.3 support branch.

I'm looking into this further at the moment and I'll shout if I find anything.

dazfuller commented 1 year ago

Just to add an update: I've been talking with Databricks and there's a fix coming which will resolve this in the 11.x and 12.x runtimes. It should hopefully arrive in January.

nightscape commented 1 year ago

@dazfuller thanks a lot for pushing this forward and keeping us updated here!! We had a similar issue before, so I guess Databricks breaking compatibility with the open-source Spark version is something we have to keep an eye on...

james-miles-ccy commented 1 year ago

Hi all, FYI, it looks like this has been resolved by Databricks in the 12.1 runtime!
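For anyone who wants to guard a notebook against this, here is a rough sketch of a runtime check. It assumes the cluster exposes its runtime version in the DATABRICKS_RUNTIME_VERSION environment variable, and the 12.1 threshold reflects only what was reported in this thread. Both function names are made up for illustration.

```python
import os
from typing import Mapping, Optional, Tuple


def databricks_runtime(env: Mapping[str, str] = os.environ) -> Optional[Tuple[int, int]]:
    """Return (major, minor) of the Databricks runtime, or None if it cannot be read.

    Assumption: DATABRICKS_RUNTIME_VERSION holds a value such as "11.3".
    """
    raw = env.get("DATABRICKS_RUNTIME_VERSION")
    if not raw:
        return None  # not on Databricks, or the variable is not set
    major_text, _, rest = raw.partition(".")
    minor_text = rest.partition(".")[0]
    try:
        return int(major_text), int(minor_text or "0")
    except ValueError:
        return None  # unexpected format; do not guess


def v2_reported_working(runtime: Optional[Tuple[int, int]]) -> bool:
    """True only for runtimes reported as working in this thread (12.1 and later)."""
    return runtime is not None and runtime >= (12, 1)
```

A notebook could call `v2_reported_working(databricks_runtime())` and fall back to the v1 reader when it returns False. Patched 11.x runtimes may also work, but this thread does not say which ones.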

nightscape / spark-excel