nightscape / spark-excel

A Spark plugin for reading and writing Excel files
Apache License 2.0

py4j.protocol.Py4JJavaError: An error occurred while calling o83.load. : java.lang.NoClassDefFoundError: org/apache/commons/compress/utils/InputStreamStatistics #242

Closed anishsatalkar closed 4 years ago

anishsatalkar commented 4 years ago

I am trying to load a very simple Excel file. I am using version 0.11.1 on PySpark 2.4.3. There doesn't seem to be a problem with the options I am providing, since I am following the docs of v0.11.1.

I am getting the following error (I have also included some additional info so that debugging becomes easier):

```
[hadoop@ip-10-228-0-65 ~]$ pyspark
Python 2.7.16 (default, Oct 14 2019, 21:26:56)
[GCC 4.8.5 20150623 (Red Hat 4.8.5-28)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
20/05/03 17:26:42 WARN HiveConf: HiveConf of name hive.server2.thrift.url does not exist
20/05/03 17:26:44 WARN Client: Neither spark.yarn.jars nor spark.yarn.archive is set, falling back to uploading libraries under SPARK_HOME.
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /__ / .__/\_,_/_/ /_/\_\   version 2.4.4
      /_/

Using Python version 2.7.16 (default, Oct 14 2019 21:26:56)
SparkSession available as 'spark'.
>>> df = spark.read.format("com.crealytics.spark.excel") \
...     .option("useHeader", "false") \
...     .option("maxRowsInMemory", 50) .\
...     load("s3://<bucket-name>/<directory>/simple_data.xlsx")
20/05/03 17:27:16 WARN HadoopFileSystemOwner: found no group information for hadoop (auth:SIMPLE), using hadoop as primary group
Traceback (most recent call last):
  File "<stdin>", line 4, in <module>
  File "/usr/lib/spark/python/pyspark/sql/readwriter.py", line 166, in load
    return self._df(self._jreader.load(path))
  File "/usr/lib/spark/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py", line 1257, in __call__
  File "/usr/lib/spark/python/pyspark/sql/utils.py", line 79, in deco
    raise IllegalArgumentException(s.split(': ', 1)[1], stackTrace)
pyspark.sql.utils.IllegalArgumentException: u'InputStream of class class org.apache.commons.compress.archivers.zip.ZipFile$1 is not implementing InputStreamStatistics.'
>>>
```

Here is the spark.jars list I am providing in the spark-defaults.conf file:

```
spark.jars /mnt/ojdbc6.jar,/mnt/mysql-connector-java.jar,/mnt/sqljdbc41.jar,/mnt/zs_edge_s3_direct_output_committer_2.11-1.0.jar,/usr/lib/hive/lib/hive-contrib.jar,/mnt/FuzzyJars/hive-udf-textmining-1.0-SNAPSHOT-jar-with-dependencies.jar,/mnt/postgresql-42.2.2.jar,/mnt/hive-hcatalog-core-1.1.1.jar,/mnt/scala-library-2.10.5.jar,/mnt/spark-xml_2.10-0.4.1.jar,/mnt/spark-sas7bdat-2.0.0-s_2.10.jar,/mnt/parso-2.0.7.jar,/mnt/terajdbc4.jar,/mnt/tdgssconfig.jar,/mnt/minimal-json-0.9.4.jar,/mnt/jackson-core-asl-1.9.13.jar,/mnt/commons-compress-1.18.jar,/mnt/avro-1.7.6.jar,/mnt/jackson-mapper-asl-1.9.13.jar,/mnt/paranamer-2.3.jar,/mnt/slf4j-api-1.7.25.jar,/mnt/spark-avro_2.11-2.4.3.jar,/mnt/snappy-java-1.0.5.jar,/mnt/xz-1.0.jar,/mnt/spark-redshift_2.11-3.0.0.jar,/mnt/scala-library-2.11.7.jar,/mnt/RedshiftJDBC42-1.2.7.1003.jar,/mnt/commons-codec-1.10.jar,/mnt/commons-collections4-4.2.jar,/mnt/curvesapi-1.04.jar,/mnt/icu4j-4.6.jar,/mnt/jackson-annotations-2.8.0.jar,/mnt/jackson-core-2.8.10.jar,/mnt/jackson-databind-2.8.10.jar,/mnt/java-cup-10k.jar,/mnt/joda-convert-2.0.1.jar,/mnt/joda-time-2.9.9.jar,/mnt/org.eclipse.wst.xml.xpath2.processor-2.1.100.jar,/mnt/poi-4.0.0.jar,/mnt/poi-ooxml-4.0.0.jar,/mnt/poi-ooxml-schemas-4.0.0.jar,/mnt/scala-library-2.11.11.jar,/mnt/scala-xml_2.11-1.1.1.jar,/mnt/slf4j-api-1.7.12.jar,/mnt/spark-excel_2.11-0.11.1.jar,/mnt/spoiwo_2.11-1.4.1.jar,/mnt/stax-api-1.0.1.jar,/mnt/xerces2-xsd11-2.11.1.jar,/mnt/xlsx-streamer-2.0.0.jar,/mnt/xml-apis-1.4.01.jar,/mnt/xml-resolver-1.2.jar,/mnt/xmlbeans-3.0.1.jar
```

Steps to Reproduce

  1. Open a pyspark shell using pyspark in the linux terminal.
  2. Paste the following in the shell:

    df = spark.read.format("com.crealytics.spark.excel") \
    .option("useHeader", "false") \
    .option("maxRowsInMemory", 50) .\
    load("s3://<my_s3_bucket>/<directory>/simple_data.xlsx")
    
    df.show()

Deleting the spark.jars list from spark-defaults.conf and then using the command spark-submit --packages com.crealytics:spark-excel_2.11:0.11.1 excel_import.py works fine as expected. excel_import.py contains the following snippet:

```
from pyspark.context import SparkContext
from pyspark.sql.session import SparkSession
sc = SparkContext('local')
spark = SparkSession(sc)

df = spark.read.format("com.crealytics.spark.excel") \
.option("useHeader", "false") \
.option("maxRowsInMemory", 50) .\
load("s3://<my_s3_bucket>/<directory>/simple_data.xlsx")

df.show()
```

My Environment

I am using an AWS EMR cluster with version 5.28.1. I am using spark-excel version 0.11.1 since it is the only version I found that supports maxRowsInMemory correctly in this environment. I tried v0.13.1, but as mentioned earlier, maxRowsInMemory does not work properly there, i.e. the dataframe is generated with just the first row.

nightscape commented 4 years ago

Hi @anishsatalkar, thanks for the detailed report. Could you try the current version of spark-excel? There were some changes w.r.t. shading packages. When you update, please have a look at the CHANGELOG and the updated README.

anishsatalkar commented 4 years ago

Hi @nightscape, thanks for the quick reply. I would have liked to use the latest version; however, maxRowsInMemory doesn't seem to work on either EMR or my local Windows machine.

I am attaching a detailed log of all the versions on which I tested maxRowsInMemory while loading the Excel file in question.

Steps to reproduce (these were performed on my Windows-based machine; the same behavior was observed on the EMR cluster as well):

  1. Start Powershell
  2. Run the command spark-submit --packages com.crealytics:spark-excel_2.11:0.13.1 excel_import.py
  3. Decrement the version one version at a time and observe the output.

Note that I am not forcing any jars to override the execution behavior. The command in step 2 downloaded the required jars with the correct versions as and when required.

Files which I have used:

  1. My spark-defaults.conf file (can be ignored, since spark-submit was overriding the spark-excel package anyway): spark-defaults.conf.txt

  2. My excel_import.py file:

    from pyspark.context import SparkContext
    from pyspark.sql.session import SparkSession
    sc = SparkContext('local')
    spark = SparkSession(sc)
    
    df = spark.read.format("com.crealytics.spark.excel") \
        .option("header", "false") \
        .option("maxRowsInMemory", 50) .\
        load("file:///C:\\Users\\<user>\\Documents\\simple_data.xlsx")
    
    df.show()

    excel_import.py.txt

  3. The log with all the outputs corresponding to the versions from 0.13.1 down to 0.11.1: spark_excel_runs.log

    The format of this log is: [version_0] [its output] [version_1] [its output] ...

  4. My simple_data.xlsx file: simple_data.xlsx

My setup

  1. OS : Windows 10
  2. Python version : 2.7.16
  3. Spark : 2.4.4
  4. spark-excel : Multiple versions as stated earlier.

Am I missing something here?

Note: wherever required, I changed the options provided to the load method to match the spark-excel version.

anishsatalkar commented 4 years ago

Does anybody have any idea why maxRowsInMemory is failing for that Excel file on all versions after v0.11.1 for Scala 2.11 based Spark?

nightscape commented 4 years ago

Unfortunately I don't...

anishsatalkar commented 4 years ago

@nightscape Can you provide me with a simple Excel file that I can load into a dataframe using the maxRowsInMemory option on versions 2.11:0.11.1 and above?

anishsatalkar commented 4 years ago

I'll further simplify my issue.

With spark-excel 2.11:0.13.1, maxRowsInMemory is not working for creating a dataframe from an Excel file. I have attached detailed logs of all the exceptions/outputs I am receiving for this version (as well as for all the versions down to 2.11:0.12.0) in the file spark_excel_runs.log in my previous message.

With spark-excel 2.11:0.11.1, maxRowsInMemory works perfectly well for creating a dataframe from that same Excel file.

However, spark-excel 2.11:0.11.1 is not suitable for me, because:

  1. We download the required jars on our EMR cluster from one of our own S3 buckets, which are required for spark-excel 2.11:0.11.1.
  2. We update the spark-defaults.conf file and set the paths in spark.jars to these downloaded jars. This helps us avoid any case in which the original repository of these jars changes; we have faced similar issues earlier.
  3. Even this works fine until we add the path of the jar poi-ooxml-4.0.0.jar to the spark.jars list. This gives the following error:
    Traceback (most recent call last):
      File "C:/Users/as21616/Backup_Codes/excel_import.py", line 9, in <module>
        load("file:///C:\\Users\\as21616\\Documents\\simple_data.xlsx")
      File "C:\Users\as21616\Spark\python\lib\pyspark.zip\pyspark\sql\readwriter.py", line 166, in load
      File "C:\Users\as21616\Spark\python\lib\py4j-0.10.7-src.zip\py4j\java_gateway.py", line 1257, in __call__
      File "C:\Users\as21616\Spark\python\lib\pyspark.zip\pyspark\sql\utils.py", line 79, in deco
    pyspark.sql.utils.IllegalArgumentException: u'InputStream of class class org.apache.commons.compress.archivers.zip.ZipFile$1 is not implementing InputStreamStatistics.'

Thus, unless maxRowsInMemory works for 0.13.1 in my case, the only option I have is to use 0.11.1. I am aware that 0.13.1 solves issues related to shading packages. So @nightscape, can you confirm whether my understanding is correct that my only option is to wait for a newer version of spark-excel? Thanks!

nightscape commented 4 years ago

Hey @anishsatalkar, thanks for the detailed report! I would love to help (especially because you've gone to great lengths to help yourself), but I don't know when I'll find the time... Btw, have you checked if the maxRowsInMemory issue you have might be related to https://github.com/crealytics/spark-excel/issues/223?

nightscape commented 4 years ago

@anishsatalkar can you try 0.13.2+1-e186f935-SNAPSHOT? It's based on @pjfanning's fork of excel-streaming-reader and might fix your problem (wishful thinking :wink:).
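If the snapshot is published to a snapshots repository that spark-submit can reach, a minimal sketch of pulling it in would look like the command below (the Sonatype snapshots URL is an assumption; adjust it to wherever the snapshot is actually hosted):

```
# Assumption: the snapshot build is resolvable from the Sonatype snapshots repository.
spark-submit \
  --repositories https://oss.sonatype.org/content/repositories/snapshots \
  --packages com.crealytics:spark-excel_2.11:0.13.2+1-e186f935-SNAPSHOT \
  excel_import.py
```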

anishsatalkar commented 4 years ago

Hi @nightscape, thanks for the suggestion. I used 0.13.2+1-e186f935-SNAPSHOT with all its dependencies (assuming that only excel-streaming-reader v2.3.3 and its dependencies were to be added). I still get only the top row in the dataframe from the Excel file.

anishsatalkar commented 4 years ago

I found the solution.

For the initial problem, where I was getting an exception of py4j.protocol.Py4JJavaError: An error occurred while calling o83.load. : java.lang.NoClassDefFoundError: org/apache/commons/compress/utils/InputStreamStatistics for spark-excel version 2.11:0.11.1, the following worked:

I had to add the path of commons-compress-1.18.jar to spark.driver.extraClassPath and spark.executor.extraClassPath. Ideally, adding the path to spark.jars should have sufficed, as the documentation suggests that setting the path there adds the jar to the classpaths of both the driver and the executor. However, adding it explicitly to both classpaths is what worked in my case.
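For reference, a minimal sketch of the same fix passed on the command line instead of via spark-defaults.conf (the /mnt/commons-compress-1.18.jar path is taken from the spark.jars list above and is only an example of where the jar might live on your cluster):

```
# Assumption: commons-compress-1.18.jar has been downloaded to /mnt on every node.
spark-submit \
  --conf spark.driver.extraClassPath=/mnt/commons-compress-1.18.jar \
  --conf spark.executor.extraClassPath=/mnt/commons-compress-1.18.jar \
  excel_import.py
```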

Regarding maxRowsInMemory not working for versions above 2.11:0.11.1, I still don't have a resolution on that, and maybe a separate issue can be raised for it. But for now, I'll be sticking with 2.11:0.11.1, and this case can be resolved now.

Thank you @nightscape for the help.

nightscape commented 4 years ago

Ok, thanks for writing the summary! Might help others in the future :+1:

kamalusha commented 2 years ago

> I had to add the path of commons-compress-1.18.jar to spark.driver.extraClassPath and spark.executor.extraClassPath. [...]

Hi, I tried your suggestion, but somehow it is throwing an error for me. Is my command below correct?

spark-submit --conf="spark.driver.extraClassPath=/lib/com.crealytics_spark-excel_2.11-0.12.0.jar" --conf="spark.executor.extraClassPath=/lib/com.crealytics_spark-excel_2.11-0.12.0.jar" dist/excel_success.py

ERROR: py4j.protocol.Py4JJavaError: An error occurred while calling o39.load. : java.lang.ClassNotFoundException: Failed to find data source: com.crealytics.spark.excel. Please find packages at http://spark.apache.org/third-party-projects.html

nightscape commented 2 years ago

I would rather try something like:

spark-submit --packages com.crealytics:spark-excel_2.11:0.12.0
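As a sketch, the full invocation would look something like the command below (dist/excel_success.py is the script path from the command above; --packages lets spark-submit resolve spark-excel together with its transitive dependencies instead of relying on a single jar on the extra class path):

```
spark-submit \
  --packages com.crealytics:spark-excel_2.11:0.12.0 \
  dist/excel_success.py
```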