Closed anishsatalkar closed 4 years ago
Hi @anishsatalkar, thanks for the detailed report. Could you try the current version of spark-excel? There were some changes w.r.t. shading packages. When you update, please have a look at the CHANGELOG and the updated README.
Hi @nightscape, thanks for the quick reply. I would have liked to use the latest version; however, maxRowsInMemory does not seem to be working on either EMR or my local Windows machine.
I am attaching a detailed log of all the versions on which I tested maxRowsInMemory while loading the excel file in question.
spark-submit --packages com.crealytics:spark-excel_2.11:0.13.1 excel_import.py
Note, I am not forcing any jars to override any behavior of execution. The command in step 2 kept downloading the required jars with the correct versions as and when required.
My spark-defaults.conf file (can be ignored since spark-submit was overriding the spark-excel package anyways): spark-defaults.conf.txt
My excel_import.py file:
from pyspark.context import SparkContext
from pyspark.sql.session import SparkSession
sc = SparkContext('local')
spark = SparkSession(sc)
df = spark.read.format("com.crealytics.spark.excel") \
    .option("header", "false") \
    .option("maxRowsInMemory", 50) \
    .load("file:///C:\\Users\\<user>\\Documents\\simple_data.xlsx")
df.show()
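As an aside, Windows file URIs in the load() call are easy to get wrong; the URI can be built with the standard library instead of hand-escaping backslashes. A small sketch (the path below is a hypothetical stand-in for the real one):

```python
from pathlib import PureWindowsPath

# PureWindowsPath understands drive letters and backslashes even when
# this code runs on another OS; as_uri() yields the file:/// form that
# Spark's load() accepts.
xlsx = PureWindowsPath(r"C:\Users\someuser\Documents\simple_data.xlsx")
print(xlsx.as_uri())  # file:///C:/Users/someuser/Documents/simple_data.xlsx
```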
The log with all the outputs corresponding to the versions from 0.13.1 down to 0.11.1: spark_excel_runs.log
The format of this log is:
[version_0] [its output]
[version_1] [its output]
...
My simple_data.xlsx file: simple_data.xlsx
Am I missing something here?
Note : Wherever required, I was changing the options provided to the load method corresponding to the spark-excel version.
Does anybody have any idea why maxRowsInMemory fails for that excel file on all versions after v0.11.1 for Scala 2.11 based Spark?
Unfortunately I don't...
@nightscape Can you provide me with a simple excel file which I can load into a dataframe using the maxRowsInMemory option on versions 2.11:0.11.1 and above?
I'll further simplify my issue.
With spark-excel 2.11:0.13.1, maxRowsInMemory is not working for creating a dataframe from an excel file. I have attached detailed logs of all the exceptions / outputs I am receiving for this version (as well as for all versions down to 2.11:0.12.0) in the file spark_excel_runs.log in my previous message.
With spark-excel 2.11:0.11.1, maxRowsInMemory works perfectly well for creating a dataframe from that same excel file.
However, spark-excel 2.11:0.11.1 is not suitable for me because of how we manage dependencies: as a policy, we download all required jars beforehand and set their paths in the spark.jars section of the spark-defaults.conf file. This helps us avoid any case in which the original repository of these jars is changed (we have faced similar issues earlier). Adding spark-excel 2.11:0.11.1 and its dependencies to the spark.jars list gives an error:
Traceback (most recent call last):
File "C:/Users/as21616/Backup_Codes/excel_import.py", line 9, in <module>
load("file:///C:\\Users\\as21616\\Documents\\simple_data.xlsx")
File "C:\Users\as21616\Spark\python\lib\pyspark.zip\pyspark\sql\readwriter.py", line 166, in load
File "C:\Users\as21616\Spark\python\lib\py4j-0.10.7-src.zip\py4j\java_gateway.py", line 1257, in
__call__
File "C:\Users\as21616\Spark\python\lib\pyspark.zip\pyspark\sql\utils.py", line 79, in deco
pyspark.sql.utils.IllegalArgumentException: u'InputStream of class class
org.apache.commons.compress.archivers.zip.ZipFile$1 is not implementing InputStreamStatistics.'
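Errors like this one typically point at a commons-compress version conflict: InputStreamStatistics was only introduced in commons-compress 1.17, and POI picks up whichever copy wins on the classpath. A small, Spark-independent sketch (file names hypothetical) for checking which jars actually contain the class:

```python
import zipfile

CLASS_ENTRY = "org/apache/commons/compress/utils/InputStreamStatistics.class"

def jars_containing(class_entry, jar_paths):
    """Return the jars whose entry listing contains the given class file."""
    hits = []
    for path in jar_paths:
        with zipfile.ZipFile(path) as jar:  # a jar is just a zip archive
            if class_entry in jar.namelist():
                hits.append(path)
    return hits

# Hypothetical usage:
# print(jars_containing(CLASS_ENTRY, glob.glob("/mnt/*.jar")))
```

Any jar on the list that predates commons-compress 1.17 and shadows the newer one is a likely culprit.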
Thus, unless maxRowsInMemory works for 0.13.1 in my case, the only option I have is to use 0.11.1. I am aware that 0.13.1 solves issues related to shading packages. So @nightscape, can you confirm whether my understanding is correct that my only option is to wait for a newer version of spark-excel? Thanks!
Hey @anishsatalkar, thanks for the detailed report!
I would love to help (especially because you've gone to great lengths to help yourself), but I don't know when I'll find the time...
Btw, have you checked if the maxRowsInMemory issue you have might be related to https://github.com/crealytics/spark-excel/issues/223?
@anishsatalkar can you try 0.13.2+1-e186f935-SNAPSHOT? It's based on @pjfanning's fork excel-streaming-reader and might fix your problem (wishful thinking :wink:).
Hi @nightscape , thanks for the suggestion. I used 0.13.2+1-e186f935-SNAPSHOT with all its dependencies (assuming that only excel-streaming-reader v2.3.3 and its dependencies were to be added). I still get only the top row in the dataframe from the excel file.
For the initial problem, where I was getting the exception py4j.protocol.Py4JJavaError: An error occurred while calling o83.load. : java.lang.NoClassDefFoundError: org/apache/commons/compress/utils/InputStreamStatistics with spark-excel version 2.11:0.11.1, the following worked:
I had to add the path of commons-compress-1.18.jar to spark.driver.extraClassPath and spark.executor.extraClassPath. Ideally, adding the path to spark.jars should have sufficed, since the documentation suggests that setting the path there adds the jar to the classpaths of both the driver and the executor; however, adding it explicitly to both classpaths is what worked in my case.
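For reference, a sketch of what the relevant spark-defaults.conf entries could look like under this workaround (the /mnt/ location is an assumption taken from the jar list further down in this thread):

```
spark.driver.extraClassPath    /mnt/commons-compress-1.18.jar
spark.executor.extraClassPath  /mnt/commons-compress-1.18.jar
```

Both keys take a classpath string, so additional jars can be appended using the platform path separator (: on Linux).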
Regarding maxRowsInMemory not working for versions above 2.11:0.11.1, I still don't have a resolution on that, and maybe a separate issue can be raised for it. But for now, I'll be sticking with 2.11:0.11.1, and this case can be resolved now.
Thank you @nightscape for the help.
Ok, thanks for writing the summary! Might help others in the future :+1:
Hi, I tried your suggestion, but somehow it is throwing an error for me. Is my command below correct?
spark-submit --conf="spark.driver.extraClassPath=
ERROR: py4j.protocol.Py4JJavaError: An error occurred while calling o39.load. : java.lang.ClassNotFoundException: Failed to find data source: com.crealytics.spark.excel. Please find packages at http://spark.apache.org/third-party-projects.html
I would rather try something like
spark-submit --packages com.crealytics:spark-excel_2.11:0.12.0
I am trying to load a very simple excel file. I am using version 0.11.1 on PySpark 2.4.3. There doesn't seem to be a problem with the options I am providing, since I am following the docs of v0.11.1.
I am getting the following error (I have also added some additional info so that debugging becomes easier).
Here is the spark.jars list I am providing in spark-defaults.conf file :
spark.jars /mnt/ojdbc6.jar,/mnt/mysql-connector-java.jar,/mnt/sqljdbc41.jar,/mnt/zs_edge_s3_direct_output_committer_2.11-1.0.jar,/usr/lib/hive/lib/hive-contrib.jar,/mnt/FuzzyJars/hive-udf-textmining-1.0-SNAPSHOT-jar-with-dependencies.jar,/mnt/postgresql-42.2.2.jar,/mnt/hive-hcatalog-core-1.1.1.jar,/mnt/scala-library-2.10.5.jar,/mnt/spark-xml_2.10-0.4.1.jar,/mnt/spark-sas7bdat-2.0.0-s_2.10.jar,/mnt/parso-2.0.7.jar,/mnt/terajdbc4.jar,/mnt/tdgssconfig.jar,/mnt/minimal-json-0.9.4.jar,/mnt/jackson-core-asl-1.9.13.jar,/mnt/commons-compress-1.18.jar,/mnt/avro-1.7.6.jar,/mnt/jackson-mapper-asl-1.9.13.jar,/mnt/paranamer-2.3.jar,/mnt/slf4j-api-1.7.25.jar,/mnt/spark-avro_2.11-2.4.3.jar,/mnt/snappy-java-1.0.5.jar,/mnt/xz-1.0.jar,/mnt/spark-redshift_2.11-3.0.0.jar,/mnt/scala-library-2.11.7.jar,/mnt/RedshiftJDBC42-1.2.7.1003.jar,/mnt/commons-codec-1.10.jar,/mnt/commons-collections4-4.2.jar,/mnt/curvesapi-1.04.jar,/mnt/icu4j-4.6.jar,/mnt/jackson-annotations-2.8.0.jar,/mnt/jackson-core-2.8.10.jar,/mnt/jackson-databind-2.8.10.jar,/mnt/java-cup-10k.jar,/mnt/joda-convert-2.0.1.jar,/mnt/joda-time-2.9.9.jar,/mnt/org.eclipse.wst.xml.xpath2.processor-2.1.100.jar,/mnt/poi-4.0.0.jar,/mnt/poi-ooxml-4.0.0.jar,/mnt/poi-ooxml-schemas-4.0.0.jar,/mnt/scala-library-2.11.11.jar,/mnt/scala-xml_2.11-1.1.1.jar,/mnt/slf4j-api-1.7.12.jar,/mnt/spark-excel_2.11-0.11.1.jar,/mnt/spoiwo_2.11-1.4.1.jar,/mnt/stax-api-1.0.1.jar,/mnt/xerces2-xsd11-2.11.1.jar,/mnt/xlsx-streamer-2.0.0.jar,/mnt/xml-apis-1.4.01.jar,/mnt/xml-resolver-1.2.jar,/mnt/xmlbeans-3.0.1.jar
Steps to Reproduce
Run pyspark in the Linux terminal.
Paste the following in the shell:
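The pasted snippet did not survive in this thread; judging from the excel_import.py shown earlier, it was presumably along these lines (a sketch only: it assumes a pyspark shell where spark is already defined, and the path is a placeholder, so it is not runnable standalone):

```python
# Presumably the same read as in excel_import.py above.
df = spark.read.format("com.crealytics.spark.excel") \
    .option("header", "false") \
    .option("maxRowsInMemory", 50) \
    .load("file:///path/to/simple_data.xlsx")
df.show()
```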
Deleting the spark.jars list from spark-defaults.conf and then using the command spark-submit --packages com.crealytics:spark-excel_2.11:0.11.1 excel_import.py works fine as expected. excel_import.py contains the following snippet:
My Environment
I am using an AWS EMR cluster with version 5.28.1. I am using spark-excel version 0.11.1, since it is the only version I found that supports maxRowsInMemory correctly in this environment. I tried v0.13.1, but as mentioned earlier, maxRowsInMemory is not working properly, i.e. the dataframe is generated with just the first row.