Closed SrikaranJangidi closed 3 years ago
@SrikaranJangidi This package should work with all versions of Spark 2, can you describe the error you are getting. (Logs, etc)
@thesuperzapper Thanks for responding. Below is the information. Please let us know if more information is needed.
We are trying to read the file datetime.sas7bdat, which is available on this GitHub page: https://github.com/saurfang/spark-sas7bdat, and we are getting the error below. Code used:
```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("yarn").appName("sort data") \
    .config("spark.jars.packages", "saurfang:spark-sas7bdat:2.1.0-s_2.11") \
    .config('spark.driver.memory', '3g') \
    .config('spark.executor.memory', '3g') \
    .config('spark.executor.cores', '2') \
    .config('spark.executor.instances', '1') \
    .getOrCreate()

spark.read.format("com.github.saurfang.sas.spark") \
    .load("/hdfs/path/datetime.sas7bdat", inferLong=True)
```
```
Py4JJavaError Traceback (most recent call last)
```
@SrikaranJangidi that error is usually associated with a corrupt file, can you:
@thesuperzapper I checked with my developer and below is his response. He tried with another file as well. Please advise.
The file is not corrupted as I am able to read it using pandas.
```python
import pandas as pd

df = pd.read_sas('datetime.sas7bdat', format='sas7bdat')
```
Also, I tried to read another file, ag121a_supp.sample, which is 715 KB in size, and I still get the same error. This file is also available on GitHub: https://github.com/saurfang/spark-sas7bdat/tree/master/src/test/resources
```
Py4JJavaError: An error occurred while calling o120.load.
: java.util.concurrent.TimeoutException: Timed out after 60 sec while reading file metadata, file might be corrupt. (Change timeout with 'metadataTimeout' paramater)
	at com.github.saurfang.sas.spark.SasRelation.inferSchema(SasRelation.scala:189)
	at com.github.saurfang.sas.spark.SasRelation.
```
```python
import os
os.environ['PYSPARK_SUBMIT_ARGS'] = '--jars /opt/anaconda3/lib/python3.7/site-packages/pyspark/jars/spark-sas7bdat-2.1.0-s_2.11.jar pyspark-shell'

import findspark
findspark.init("/opt/cloudera/parcels/SPARK2/lib/spark2/")

from pyspark.sql import SparkSession

spark = SparkSession.builder.master("yarn").appName("sup_data") \
    .config("spark.jars.packages", "saurfang:spark-sas7bdat:2.1.0-s_2.11") \
    .config('spark.driver.memory', '3g') \
    .config('spark.executor.memory', '3g') \
    .config('spark.executor.cores', '2') \
    .config('spark.executor.instances', '1') \
    .getOrCreate()

df = spark.read.format("com.github.saurfang.sas.spark").load("ag121a_supp_sample.sas7bdat")
```
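The timeout message in the trace names a reader option, `metadataTimeout`. A hedged sketch of raising it follows; the option name is taken from the error text itself, the 600-second value is an arbitrary illustration, and the `sas_reader_options` helper is hypothetical. Note that a missing parso jar can produce this same timeout (as other comments in this thread report), in which case raising the timeout only masks the real problem.

```python
# Sketch: raise the metadata-read timeout named in the error message.
# The option name "metadataTimeout" comes from the error text; the
# 600-second default here is an arbitrary illustration.
def sas_reader_options(timeout_sec=600):
    """Build option strings for the spark-sas7bdat reader (hypothetical helper)."""
    return {"inferLong": "true", "metadataTimeout": str(timeout_sec)}

# Usage (requires a live SparkSession with the package on the classpath):
# df = (spark.read.format("com.github.saurfang.sas.spark")
#         .options(**sas_reader_options())
#         .load("/hdfs/path/datetime.sas7bdat"))
print(sas_reader_options())
```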
Srikaran
Hi. Any update on this please?
Thanks, Srikaran.
Hi, I am running into this issue as well.
```
java.util.concurrent.TimeoutException: Timed out after 60 sec while reading file metadata
```
Thanks, Terry
Make sure the parso library is available, though I wouldn't expect that to be the cause of this particular error. FWIW, I haven't had any problems with this library on Spark 2.2, 2.3, or 2.4.
Hello - Thanks for the comment re: parso library inclusion. After I included parso, I now get the following:
```
py4j.protocol.Py4JJavaError: An error occurred while calling o66.load.
: java.lang.ClassCastException: java.util.Arrays$ArrayList cannot be cast to java.util.Set
```
Hi - The above was caused by using a parso version older than 2.0.10; it is imperative that 2.0.10 is used. I missed that requirement. The issue above was resolved after referencing parso 2.0.10.
Also, initially I did not reference the parso jar in the execution, so it was producing:

```
java.util.concurrent.TimeoutException: Timed out after 60 sec while reading file metadata
```

After referencing parso, the above exception went away.
Including parso 2.0.10 with Spark 2.4.5 and restarting fixed this type of error for me. This works really well now!
Can anyone tell me how to download and import Parso? I am getting this same issue and think it is because I don't have parso installed properly
@Speccles96, just download the jar from Maven and pass it with `spark-submit --jars`.
You can find the link at the top of the README.md under "requirements".
EDIT: or, if your Spark cluster has internet access, you can pass the Maven coordinates with `spark-submit --packages`.
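For notebook sessions where `spark-submit` flags are awkward, the same effect can be had through the `PYSPARK_SUBMIT_ARGS` environment variable, as earlier snippets in this thread do. A minimal sketch, assuming both jars have been downloaded locally (the paths below are placeholders):

```python
import os

# Placeholder paths: point these at wherever the jars were downloaded.
jars = [
    "/path/to/spark-sas7bdat-2.1.0-s_2.11.jar",
    "/path/to/parso-2.0.10.jar",
]
# Must be set before the SparkSession (and its JVM) is created.
os.environ["PYSPARK_SUBMIT_ARGS"] = "--jars " + ",".join(jars) + " pyspark-shell"
print(os.environ["PYSPARK_SUBMIT_ARGS"])
```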
@thesuperzapper Maybe I'm not fully understanding, but how would I run `spark-submit --jars` from my Python script? What I am doing is starting up a Jupyter notebook and running the code below. I put parso-2.0.10.jar in my Java\lib path with all of the other jars.
Spark version: 2.4.6, Scala version: 2.12.2, Java version: 1.8.0_261
```python
import findspark
findspark.init()

from pyspark.sql.session import SparkSession

spark = SparkSession.builder \
    .config("spark.jars.packages", "saurfang:spark-sas7bdat:2.0.0-s_2.11") \
    .enableHiveSupport().getOrCreate()

df = spark.read.format('com.github.saurfang.sas.spark') \
    .load(r'D:\IvyDB\opprcd\opprcd2019.sas7bdat')
```
I am trying to replicate what was done in this article http://blog.rubypdf.com/2018/10/12/how-two-read-sas-data-with-pyspark/
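One thing worth checking in the snippet above: the package coordinate ends in `-s_2.11`, while the reported Scala version is 2.12.2, and a Spark package's Scala suffix must match the cluster's Scala binary version. A hedged helper for deriving a matching coordinate; the `2.12` coordinate in the mapping is an assumption and should be confirmed against the package README:

```python
# Sketch: pick a spark-sas7bdat coordinate whose Scala suffix matches the
# cluster's Scala binary version. The mapping is an assumption; check the
# package README for the coordinates that actually exist.
COORDINATES = {
    "2.11": "saurfang:spark-sas7bdat:2.1.0-s_2.11",
    "2.12": "saurfang:spark-sas7bdat:3.0.0-s_2.12",  # assumed coordinate
}

def sas_package_for(scala_version):
    """Return a coordinate for a full Scala version string like '2.12.2'."""
    binary = ".".join(scala_version.split(".")[:2])
    return COORDINATES[binary]

print(sas_package_for("2.12.2"))
```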
@saurfang can you close this?
@thesuperzapper I added you as a collaborator and you shall feel free to close any issues that you feel are already addressed or stale. ty!
Hi. We have a Spark cluster running Spark 2.3.0. The jar spark-sas7bdat-2.1.0-s_2.11.jar is not working for Spark 2.3.0, and it seems it works for Spark 2.2.0. Please suggest whether there is a workaround for Spark 2.3.0, or whether we have to downgrade to Spark 2.2.0.