A library for reading SAS data (.sas7bdat) with Spark.
The latest jar can be downloaded from spark-packages.
Version | Scala Version | Spark Version |
---|---|---|
3.0.0-s_2.11 | 2.11.x | 2.4.x |
3.0.0-s_2.12 | 2.12.x | 3.0.x |
NOTE: This package does not support writing .sas7bdat files.
- `extractLabel` (Default: `false`)
- `forceLowercaseNames` (Default: `false`)
- `inferDecimal` (Default: `false`)
- `inferDecimalScale` (Default: each column's format width)
- `inferFloat` (Default: `false`)
- `inferInt` (Default: `false`)
- `inferLong` (Default: `false`)
- `inferShort` (Default: `false`)
- `metadataTimeout` (Default: `60`)
- `minSplitSize` (Default: `mapred.min.split.size`)
- `maxSplitSize` (Default: `mapred.max.split.size`)
In Scala:

```scala
val df = spark.read
  .format("com.github.saurfang.sas.spark")
  .option("forceLowercaseNames", true)
  .option("inferLong", true)
  .load("cars.sas7bdat")

df.write.format("csv").option("header", "true").save("newcars.csv")
```
You can also use the implicit readers:
```scala
import com.github.saurfang.sas.spark._

// DataFrameReader
val df = spark.read.sas("cars.sas7bdat")
df.write.format("csv").option("header", "true").save("newcars.csv")

// SQLContext
val df2 = sqlContext.sasFile("cars.sas7bdat")
df2.write.format("csv").option("header", "true").save("newcars.csv")
```
(Note: you cannot use parameters like `inferLong` with the implicit readers.)
In Python:

```python
df = spark.read.format("com.github.saurfang.sas.spark") \
    .load("cars.sas7bdat", forceLowercaseNames=True, inferLong=True)
df.write.csv("newcars.csv", header=True)
```
In R:

```r
df <- read.df("cars.sas7bdat", source = "com.github.saurfang.sas.spark", forceLowercaseNames = TRUE, inferLong = TRUE)
write.df(df, path = "newcars.csv", source = "csv", header = TRUE)
```
SAS data can be queried in pure SQL by registering the data as a (temporary) table.
```sql
CREATE TEMPORARY VIEW cars
USING com.github.saurfang.sas.spark
OPTIONS (path="cars.sas7bdat")
```
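Once registered, the view can be queried like any other table. A sketch from Scala (the column name `mpg` is illustrative, not taken from the source):

```scala
// Query the temporary view registered above.
// The column "mpg" is a hypothetical example column.
val fastCars = spark.sql("SELECT * FROM cars WHERE mpg > 30")
fastCars.show()
```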
We include a simple `SasExport` Spark program that converts .sas7bdat files to .csv or .parquet:

```
sbt "run input.sas7bdat output.csv"
sbt "run input.sas7bdat output.parquet"
```
To achieve more parallelism, use the `spark-submit` script to run it on a Spark cluster. If you don't have a Spark cluster, you can still run in local mode and take advantage of multiple cores:

```
spark-shell --master local[4] --packages saurfang:spark-sas7bdat:3.0.0-s_2.12
```
`spark-csv` writes out `null` as the string "null" in CSV text output. This means that when you read the data back, a string column may contain the literal string "null" rather than a true `null`. The safest option is to export in Parquet format, where `null` is properly recorded. See https://github.com/databricks/spark-csv/pull/147 for an alternative solution.

This project would not be possible without parso and its continued improvements, along with generous contributions from @mulya, @thesuperzapper, and many others. We are honored to be a recipient of the 2020 WiseWithData ELEVATE Awards and appreciate their generous donations.