imsanjoykb / Data-Science-Regular-Bootcamp

Regular practice on Data Science, Machine Learning, and Deep Learning, solving ML project problems and analytical issues, and regularly boosting my knowledge. The goal is to help learners with learning resources in the Data Science field.
https://imsanjoykb.github.io/
MIT License

Read File in PySpark #160

Open imsanjoykb opened 2 years ago

imsanjoykb commented 2 years ago
# Build the SparkSession here so the snippet is self-contained
# (the original comment said `spark` came from a previous example).
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("ReadTextFile").getOrCreate()
sc = spark.sparkContext  # SparkContext handle; not needed for the reads below

# `path` points at a text dataset: either a single text file
# or a directory of text files.
path = "examples/src/main/resources/people.txt"

df1 = spark.read.text(path)
df1.show()
# +-----------+
# |      value|
# +-----------+
# |Michael, 29|
# |   Andy, 30|
# | Justin, 19|
# +-----------+
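# A minimal parsing sketch (an addition, not part of the original issue):
# split the raw `value` column into typed columns with the functions API.
# The column names `name` and `age` are illustrative assumptions.
from pyspark.sql import functions as F

parsed = df1.select(
    F.split(F.col("value"), ", ").getItem(0).alias("name"),
    F.split(F.col("value"), ", ").getItem(1).cast("int").alias("age"),
)
parsed.show()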

# Use the 'lineSep' option to define a custom line separator.
# By default, `\r`, `\r\n`, and `\n` are all treated as line separators.
df2 = spark.read.text(path, lineSep=",")
df2.show()
# +-----------+
# |      value|
# +-----------+
# |    Michael|
# |   29\nAndy|
# | 30\nJustin|
# |       19\n|
# +-----------+
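# The writer side also accepts 'lineSep' in recent Spark versions (2.4+).
# A sketch; "output_crlf" is an illustrative path, not from the original:
df1.write.text("output_crlf", lineSep="\r\n")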

# Use the 'wholetext' option to read each input file as a single row.
df3 = spark.read.text(path, wholetext=True)
df3.show()
# +--------------------+
# |               value|
# +--------------------+
# |Michael, 29\nAndy...|
# +--------------------+
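# With wholetext=True there is one row per input file, so the full file
# content can be pulled into a plain Python string (a usage sketch):
full_text = df3.first()["value"]
print(full_text)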

# "output" is a folder which contains multiple text files and a _SUCCESS file.
df1.write.csv("output")
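# Reading the folder back picks up every part file under "output"
# (a round-trip check, not in the original):
spark.read.csv("output").show()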

# You can specify the compression format using the 'compression' option.
df1.write.text("output_compressed", compression="gzip")
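# Gzip part files are decompressed transparently on read, so the
# compressed output round-trips without extra options (a sketch):
spark.read.text("output_compressed").show()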