section-engineering-education / engineering-education

"Section's Engineering Education (EngEd) Program is dedicated to offering a unique quality community experience for computer science university students."
Apache License 2.0

[Languages] Creating a PySpark DataFrame: A Beginner's Guide #5003

Closed: FranciscaNg closed this issue 2 years ago

FranciscaNg commented 2 years ago

Topic Suggestion

Creating a PySpark DataFrame: A Beginner's Guide

Proposed article introduction

Spark is a cluster computing platform that lets us distribute data and run calculations across several nodes of a cluster. Distributing the data makes large datasets easier to process. Each node can be thought of as a separate computer that holds a subset of the data and performs part of the calculations that take place during dataset operations. In addition to Scala, Spark supports the Java, Python, R, and SQL programming languages.
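To make the idea of nodes working on subsets of the data concrete, here is a minimal plain-Python sketch. It does not use Spark at all: the "nodes" are just list slices, and `data` and `num_nodes` are hypothetical values chosen for illustration.

```python
# Plain-Python illustration only; no Spark involved.
data = list(range(1, 11))  # hypothetical dataset: the numbers 1..10
num_nodes = 3              # hypothetical number of nodes in the cluster

# Split the data into one partition per "node" (round-robin assignment).
partitions = [data[i::num_nodes] for i in range(num_nodes)]

# Each node computes a partial result on its own subset of the data.
partial_sums = [sum(part) for part in partitions]

# The partial results are then combined into the final answer.
total = sum(partial_sums)

print(partitions)
print(partial_sums, total)
```

A real cluster also has to move data between machines and recover from failures, which this sketch ignores.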

Key takeaways

Article quality

In this article, we will learn how to create PySpark DataFrames. It will introduce PySpark DataFrames and cover several techniques for creating them. This article differs from others in that I'll generate a PySpark DataFrame from an existing RDD, from a list, and from external file sources, using Google Colaboratory for practice. As a bonus, we'll cover certain approaches that aren't already covered in the published articles on PySpark. The article will be comprehensive, and the code provided will be simple to follow.

Reference

N/A

Templates to use as guides

lalith1403 commented 2 years ago

The explanation isn’t sufficient. Please explain in detail how this article will add value to the reader.

FranciscaNg commented 2 years ago

@lalith1403 Done, thanks.