DSuveges closed this issue 2 years ago.
(Examples from DataCamp)
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()
spark.sql(query)
Create a DataFrame (e.g. from a pandas DataFrame)
spark.createDataFrame(pd_df)
Import data from CSV (or parquet)
spark.read.csv(file_path, header=True)
df.show()    # prints the first 20 rows by default
or
df.show(10)  # prints the first 10 rows
flights.withColumn("duration_hrs", flights.air_time / 60)
flights.filter(flights.distance > 1000)
flights.select(flights.origin, flights.dest, flights.carrier)
or
flights.select("origin", "dest", "tailnum", avg_speed)  # avg_speed: a Column expression defined earlier
flights.join(airports, on="dest", how="leftouter")
from pyspark.sql.types import *
people_schema = StructType([
StructField('name', StringType(), False),
StructField('age', IntegerType(), False),
StructField('city', StringType(), False)
])
Combine 2 dataframes into 1
df3 = df1.union(df2)  # matches columns by position; use unionByName() to match by name
Save dataframe in parquet format
df3.write.parquet('AA_DFW_ALL.parquet', mode='overwrite')
Read parquet file into a new DataFrame and run a count
spark.read.parquet('file.parquet').count()
Create a column from multiple conditions with when(), otherwise()
from pyspark.sql.functions import when

class_df = class_df.withColumn('student',
    when(class_df.grades < '10/20', 'at_risk')  # value for this branch was missing in the notes
    .when(class_df.sciences == 'maths', 'tutoring')
    .otherwise('ok'))
from pyspark.sql.functions import udf, col
from pyspark.sql.types import StringType

def getFirstAndMiddle(names):
    # names is the array column produced by split(); join its tokens with spaces
    return ' '.join(names)

udfFirstAndMiddle = udf(getFirstAndMiddle, StringType())
voter_df = voter_df.withColumn('first_and_middle_name', udfFirstAndMiddle(col("splits")))
annotations_df_filtered = annotations_df.filter(~ (annotations_df["colcount"] < 5))
Completing these courses will provide sufficient technical knowledge for the internship: