DSuveges closed this issue 2 years ago.
(Examples from DataCamp)
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()
spark.sql(query)
Create a DataFrame (e.g. from a pandas DataFrame)
spark.createDataFrame(pd_df)
Import data from CSV (or parquet)
spark.read.csv(file_path, header=True)
df.show()    # prints the first 20 rows by default
or
df.show(10)  # prints the first 10 rows
flights.withColumn("duration_hrs", flights.air_time / 60)
flights.filter(flights.distance > 1000)
flights.select(flights.origin, flights.dest, flights.carrier)
or
flights.select("origin", "dest", "tailnum", avg_speed)  # avg_speed: a Column expression defined earlier
flights.join(airports, on="dest", how="leftouter")
from pyspark.sql.types import *
people_schema = StructType([
StructField('name', StringType(), False),
StructField('age', IntegerType(), False),
StructField('city', StringType(), False)
])
Combine 2 dataframes into 1
df3 = df1.union(df2)  # matches columns by position; use unionByName() to match by name
Save dataframe in parquet format
df3.write.parquet('AA_DFW_ALL.parquet', mode='overwrite')
Read parquet file into a new DataFrame and run a count
spark.read.parquet('file.parquet').count()
Create a column from multiple conditions with when(), otherwise()
from pyspark.sql.functions import when

class_df = class_df.withColumn('student',
    when(class_df.grades < '10/20', 'at_risk')  # value for this branch was missing in the notes
    .when(class_df.sciences == 'maths', 'tutoring')
    .otherwise('ok'))
from pyspark.sql.functions import udf, col
from pyspark.sql.types import StringType

def getFirstAndMiddle(names):
    # names is the array column produced by split(); join its tokens with spaces
    return ' '.join(names)

udfFirstAndMiddle = udf(getFirstAndMiddle, StringType())
voter_df = voter_df.withColumn('first_and_middle_name', udfFirstAndMiddle(col("splits")))
annotations_df_filtered = annotations_df.filter(~ (annotations_df["colcount"] < 5))
Completing these courses will provide sufficient technical knowledge for the internship: