opentargets / issues

Issue tracker for Open Targets Platform and Open Targets Genetics Portal
https://platform.opentargets.org https://genetics.opentargets.org
Apache License 2.0

Completing PySpark courses on datacamp #1893

Closed DSuveges closed 2 years ago

DSuveges commented 2 years ago

Completing these courses will provide sufficient technical knowledge for the internship:

MarineGirardey commented 2 years ago

What I learned from PySpark DataCamp courses - Flash Card

Introduction to PySpark

(Examples from DataCamp)

Spark PySpark

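from pyspark.sql import SparkSession
# Reuse the active session, or create one if none exists
spark = SparkSession.builder.getOrCreate()
# Run a SQL query (query is a SQL string); the result is a DataFrame
spark.sql(query)
# Read a CSV file, using its first line as column names
df = spark.read.csv(file_path, header=True)
# Print the first 20 rows (the default)
df.show()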

or

df.show(10)
flights.withColumn("duration_hrs", flights.air_time / 60)
flights.filter(flights.distance > 1000)
flights.select(flights.origin, flights.dest, flights.carrier)

or

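# avg_speed is not defined in these notes; this definition is an assumption
avg_speed = (flights.distance / (flights.air_time / 60)).alias("avg_speed")
flights.select("origin", "dest", "tailnum", avg_speed)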
flights.join(airports, on="dest", how="leftouter")
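
The snippets above can be chained into one small pipeline. The following is only a sketch: the sample rows, the function name long_flights_with_speed, and the avg_speed definition are made up for illustration and are not taken from the DataCamp dataset.

```python
# Hypothetical sample rows standing in for the DataCamp flights data.
FLIGHT_COLUMNS = ["origin", "dest", "tailnum", "air_time", "distance"]
FLIGHT_ROWS = [
    ("SEA", "LAX", "N846VA", 132.0, 954.0),
    ("PDX", "JFK", "N559JB", 300.0, 2454.0),
    ("SEA", "HNL", "N402AS", 360.0, 2677.0),
]


def long_flights_with_speed(spark):
    """Chain withColumn, filter and select on a small in-memory DataFrame."""
    flights = spark.createDataFrame(FLIGHT_ROWS, FLIGHT_COLUMNS)
    # Derived column: air_time divided by 60, as in the notes above.
    flights = flights.withColumn("duration_hrs", flights.air_time / 60)
    # Keep only the rows whose distance is greater than 1000.
    long_flights = flights.filter(flights.distance > 1000)
    # A column expression can be named with alias() and passed to select().
    avg_speed = (long_flights.distance / long_flights.duration_hrs).alias("avg_speed")
    return long_flights.select("origin", "dest", "tailnum", avg_speed)


if __name__ == "__main__":
    # Imported here so the module can be loaded without a Spark installation.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    long_flights_with_speed(spark).show()
    spark.stop()
```

Passing the session in as an argument keeps the transformation separate from session creation, so the same function can be called from a notebook or a script.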

Data cleaning with PySpark

from pyspark.sql.types import StructType, StructField, StringType, IntegerType
people_schema = StructType([
  StructField('name', StringType(), False),
  StructField('age', IntegerType(), False),
  StructField('city', StringType(), False)
])
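from pyspark.sql.functions import when
# The value for the first branch is missing in the original notes;
# 'failing' is a placeholder, not the original label.
class_df.withColumn('student',
                    when(class_df.grades < '10/20', 'failing')
                    .when(class_df.sciences == 'maths', 'tutoring')
                    .otherwise('ok'))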
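from pyspark.sql.functions import col, udf

def getFirstAndMiddle(names):
  # Join every element except the last one (the family name)
  return ' '.join(names[:-1])

# Wrap the Python function as a UDF that returns a string
udfFirstAndMiddle = udf(getFirstAndMiddle, StringType())
voter_df = voter_df.withColumn('first_and_middle_name', udfFirstAndMiddle(col("splits")))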
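
Because a UDF wraps an ordinary Python function, that function can be checked on plain lists before Spark is involved. The sketch below assumes the intent is to drop the last element of the split name. The snake_case names and the helper add_first_and_middle are hypothetical.

```python
def get_first_and_middle(names):
    """Join every element except the last one with single spaces."""
    return " ".join(names[:-1])


def add_first_and_middle(voter_df):
    """Apply the function above as a UDF on the 'splits' array column."""
    # Imported here so the pure-Python function can be tested without Spark.
    from pyspark.sql.functions import col, udf
    from pyspark.sql.types import StringType

    first_and_middle = udf(get_first_and_middle, StringType())
    return voter_df.withColumn(
        "first_and_middle_name", first_and_middle(col("splits"))
    )


if __name__ == "__main__":
    # Plain-Python check of the logic; no Spark session is needed.
    print(get_first_and_middle(["Mary", "Ann", "Smith"]))
```

Python UDFs are generally slower than built-in column functions, so this pattern is probably best kept for logic that the built-ins cannot express.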