palantir / pyspark-style-guide

This is a guide to PySpark code style presenting common situations and the associated best practices based on the most frequent recurring topics across the PySpark repos we've encountered.
MIT License

Select before joining #14

Open raphaelgl opened 1 year ago

raphaelgl commented 1 year ago

To be consistent with the principle of first selecting the inputs a transform needs, should we also recommend doing that before a join? The good join option would then be:

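# good
from pyspark.sql import functions as F

# Keep only the columns the join needs, renaming them up front.
flights = flights.select(
    F.col('start_time').alias('flight_start_time'),
    F.col('end_time').alias('flight_end_time'),
    'flight_code',
)

parking = parking.select(
    F.col('total_time').alias('client_parking_total_time'),
    'flight_code',
)

# flight_code is the only column the two inputs share.
flights = flights.join(parking, on='flight_code', how='left')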