This is a guide to PySpark code style. It presents common situations and the associated best practices, based on the most frequently recurring topics we've encountered across PySpark repos.
When defining a function, it is useful to follow a consistent convention for PySpark DataFrame type hints, e.g.
```python
from pyspark.sql import DataFrame
import pyspark.pandas as ps


def my_function(my_dataframe: DataFrame) -> ps.DataFrame:
    # pandas_api() (Spark 3.2+) returns a pandas-on-Spark DataFrame, matching
    # the ps.DataFrame annotation; toPandas() would instead collect the data
    # into a plain pandas.DataFrame on the driver, which would not match.
    return my_dataframe.pandas_api()
```
However, the above doesn't clearly distinguish between the different DataFrame types. Perhaps an alias for pyspark.sql.DataFrame is required, although I'm not sure how to make it read as distinct from ps.DataFrame (an established alias).
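One possible convention is a sketch rather than an established standard: import pyspark.sql.DataFrame under an explicit alias so each annotation names its type unambiguously. The alias SparkDataFrame and the function names below are illustrative choices, not a fixed API:

```python
import pandas as pd
import pyspark.pandas as ps
from pyspark.sql import DataFrame as SparkDataFrame  # illustrative alias


def to_pandas_on_spark(my_dataframe: SparkDataFrame) -> ps.DataFrame:
    # SparkDataFrame makes clear the input is a pyspark.sql.DataFrame,
    # while the return value is a pandas-on-Spark DataFrame.
    return my_dataframe.pandas_api()


def collect_to_pandas(my_dataframe: SparkDataFrame) -> pd.DataFrame:
    # toPandas() collects the data to the driver as a plain pandas.DataFrame,
    # a third type that the aliases keep visually distinct.
    return my_dataframe.toPandas()
```

Whatever alias is chosen, applying it consistently across a codebase is what gives the convention its value.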