palantir / pyspark-style-guide

This is a guide to PySpark code style presenting common situations and the associated best practices based on the most frequent recurring topics across the PySpark repos we've encountered.
MIT License

Help on "Refactor complex logical operations" section #13

marcosdotme closed 1 year ago

marcosdotme commented 1 year ago

I'm trying to reproduce similar code to reduce the complexity of some logical clauses in my code, but I don't fully understand the example. Can someone help me?

Based on this dataset, I need to create a column called "target". Our targets are all men from Utah.

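from pyspark.sql import SparkSession, functions as F

# `spark` is predefined in the pyspark shell and in notebooks; create it explicitly elsewhere
spark = SparkSession.builder.getOrCreate()

data = [("James", "M", "Utah"),
        ("Michael", "M", "Oregon"),
        ("Maria", "F", "Utah"),
        ("Jennifer", "F", "Oregon"),
        ("Robert", "M", "Utah")]

columns = ["name", "gender", "state"]
df = spark.createDataFrame(data=data, schema=columns)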

What I tried to do:

is_men = df.filter(F.col("gender") == "M")
from_utah = df.filter(F.col("state") == "Utah")

df.withColumn("target", F.when((is_men & from_utah), "Yes")).show()

The code above didn't work and raised an error:

TypeError: unsupported operand type(s) for &: 'DataFrame' and 'DataFrame'

Following the example in the "Refactor complex logical operations" section exactly, the code should be something like this:

is_men = (F.col("gender") == "M")
from_utah = (F.col("state") == "Utah")

df.withColumn("target", F.when((is_men & from_utah), "Yes")).show()

But it also doesn't work, and it doesn't make sense to me. Where is the DataFrame reference for the columns "gender" and "state"?

cheTesta commented 1 year ago

@marcosdotme Hi. This question might be better suited to Stack Overflow. Anyhow.

But it also doesn't work, and it doesn't make sense to me. Where is the DataFrame reference for the columns "gender" and "state"?

I cannot comment on whether it makes sense, but it definitely works.

is_men = (F.col("gender") == "M")
from_utah = (F.col("state") == "Utah")
df.withColumn("marker", F.when(is_men & from_utah, "Yes").otherwise("No")).show()

results in

> +--------+------+------+------+
> |    name|gender| state|marker|
> +--------+------+------+------+
> |   James|     M|  Utah|   Yes|
> | Michael|     M|Oregon|    No|
> |   Maria|     F|  Utah|    No|
> |Jennifer|     F|Oregon|    No|
> |  Robert|     M|  Utah|   Yes|
> +--------+------+------+------+

Check out the type of is_men:

type(is_men)
> pyspark.sql.column.Column

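This is a Column expression, not actual data, whereas your first formulation of the two objects returned actual DataFrames! F.col("gender") only refers to a column by name; the name is resolved against a DataFrame when the expression is passed to a method such as df.withColumn.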

is_men = df.filter(F.col("gender") == "M")
type(is_men)
> pyspark.sql.dataframe.DataFrame

marcosdotme commented 1 year ago

Thank you so much for the reply, @cheTesta! I was doing it wrong due to my lack of knowledge of PySpark.

I will close this issue.