jekwatt / idiomatic_pandas

Tips and tricks for the most common data handling tasks with pandas.

Handling nulls #20

Open jekwatt opened 3 years ago

jekwatt commented 3 years ago

https://datatofish.com/check-nan-pandas-dataframe/

4 ways to check for NaN in Pandas DataFrame:

1.

# check for NaN under a single DataFrame column:
df['col_1'].isnull().values.any()

2.

# count the NaN under a single DataFrame column:
# False -> 0, True -> 1
df['col_1'].isnull().sum()

3.

# check for NaN under an entire DataFrame:
df.isnull().values.any()

# view only the rows where col_1 is missing
df[df["col_1"].isnull()]

4.

# count the NaN under an entire DataFrame:
df.isnull().sum()  # NaN count per column
df.isnull().sum().sum()  # total NaN count across the whole DataFrame
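To make the four checks above concrete, here is a minimal sketch on a tiny made-up DataFrame. The data and the variable names are assumptions for illustration only; the column names follow the snippets above.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data, not from the linked article.
df = pd.DataFrame({
    "col_1": [1.0, np.nan, 3.0],
    "col_2": ["a", None, None],
})

has_nan_col = df["col_1"].isnull().any()    # is there any NaN in one column?
nan_count_col = df["col_1"].isnull().sum()  # how many NaN in one column?
has_nan_df = df.isnull().values.any()       # is there any NaN in the whole frame?
nan_per_column = df.isnull().sum()          # Series with one count per column
nan_total = df.isnull().sum().sum()         # single total for the whole frame
missing_rows = df[df["col_1"].isnull()]     # only the rows where col_1 is missing
```

Note that `df.isnull().sum()` returns one count per column; the second `.sum()` collapses those into a single number.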
jekwatt commented 3 years ago

dropna

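# drop the rows that contain at least one missing value
df.dropna(how="any")
# drop only the rows where every value is missing
df.dropna(how="all")

# same, but only look at the listed columns
df.dropna(subset=["col_1", "col_2"], how="any")
df.dropna(subset=["col_1", "col_2"], how="all")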

fillna

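# assign the result back; inplace=True on a selected column may not update df
df["col_1"] = df["col_1"].fillna("MISSING VALUE")
# dropna=False also counts any NaN that are still left
df["col_1"].value_counts(dropna=False)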
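A minimal sketch of how `how="any"` and `how="all"` differ, and of `fillna` returning a new Series. The sample data and variable names are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data: row 0 is complete, row 1 is partly missing,
# row 2 is missing in every column.
df = pd.DataFrame({
    "col_1": ["x", np.nan, np.nan],
    "col_2": [1.0, 2.0, np.nan],
})

any_dropped = df.dropna(how="any")            # keeps row 0 only
all_dropped = df.dropna(how="all")            # drops row 2 only
subset_dropped = df.dropna(subset=["col_2"])  # only col_2 is checked: keeps rows 0 and 1

# fillna returns a new Series; df itself is unchanged until you assign back.
filled = df["col_1"].fillna("MISSING VALUE")
counts = filled.value_counts(dropna=False)
```

None of these calls modify `df`; each returns a new object, so assign the result if you want to keep it.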
jekwatt commented 3 years ago

na_values

# provide list
df = pd.read_csv("file.csv", na_values=["not available", "n.a."])

# provide dictionary
df = pd.read_csv("file.csv", na_values={
    "col_1": ["not available", "n.a.", -1],
    "col_2": ["not available", "n.a."],
})

Use this flag when reading a CSV file (keep_default_na is a read_csv parameter, not a to_csv one):

# do not interpret the default NA strings at load time
# e.g. "", NA, N/A, NaN
df = pd.read_csv(csv_path, keep_default_na=False)
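A minimal sketch of how `na_values` and `keep_default_na` change what `read_csv` treats as missing. The CSV text is made up and is read from an in-memory `io.StringIO` buffer instead of a real file.

```python
import io

import pandas as pd

# Hypothetical CSV content: "NA" is one of pandas' default missing markers,
# "n.a." is not.
csv_text = "col_1,col_2\nNA,n.a.\n5,ok\n"

# Defaults: only "NA" becomes NaN.
default = pd.read_csv(io.StringIO(csv_text))

# Extra marker: "n.a." is treated as missing in addition to the defaults.
custom = pd.read_csv(io.StringIO(csv_text), na_values=["n.a."])

# Defaults disabled: "NA" is kept as a literal string.
literal = pd.read_csv(io.StringIO(csv_text), keep_default_na=False)
```

With `keep_default_na=False` and `na_values` given, only the strings you list are treated as missing.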
jekwatt commented 2 years ago

Answers from Cameron:

pandas developers began searching for a solution ~2 years ago and introduced the pandas.NA value (instead of relying solely on numpy.nan).
Together with the nullable extension dtypes (e.g. "Int64", "boolean", "string"), pandas.NA lets a column keep its dtype instead of being cast to float or object (which np.nan forces).

A trick you can use is to temporarily mask away the NaNs via `df.loc[df['col'].notnull(), 'col']` and
apply an operation to that subset only.
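
A minimal sketch of both points, on made-up data with illustrative variable names: first the dtype difference between np.nan and pandas.NA, then the masking trick with an operation (`str.upper`) that would raise on a NaN.

```python
import numpy as np
import pandas as pd

# np.nan forces an integer column to float64.
float_col = pd.Series([1, 2, np.nan])

# With the nullable "Int64" dtype, pd.NA keeps the column integer.
int_col = pd.Series([1, 2, pd.NA], dtype="Int64")

# Masking trick: apply the operation only where a value is present.
df = pd.DataFrame({"col": ["a", np.nan, "c"]})
mask = df["col"].notnull()
# str.upper would raise TypeError on NaN, so only the masked rows are touched.
df.loc[mask, "col"] = df.loc[mask, "col"].map(str.upper)
```

The missing entry in `df["col"]` is left as it was; only the non-null rows are rewritten.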