pyjanitor-devs / pyjanitor

Clean APIs for data cleaning. Python implementation of R package Janitor
https://pyjanitor-devs.github.io/pyjanitor
MIT License

[ENH] let `fill_empty` support filling NaN values with the mean, median, or mode #1044

Open Zeroto521 opened 2 years ago

Zeroto521 commented 2 years ago

Brief Description

As the title says. For some data, such as GDP, filling NaN values with 0 isn't a good idea, because most GDP values are in the millions. In such cases we would rather fill NaN values with the mean than with 0.

API

def fill_empty(
    df: pd.DataFrame,
    column_names: list[str | int],
    value: Any = None,
    method: str | None = None,
) -> pd.DataFrame:
    ...

  1. At least one of value and method must not be None.
  2. method must be one of 'mean', 'median', or 'mode'.
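
To make the proposal concrete, here is a minimal sketch of how the signature and the two rules above might behave, written in plain pandas. The name `fill_empty_sketch` is hypothetical and this is not pyjanitor's implementation. Rejecting calls that pass both `value` and `method`, and returning a copy rather than mutating the input, are assumptions that the proposal does not spell out.

```python
from typing import Any, Optional, Sequence, Union

import pandas as pd

# Statistics named in the proposal.
_METHODS = ("mean", "median", "mode")


def fill_empty_sketch(
    df: pd.DataFrame,
    column_names: Sequence[Union[str, int]],
    value: Any = None,
    method: Optional[str] = None,
) -> pd.DataFrame:
    """Hypothetical stand-in for the proposed `fill_empty` behaviour."""
    # Assumption: exactly one of `value` / `method` may be given.
    if (value is None) == (method is None):
        raise ValueError("Pass exactly one of `value` or `method`.")
    if method is not None and method not in _METHODS:
        raise ValueError(f"`method` must be one of {_METHODS}, got {method!r}.")

    out = df.copy()  # assumption: leave the caller's frame untouched
    for col in column_names:
        if method is None:
            fill = value
        elif method == "mode":
            modes = out[col].mode()  # may hold several values on ties
            if modes.empty:  # all-NaN column: nothing to fill with
                continue
            fill = modes.iloc[0]  # ties resolved by taking the smallest
        else:
            fill = getattr(out[col], method)()  # Series.mean() or .median()
        out[col] = out[col].fillna(fill)
    return out
```

Called as `fill_empty_sketch(df, ["nan-col"], method="mean")`, this fills each listed column with its own statistic, computed over the non-missing values.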

Example

import pandas as pd
import janitor  # noqa

# create a DataFrame
df = pd.Series([2, 2, None, 0, 4], name="nan-col").to_frame()
#    nan-col
# 0      2.0
# 1      2.0
# 2      NaN
# 3      0.0
# 4      4.0

# fill NaN with mean value
df.fill_empty(["nan-col"], method="mean")
#    nan-col
# 0      2.0
# 1      2.0
# 2      2.0
# 3      0.0
# 4      4.0

samukweku commented 2 years ago

@Zeroto521 impute covers this use case; at this point, I wonder if it is okay to deprecate one of these functions, so that we have just one that covers NA filling? @pyjanitor-devs/core-devs

thatlittleboy commented 2 years ago

@samukweku I'm okay with deprecating one of impute or fill_empty. It seems that impute covers not only the "mean/mode/.." use case but also imputing with a constant value, which is fill_empty's current functionality?

I'd be inclined to keep impute over fill_empty (at least within the DS/ML community, impute is a commonly used term; not sure about the broader data world).

samukweku commented 2 years ago

Yeah, impute is a wrapper around fillna, with the added benefit of statistical imputation.
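
For reference, the plain-pandas `fillna` calls that such a wrapper would presumably delegate to can be sketched as below. This does not call pyjanitor's impute, whose exact signature is not shown in this thread. The sample values are made up so that the three statistics differ.

```python
import pandas as pd

# Skewed, made-up sample so that mean, median and mode disagree.
s = pd.Series([1, 1, None, 2, 8], name="nan-col")

filled = {
    "mean": s.fillna(s.mean()),            # (1 + 1 + 2 + 8) / 4
    "median": s.fillna(s.median()),        # middle of [1, 1, 2, 8]
    "mode": s.fillna(s.mode().iloc[0]),    # most frequent value
}

for name, col in filled.items():
    print(name, col.iloc[2])
# mean 3.0
# median 1.5
# mode 1.0
```

Each call is an ordinary constant fill once the statistic has been computed, which is why a single function can cover both the constant-value and the statistic-based cases.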