pandas-dev / pandas

Flexible and powerful data analysis / manipulation library for Python, providing labeled data structures similar to R data.frame objects, statistical functions, and much more
https://pandas.pydata.org
BSD 3-Clause "New" or "Revised" License

ENH: add `atol` to pd.DataFrame.compare() #54677

Open JonahBreslow opened 1 year ago

JonahBreslow commented 1 year ago

Feature Type

Problem Description

When comparing pandas dataframes with floating point numbers, it can be extremely useful to compare with an absolute tolerance (atol) as we see in pandas.testing.assert_frame_equal.
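To illustrate the gap, here is a small sketch (the frames `left` and `right` are made-up example data): `assert_frame_equal` accepts `atol`, while `DataFrame.compare()` currently reports every non-identical value.

```python
import pandas as pd
from pandas.testing import assert_frame_equal

# Made-up example data: the second value differs by 0.0005.
left = pd.DataFrame({"x": [1.0, 2.0]})
right = pd.DataFrame({"x": [1.0, 2.0005]})

# Passes, because the difference is within the absolute tolerance.
assert_frame_equal(left, right, check_exact=False, atol=1e-3)

# compare() has no tolerance argument, so the same row is reported as different.
diff = left.compare(right)
print(diff)
```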

Feature Description

I propose we add an argument to the function signature of pd.DataFrame.compare() as follows:

class DataFrame(NDFrame, OpsMixin):
    def __init__(...):
        ...
    ...
    def compare(self, ..., atol: float | None = None):
        # compare numeric values using the absolute tolerance `atol`
        ...
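As a rough sketch of the intended behaviour using only existing API: the helper name `compare_with_atol` is hypothetical and not part of pandas, and it assumes both frames are identically labeled, as `compare()` itself requires. Values in the right frame that are within `atol` of the left frame are treated as equal before delegating to `compare()`.

```python
import numpy as np
import pandas as pd
from pandas.api.types import is_bool_dtype, is_numeric_dtype


def compare_with_atol(left: pd.DataFrame, right: pd.DataFrame, atol: float) -> pd.DataFrame:
    """Hypothetical helper: like left.compare(right), ignoring numeric differences <= atol."""
    adjusted = right.copy()
    for col in left.columns:
        both_numeric = is_numeric_dtype(left[col]) and is_numeric_dtype(right[col])
        any_bool = is_bool_dtype(left[col]) or is_bool_dtype(right[col])
        if both_numeric and not any_bool:
            # rtol=0.0 so that only the absolute tolerance applies.
            close = np.isclose(
                left[col].to_numpy(), right[col].to_numpy(), rtol=0.0, atol=atol
            )
            # Where values are close, copy the left value so compare() sees no difference.
            adjusted[col] = right[col].where(~close, left[col])
    return left.compare(adjusted)


# Made-up example data.
left = pd.DataFrame({"a": [1.0, 2.0, 3.0], "b": ["x", "y", "z"]})
right = pd.DataFrame({"a": [1.0000001, 2.5, 3.0], "b": ["x", "y", "q"]})
result = compare_with_atol(left, right, atol=1e-3)
print(result)
```

Row 0 differs only by floating-point noise and is dropped, while the genuine differences in rows 1 and 2 are still reported in the usual `self`/`other` layout.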

Alternative Solutions

This is some workaround code that works for my specific use case, but it is definitely not general.

import numpy as np
import pandas as pd

def deep_compare(
    df1: pd.DataFrame, df2: pd.DataFrame, atol: float
) -> pd.DataFrame:
    """Compare two pandas dataframes at a deep level. This will
    return a dataframe with the differences between the two frames
    explicitly shown.

    Args:
        df1 (pd.DataFrame): The left dataframe
        df2 (pd.DataFrame): The right dataframe
        atol (float): Absolute tolerance

    Returns:
        pd.DataFrame: A dataframe with the differences between the two frames
    """
    diff_df = pd.DataFrame(index=df1.index, columns=df1.columns)
    for col in df1.columns:
        if check_cols_are_numeric(df1, df2, col):
            diff_df[col] = tolerance_compare(df1, df2, atol, col)
        else:
            diff_df[col] = exact_compare(df1, df2, col)

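    # Keep only the rows and columns that contain at least one difference.
    diff_df = remove_rows_cols_all_na(diff_df)
    diff_columns = diff_df.columns
    right_df = df2[diff_columns]

    # Put the left and right values side by side; the suffixes are specific
    # to my use case.
    diff_df = diff_df.merge(
        right_df, left_index=True, right_index=True, suffixes=("_pg", "_snf")
    )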

    return diff_df

def exact_compare(
    df1: pd.DataFrame, df2: pd.DataFrame, col: str
) -> np.ndarray:
    return np.where(df1[col] != df2[col], df1[col], np.nan)

def tolerance_compare(
    df1: pd.DataFrame, df2: pd.DataFrame, atol: float, col: str
) -> np.ndarray:
    return np.where(np.abs(df1[col] - df2[col]) > atol, df1[col], np.nan)

def remove_rows_cols_all_na(diff_df: pd.DataFrame) -> pd.DataFrame:
    diff_df = diff_df.dropna(how="all")
    diff_df = diff_df.dropna(axis=1, how="all")
    return diff_df

def check_cols_are_numeric(
    df1: pd.DataFrame, df2: pd.DataFrame, col: str
) -> bool:
    return pd.api.types.is_numeric_dtype(
        df1[col]
    ) and pd.api.types.is_numeric_dtype(df2[col])

Additional Context

No response

aanilpala commented 1 year ago

I'd rather use is_any_real_numeric_dtype, to avoid applying the tolerance comparison to boolean values.
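A quick check of the difference between the two predicates, assuming pandas 2.0 or later (where `pandas.api.types.is_any_real_numeric_dtype` is available); the series below are made-up example data.

```python
import pandas as pd
from pandas.api.types import is_any_real_numeric_dtype, is_numeric_dtype

# Made-up example data.
flags = pd.Series([True, False])
values = pd.Series([1.5, 2.5])

bool_is_numeric = is_numeric_dtype(flags)        # bool counts as numeric here
bool_is_real = is_any_real_numeric_dtype(flags)  # but not as a real numeric dtype
float_is_real = is_any_real_numeric_dtype(values)

print(bool_is_numeric, bool_is_real, float_is_real)
```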

tomhoq commented 6 months ago

@mroeschke Hi! I would love to work on this enhancement. Would it be OK to start working on it even though it has not been reviewed yet? Also, if someone could review it in the meantime, I would appreciate it.

Thank you!

mroeschke commented 6 months ago

I would say any issue that has not been triaged yet should not be worked on until a core team member has reviewed it.