ENH: add `atol` to pd.DataFrame.compare()

JonahBreslow commented 1 year ago

Feature Type

[X] Adding new functionality to pandas
[X] Changing existing functionality in pandas
[ ] Removing existing functionality in pandas

Problem Description

When comparing pandas DataFrames that contain floating point numbers, it can be extremely useful to compare with an absolute tolerance (atol), as pandas.testing.assert_frame_equal already allows.
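To illustrate the gap, here is a minimal sketch with made-up data (the frames `left` and `right` are illustrative, not from the original report). It shows `DataFrame.compare()` reporting a floating point rounding difference that `assert_frame_equal` accepts once `atol` is supplied:

```python
import pandas as pd

# Illustrative data: 0.1 + 0.2 is not exactly 0.3 in floating point.
left = pd.DataFrame({"x": [0.1 + 0.2, 1.0]})
right = pd.DataFrame({"x": [0.3, 1.0]})

# compare() tests for exact equality, so the rounding error shows up as a diff.
exact_diff = left.compare(right)

# assert_frame_equal accepts atol and raises nothing for these frames.
pd.testing.assert_frame_equal(left, right, check_exact=False, atol=1e-9)
```

Here `exact_diff` contains one row for index 0, even though the two values differ only by rounding error.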

Feature Description

I propose we add an argument to the function signature of pd.DataFrame.compare() as follows:

class DataFrame(NDFrame, OpsMixin):
    def __init__(...): ...
...
    def compare(self, ..., atol: float | None = None):
        # compare numeric values within an absolute tolerance
        ...
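One way the proposed behaviour could be approximated today is sketched below. The helper name `compare_with_atol` is hypothetical and is not part of pandas. It assumes both frames are identically labelled (as `compare()` already requires), overwrites left-hand numeric values that fall within `atol` of the right-hand value, and then delegates to the existing `DataFrame.compare()`:

```python
import pandas as pd
from pandas.api.types import is_bool_dtype, is_numeric_dtype


def _is_real_number_column(series: pd.Series) -> bool:
    # Treat booleans as non-numeric so they are compared exactly.
    return is_numeric_dtype(series) and not is_bool_dtype(series)


def compare_with_atol(
    left: pd.DataFrame, right: pd.DataFrame, atol: float = 0.0, **kwargs
) -> pd.DataFrame:
    """Hypothetical helper: DataFrame.compare() with an absolute tolerance."""
    adjusted = left.copy()
    for col in left.columns:
        if _is_real_number_column(left[col]) and _is_real_number_column(right[col]):
            # True where the two values are within the tolerance.
            close = (left[col] - right[col]).abs() <= atol
            # Replace "close" values with the right-hand value so that
            # compare() sees them as equal; keep all other values as-is.
            adjusted[col] = left[col].where(~close, right[col])
    return adjusted.compare(right, **kwargs)


# Illustrative data, not from the original report.
left = pd.DataFrame({"a": [1.0, 2.0, 3.0], "b": ["x", "y", "z"]})
right = pd.DataFrame({"a": [1.0 + 1e-9, 2.5, 3.0], "b": ["x", "y", "w"]})
result = compare_with_atol(left, right, atol=1e-6)
```

One caveat of this approach: with `keep_equal=True`, values within tolerance would display the right-hand value in the `self` column, so a real implementation inside pandas would likely need to build the mask differently.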

Alternative Solutions

This is some workaround code that works for my specific use case, but it is most definitely not general:

import numpy as np
import pandas as pd

def deep_compare(
    df1: pd.DataFrame, df2: pd.DataFrame, atol: float
) -> pd.DataFrame:
    """Compare two pandas dataframes at a deep level. This will
    return a dataframe with the differences between the two frames
    explicitly shown.

    Args:
        df1 (pd.DataFrame): The left dataframe
        df2 (pd.DataFrame): The right dataframe
        atol (float): Absolute tolerance

    Returns:
        pd.DataFrame: A dataframe with the differences between the two frames
    """
    diff_df = pd.DataFrame(index=df1.index, columns=df1.columns)
    for col in df1.columns:
        if check_cols_are_numeric(df1, df2, col):
            diff_df[col] = tolerance_compare(df1, df2, atol, col)
        else:
            diff_df[col] = exact_compare(df1, df2, col)

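    # Drop rows and columns in which nothing differs.
    diff_df = remove_rows_cols_all_na(diff_df)
    diff_columns = diff_df.columns
    right_df = df2[diff_columns]

    # Attach the right-hand values; the suffixes distinguish df1 from df2.
    diff_df = diff_df.merge(
        right_df, left_index=True, right_index=True, suffixes=("_pg", "_snf")
    )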

    return diff_df

def exact_compare(
    df1: pd.DataFrame, df2: pd.DataFrame, col: str
) -> np.ndarray:
    return np.where(df1[col] != df2[col], df1[col], np.nan)

def tolerance_compare(
    df1: pd.DataFrame, df2: pd.DataFrame, atol: float, col: str
) -> np.ndarray:
    return np.where(np.abs(df1[col] - df2[col]) > atol, df1[col], np.nan)

def remove_rows_cols_all_na(diff_df: pd.DataFrame) -> pd.DataFrame:
    diff_df = diff_df.dropna(how="all")
    diff_df = diff_df.dropna(axis=1, how="all")
    return diff_df

def check_cols_are_numeric(
    df1: pd.DataFrame, df2: pd.DataFrame, col: str
) -> bool:
    return pd.api.types.is_numeric_dtype(
        df1[col]
    ) and pd.api.types.is_numeric_dtype(df2[col])

Additional Context

No response

aanilpala commented 1 year ago

I'd rather use is_any_real_numeric_dtype to avoid tolerance comparison on boolean values
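A small sketch of the difference between the two dtype checks, using made-up Series. It assumes pandas 2.0 or later, where `pandas.api.types.is_any_real_numeric_dtype` is available:

```python
import pandas as pd
from pandas.api.types import is_any_real_numeric_dtype, is_numeric_dtype

# Illustrative data.
flags = pd.Series([True, False])
prices = pd.Series([1.5, 2.5])

# is_numeric_dtype counts booleans as numeric...
bool_is_numeric = is_numeric_dtype(flags)
# ...while is_any_real_numeric_dtype excludes them.
bool_is_real = is_any_real_numeric_dtype(flags)
float_is_real = is_any_real_numeric_dtype(prices)
```

With the stricter check, boolean columns would fall through to the exact comparison path instead of the subtraction-based tolerance path.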

tomhoq commented 6 months ago

@mroeschke Hi! I would love to work on this enhancement. Would it be OK to start working on it even though it has not yet been reviewed? Also, if someone could review it in the meantime, I would appreciate it.

Thank you!

mroeschke commented 6 months ago

I would say any issue that has not been triaged yet should not be worked on until a core team member has reviewed it.
