pandas-dev / pandas

Flexible and powerful data analysis / manipulation library for Python, providing labeled data structures similar to R data.frame objects, statistical functions, and much more
https://pandas.pydata.org
BSD 3-Clause "New" or "Revised" License

ENH: df.compare should have tolerances #48488

Open tehunter opened 2 years ago

tehunter commented 2 years ago

Feature Type

Problem Description

I would like df.compare to accept tolerance thresholds so that it can perform approximate comparisons. The assert_frame_equal utility already supports this through its rtol and atol arguments, and having the same option in compare would help users identify the rows and columns that cause their assertion to fail. In many cases it would be useful to filter out differences that are sufficiently small.
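
To illustrate the gap, here is a minimal sketch with made-up data (the frames `left` and `right` are illustrative, not from a real workload). It only uses the existing `DataFrame.compare` and `pandas.testing.assert_frame_equal` APIs and shows that the tolerance-aware assertion accepts a tiny difference that `compare` still reports.

```python
import pandas as pd
from pandas.testing import assert_frame_equal

# Illustrative data: column "a" differs by 1e-9 in the second row,
# column "b" differs by 0.5 in the second row.
left = pd.DataFrame({"a": [1.0, 2.0], "b": [3.0, 4.0]})
right = pd.DataFrame({"a": [1.0, 2.0 + 1e-9], "b": [3.0, 4.5]})

# assert_frame_equal accepts tolerances, so the tiny difference in "a"
# does not raise here.
assert_frame_equal(left[["a"]], right[["a"]], rtol=1e-5, atol=1e-8)

# compare() has no tolerance arguments, so it reports both "a" and "b"
# as differing in the second row.
diff = left.compare(right)
print(diff)
```

With a tolerance option, only column "b" would need to appear in the result.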

Feature Description

    def compare(
        self,
        other: DataFrame,
        align_axis: Axis = 1,
        keep_shape: bool = False,
        keep_equal: bool = False,
        rtol: float | None = None,
        atol: float | None = None,
    ) -> DataFrame:
        """
        ...
        rtol : float | None, default None
            Relative tolerance. Numeric differences within this tolerance will not be considered differences for the purposes of "keep_shape" and will be shown as NaN if "keep_equal" is False.

        atol : float | None, default None
            Absolute tolerance. Numeric differences within this tolerance will not be considered differences for the purposes of "keep_shape" and will be shown as NaN if "keep_equal" is False.
        """
        ...

For implementation, the current comparison is essentially the following check: mask = ~((self == other) | (self.isna() & other.isna())). From a quick glance at _testing.assert_almost_equal, it appears we could implement this by calling that function on each item of the DataFrame in turn, though I'm not sure whether it's okay to reference the _testing module outside of testing functions.
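
As a rough sketch of how the mask above could take tolerances into account without touching the _testing module, the helper below relaxes the exact mask for numeric columns using `numpy.isclose`. The function name `tolerant_diff_mask` and its defaults are hypothetical and not part of pandas; it assumes both frames are identically labeled and that the numeric columns are plain NumPy dtypes.

```python
import numpy as np
import pandas as pd


def tolerant_diff_mask(left, right, rtol=0.0, atol=0.0):
    """Hypothetical helper: True where the two frames differ beyond tolerance."""
    # The exact check quoted in the issue: equal values and NaN pairs are not differences.
    mask = ~((left == right) | (left.isna() & right.isna()))

    # Only columns that are numeric in both frames can use a tolerance.
    numeric = left.select_dtypes(include="number").columns.intersection(
        right.select_dtypes(include="number").columns
    )
    if len(numeric):
        close = np.isclose(
            left[numeric].to_numpy(dtype=float),
            right[numeric].to_numpy(dtype=float),
            rtol=rtol,
            atol=atol,
            equal_nan=True,
        )
        # A numeric cell stays flagged only if it is also outside the tolerance.
        mask[numeric] = mask[numeric] & ~close
    return mask


# Illustrative data, not from the issue.
left = pd.DataFrame({"x": [1.0, 2.0, np.nan], "s": ["a", "b", "c"]})
right = pd.DataFrame({"x": [1.0 + 1e-9, 2.5, np.nan], "s": ["a", "B", "c"]})

exact = tolerant_diff_mask(left, right)
relaxed = tolerant_diff_mask(left, right, atol=1e-6)
print(relaxed)
```

Note that `numpy.isclose` is asymmetric: it tests `abs(a - b) <= atol + rtol * abs(b)`, so the result can depend on which frame is treated as the reference when rtol is nonzero.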

Alternative Solutions

This could also be implemented more directly with math.isclose calls, but those would need to be applied only to numeric columns.
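
A minimal sketch of the math.isclose alternative, restricted to numeric columns, might look like the following. The helper name `isclose_numeric_columns` is hypothetical, and the element-wise Python loop is for illustration only; it would likely be slow on large frames.

```python
import math

import pandas as pd


def isclose_numeric_columns(left, right, rel_tol=1e-9, abs_tol=0.0):
    """Hypothetical helper: element-wise math.isclose on numeric columns only."""
    numeric = left.select_dtypes(include="number").columns
    result = {}
    for col in numeric:
        # Pair up values positionally; both frames are assumed to share the same index.
        result[col] = [
            math.isclose(a, b, rel_tol=rel_tol, abs_tol=abs_tol)
            for a, b in zip(left[col], right[col])
        ]
    return pd.DataFrame(result, index=left.index)


# Illustrative data, not from the issue.
left = pd.DataFrame({"x": [1.0, 2.0], "s": ["a", "b"]})
right = pd.DataFrame({"x": [1.0 + 1e-12, 2.1], "s": ["a", "z"]})

close = isclose_numeric_columns(left, right)
print(close)
```

One caveat: math.isclose never treats NaN as close to anything, including another NaN, so NaN pairs would still need the separate isna() check used by the current mask.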

Additional Context

No response

CompRhys commented 1 year ago

BUMP