pandas-dev / pandas

Flexible and powerful data analysis / manipulation library for Python, providing labeled data structures similar to R data.frame objects, statistical functions, and much more
https://pandas.pydata.org
BSD 3-Clause "New" or "Revised" License

ENH: In pandas.testing.assert_frame_equal, support per-column configuration #59548

Open adrian17 opened 3 weeks ago

adrian17 commented 3 weeks ago

Feature Type

Problem Description

Our internal validation tool's tolerance needs to depend on compared metrics. For example, when obtaining results from an analytical database from a query like

SELECT count(distinct device_id) as device_count, avg(score) as score GROUP BY ...

We expect device_count to always be accurate, but score is expected to have random numerical floating point inaccuracies.

My old code ran assert_frame_equal several times on different subsets of columns, which is cumbersome and doesn't express the intent well. I recently refactored it by extracting assert_frame_equal's implementation and adding extra arguments to support per-column rtol and atol. It would be nice if this ability were built into pandas.

Note that this overlaps a bit with feature request https://github.com/pandas-dev/pandas/issues/54861.

Feature Description

One way is to add extra arguments to assert_frame_equal, usable like so:

assert_frame_equal(
    left,
    right,
    rtol=1e-5,
    atol=1e-8,
    rtols={'device_count': 0, 'score': 1e-6},
    atols={'device_count': 0}, # for unspecified columns, the rtol/atol argument is used as default
)
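As a rough illustration of the behaviour described above, here is a minimal sketch built only on the public pandas.testing.assert_series_equal. The helper name assert_frame_equal_with_tolerances and the rtols/atols arguments are hypothetical and not part of pandas; the sketch assumes unique column labels and does not reproduce the other checks assert_frame_equal performs.

```python
import pandas as pd
from pandas.testing import assert_series_equal


def assert_frame_equal_with_tolerances(
    left, right, rtol=1e-5, atol=1e-8, rtols=None, atols=None
):
    """Hypothetical helper (not a pandas API): compare frames column by column."""
    rtols = rtols or {}
    atols = atols or {}
    # Column labels must match before the values are compared.
    assert list(left.columns) == list(right.columns), "column labels differ"
    for name in left.columns:
        assert_series_equal(
            left[name],
            right[name],
            rtol=rtols.get(name, rtol),  # fall back to the frame-wide default
            atol=atols.get(name, atol),
            obj=f"column {name!r}",
        )


# Made-up sample data, for illustration only.
expected = pd.DataFrame({"device_count": [10, 20], "score": [0.5, 0.25]})
actual = pd.DataFrame({"device_count": [10, 20], "score": [0.5000001, 0.25]})

assert_frame_equal_with_tolerances(
    expected,
    actual,
    rtols={"device_count": 0, "score": 1e-6},
    atols={"device_count": 0},
)
```

Columns missing from rtols or atols use the scalar rtol and atol, matching the comment in the proposed call.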

Or the entire comparison configuration (check_exact, check_datetimelike_compat etc) could be overridden per-series, for example

assert_frame_equal(
    left,
    right,
    overrides={
        'device_count': {'check_exact': True},
        'score': {'rtol': 1e-6},
    }
)
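The override variant could be approximated in a similar way. The following is only a sketch under stated assumptions: assert_frame_equal_with_overrides and its overrides argument are hypothetical names, each column is compared with the public assert_series_equal, and only keyword arguments that assert_series_equal itself accepts can be overridden.

```python
import pandas as pd
from pandas.testing import assert_series_equal


def assert_frame_equal_with_overrides(left, right, overrides=None, **defaults):
    """Hypothetical helper (not a pandas API): per-column comparison settings."""
    overrides = overrides or {}
    assert list(left.columns) == list(right.columns), "column labels differ"
    for name in left.columns:
        # Per-column keyword arguments win over the frame-wide defaults.
        kwargs = {**defaults, **overrides.get(name, {})}
        assert_series_equal(left[name], right[name], **kwargs)


# Made-up sample data, for illustration only.
expected = pd.DataFrame({"device_count": [10, 20], "score": [0.5, 0.25]})
actual = pd.DataFrame({"device_count": [10, 20], "score": [0.5000001, 0.25]})

assert_frame_equal_with_overrides(
    expected,
    actual,
    overrides={
        "device_count": {"check_exact": True},
        "score": {"rtol": 1e-6},
    },
)
```

Passing a plain dict of keyword arguments per column keeps the helper small, at the cost of validating the option names only when assert_series_equal is called.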

Alternative Solutions

The current way to do it with public APIs is to do something like

for column_names, rtol in [(["device_count", ...], 0.0), (["score", ...], 1e-6), ...]:
    # select the columns to compare; the index is kept as-is
    left_subset = left[column_names]
    right_subset = right[column_names]
    assert_frame_equal(left_subset, right_subset, rtol=rtol)

specialkapa commented 3 weeks ago

take

rhshadrach commented 2 weeks ago

Thanks for the request!

Compared to DataFrame methods, what makes assert_frame_equal unique in that it should support by-column arguments?

It does not seem to me to be maintainable to allow by-column specific arguments across the API for DataFrame methods, and therefore we should not do so here for API consistency. The alternative solution in the OP appears to me to be the right, sustainable, approach.

specialkapa commented 2 weeks ago

Hi @rhshadrach. Thanks for the comment. I have done some work on this and I think the solution I've come up with is sustainable going forward. I'm just ironing out a few details. A few tests are failing, but they appear to be unrelated to 'assert_frame_equal'. I should be ready to open a PR this week. Perhaps we can discuss whether the solution is compatible with the API during the PR review?

rhshadrach commented 2 weeks ago

@specialkapa - without an answer to the above question, I am opposed to adding this feature. The issue I have with sustainability is not for this one particular feature, but rather having to add similar things to other methods for DataFrames.

specialkapa commented 2 weeks ago

That is a good point. Thanks for getting back to me.
