gowerc / diffdf

DataFrame Comparison Tool
https://gowerc.github.io/diffdf/

Implement Relative Criterion for Differences of Numeric Columns #82

Open bundfussr opened 1 year ago

bundfussr commented 1 year ago

For columns with values spanning a wide range of magnitudes, e.g. ratios, the comparison does not work well when tolerance is used. I would suggest implementing a relative criterion like the one in proc compare (https://documentation.sas.com/doc/en/pgmsascdc/9.4_3.5/proc/p0bbu58eqgufwzn16zafm1hvzfw2.htm#p0jwvxsipe3919n1n0areynrgdok).

gowerc commented 1 year ago

We are planning a minor overhaul of the package in Q4 and will look to include this change then. Thank you for raising.

bundfussr commented 1 year ago

Great, looking forward to the new version.

gowerc commented 3 months ago

Apologies for a bit of a rambly post, but I just wanted to share my current understanding of this.

First off, I want to document the behaviour of base R's all.equal for reference (mostly because this confused me and I want to have it available to look back at later when I inevitably forget).

General syntax is:

all.equal(target, current, tolerance, scale)

Just to be explicit about what it does with numeric values:

- If scale is NULL (the default), all.equal computes the mean absolute difference mean(abs(target - current)) and, provided mean(abs(target)) is greater than tolerance, divides it by mean(abs(target)); that is, it reports a relative difference. Only when mean(abs(target)) is at or below tolerance does it fall back to the plain absolute difference.
- If scale is supplied, the differences are divided by scale instead; with scale = 1 this is just an absolute comparison.

There is a good Stack Overflow post on this here which explains what I've written above and shows how it deviates from the documentation.
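
To illustrate the implicit switch (this is just base R behaviour, shown here for reference):

# tolerance = 0.01 throughout

# Large magnitudes: mean(abs(target)) > tolerance, so the check is relative
all.equal(1000, 1001, tolerance = 0.01)              # TRUE  (1 / 1000 = 0.001)

# Supplying scale = 1 forces a plain absolute comparison instead
all.equal(1000, 1001, tolerance = 0.01, scale = 1)   # "Mean absolute difference: 1"

# Tiny magnitudes: mean(abs(target)) <= tolerance, so it silently falls back
# to an absolute check (the relative difference here would be 0.1)
all.equal(0.001, 0.0011, tolerance = 0.01)           # TRUE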

In terms of implementation this leaves a few open questions:

1) Do we want this to be implicit behaviour as it is in base R (it is also implemented as implicit behaviour in testthat/waldo), or do we want to make it an explicit option that the user can toggle?

2) Within SAS the above "relative" comparison is actually known as "percent", with "relative" being a separate comparison formula (I think SAS's "relative" is actually "arithmetic mean change"; see the table in the following wiki article on this). Do we also want to support this method?

My current thinking is that for the short term we should just adopt the same implicit behaviour as base R, as this is pretty flexible and I would be willing to bet it covers 99% of use cases. In the longer term it may be worth opting for an explicit override interface where you can set different comparison methods for different variables, but that would likely require a much more substantial rework, so I am hesitant to implement it right now.

gowerc commented 3 months ago

The more I think about this the more I think it will need a minor overhaul. I was playing around with just implementing the current all.equal approach, but I couldn't get past the awkward edge case of "what if we only want to apply this to one variable?"

The following is my proposal for how I think we could handle this:

Introduce a method argument to diffdf that receives a comparison_control object. The comparison_control object is responsible for defining the default comparison method for each class. The default would be something like:

comparison_control(
    numeric = compare_absolute(tolerance = 1, scale = NULL),
    character = compare_character(),
    factor = compare_factor(),
    other = compare_exact(),
    override = list()
)

The key bit is then the override argument, which would allow for variable-specific deviations; e.g. as an end user you could specify something like:

diffdf(
    dat1,
    dat2,
    method = comparison_control(
        override = list( 
            "var1" = compare_relative(tolerance = 0.1),
            "var2" = compare_relative(tolerance = 0.5),
        )
    )
)

In particular, this approach would allow us to extend out a bunch of other features, for example allowing "case insensitive" comparisons for strings or "labels only" comparisons for factors. It would also allow us to implement the other comparison methods listed in the above wiki article.

I've only spent 5 minutes thinking about this, but I imagine perhaps something like the following:

compare_dynamic(tolerance, scale)        # equivalent to all.equal
compare_absolute(tolerance, scale)
compare_relative(tolerance)
compare_arithmetic_mean(tolerance)
compare_geometric_mean(tolerance)
compare_character(case_sensitive, trim)
compare_factor(labels_only)
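
To make the shape of these a bit more concrete, here is a rough, hypothetical sketch of how one of the constructors could work internally (none of these names or signatures are final): each compare_* call would return a function that takes the two column vectors and flags the elements that differ.

compare_relative <- function(tolerance = sqrt(.Machine$double.eps)) {
    function(target, current) {
        denom <- abs(target)
        # fall back to an absolute check where the target is zero,
        # to avoid dividing by zero
        ifelse(
            denom > 0,
            abs(target - current) / denom > tolerance,
            abs(target - current) > tolerance
        )
    }
}

# e.g. flag elements whose relative difference exceeds 10%
is_diff <- compare_relative(tolerance = 0.1)
is_diff(c(100, 0.001), c(105, 0.0012))
# [1] FALSE  TRUE
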
gowerc commented 3 months ago

Final comment for now: given the potential complexity of this, and the fact that it is likely to break backwards compatibility, I am going to push this as a feature for v2.0.0 rather than v1.1.0.

bundfussr commented 3 months ago

Looks good to me. It would solve the issues we had in our project when comparing datasets.

Would the new feature also allow users to implement their own comparisons? I recently wanted to compare datasets where some of the columns contained lists.

kieranjmartin commented 3 months ago

I think I agree with this approach, but we might need to think about the syntax. I think this also relates to #45 in terms of making what diffdf does much more customisable.

@bundfussr that's not a bad idea: if we make it so that each of those comparisons names a function, then the user could insert their own function, provided we had a clear definition of how comparison functions worked. This would obviously be part of a much larger change to the functioning of diffdf!

I am thinking it might look like this; passing a function name rather than a function call would, I think, make it easier to program.

diffdf(
    dat1,
    dat2,
    method = comparison_control(
        override = list( 
            "var1" = list(fun = compare_relative, args = list(tolerance = 0.5)),
        )
    )
)
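
Purely to illustrate the idea (hypothetical, not real diffdf code), the internal dispatch for such an entry could then build the comparison function from the name and arguments and apply it to the two columns:

# build the comparator from the named constructor plus its arguments,
# then apply it to the matching columns of the two datasets
entry <- list(fun = compare_relative, args = list(tolerance = 0.5))
comparator <- do.call(entry$fun, entry$args)
comparator(dat1$var1, dat2$var1)
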
kieranjmartin commented 3 months ago

As an FYI @gowerc, R Core actually updated that documentation a little while after that Stack Overflow post, so it should now be correct.

gowerc commented 3 months ago

@kieranjmartin

I must say I feel strongly about the user providing functions / lambdas as opposed to a list. If you need to embed arguments, I feel lambdas are definitely the more idiomatic way to go, e.g.:

diffdf(
    dat1,
    dat2,
    method = comparison_control(
        override = list( 
            "var1" = function (x) compare_relative(x, tolerance = 0.5)
        )
    )
)
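
(As a side note, on R >= 4.1 the same override can be written with the shorthand lambda syntax, i.e. "var1" = \(x) compare_relative(x, tolerance = 0.5).)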

Let me know if you strongly disagree 😅

gowerc commented 3 months ago

@bundfussr

Can I just double check what you mean by:

> Would the new feature also allow users to implement their own comparisons?

Do you mean just being able to provide different comparison methods on the fly, or do you mean you want something like user profiles / config? Regarding lists, we do have #36, i.e. we hope to be able to do this natively with dplyr in the future (most likely in the 2.0 update), if that helps at all.

bundfussr commented 3 months ago

> @bundfussr
>
> Can I just double check what you mean by:
>
> > Would the new feature also allow users to implement their own comparisons?
>
> Do you mean just being able to provide different comparison methods on the fly, or do you mean you want something like user profiles / config? Regarding lists, we do have #36, i.e. we hope to be able to do this natively with dplyr in the future (most likely in the 2.0 update), if that helps at all.

In the use case I was referring to it was just a single diffdf() call, i.e. specifying a user-specific comparison method on the fly would have been sufficient. But I'm sure there are other use cases where user profiles/configs would be better, so ideally having both options would be nice.

Native list comparison would be great. However, lists are very flexible, so I would expect that users will want to use their own comparison methods for list columns.
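
For illustration only, a user-supplied comparison for a list column under the proposed override interface could be as simple as something like this (hypothetical function and data, assuming an override can be any function of the two column vectors):

# flag cells whose list elements are not identical()
compare_list_identical <- function(target, current) {
    !mapply(identical, target, current)
}

compare_list_identical(
    list(1:3, "a"),
    list(1:3, "b")
)
# [1] FALSE  TRUE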