bundfussr opened 1 year ago
We are planning a minor overhaul of the package in Q4, so we will look to include this change then. Thank you for raising this.
Great, looking forward to the new version.
Apologies, a bit of a rambly post, but I just wanted to share my current understanding of this.
First off, I want to document the behaviour of base R's `all.equal` for reference (mostly because this confused me and I want to have this available to look back at later when I inevitably forget).

General syntax is `all.equal(target, current, tolerance, scale)`.

- If `scale` is not null then it will always use absolute differences.
- If `scale` is null then:
    - `mean(abs(target)) <= tolerance` will use absolute differences (with `scale` set to 1)
    - `mean(abs(target)) > tolerance` will use relative differences

Just to be explicit, the respective checks are:

- absolute: `abs((target - current) / scale) > tolerance`
- relative: `abs((target - current) / mean(abs(target))) > tolerance`
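A quick illustration of this switch using plain base R (outputs shown as comments):

```r
# mean(abs(target)) = 1000 > tolerance, so the difference is scaled by
# mean(abs(target)), i.e. a *relative* comparison is used
all.equal(1000, 1000.5, tolerance = 1e-4)
#> "Mean relative difference: 5e-04"
all.equal(1000, 1000.5, tolerance = 1e-3)
#> TRUE    (5e-04 <= 1e-3)

# mean(abs(target)) = 1e-05 <= tolerance, so the raw *absolute*
# difference (1e-05) is compared against the tolerance instead
all.equal(1e-5, 2e-5, tolerance = 1e-4)
#> TRUE    (even though the relative difference is 100%)

# Supplying scale forces absolute/scaled differences regardless
all.equal(1000, 1000.5, tolerance = 1e-4, scale = 1)
#> "Mean absolute difference: 0.5"
```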
There is a good Stack Overflow post on this here which explains what I've written above and shows how this deviates from the documentation.
In terms of implementation this leaves a few open questions:
1) Do we want this to be implicit behaviour as it is in base R (this is also implemented as implicit behaviour in testthat/waldo) or do we want to make it an explicit option for the user to toggle between?
2) Within SAS the above "relative" is actually known as "percent", with "relative" being a separate comparison formula (I think SAS's "relative" is actually "arithmetic mean change"; see the table in the following wiki article on this). Do we also want to support this method?
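To illustrate the distinction as I read it (this is just the wiki article's terminology, not anything in diffdf):

```r
target  <- 100
current <- 110

# "relative change" / SAS "percent": scale by the reference value
(current - target) / target                     # 0.1 (SAS reports x100, i.e. 10%)

# "arithmetic mean change": scale by the mean of the two values
(current - target) / ((target + current) / 2)   # ~0.0952
```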
My current thinking is that for the short term we should just adopt the same implicit behaviour as base R, as this is pretty flexible and I would be willing to bet covers 99% of use cases. Longer term I think it may be worth opting for an explicit override interface where you can set different comparison methods for different variables, but this would likely require a much more substantial rework so I am hesitant to implement it right now.
The more I think about this the more I think it will need a minor overhaul. I was playing around with just implementing the current `all.equal` approach, but I just couldn't get over the awkward edge case of "what if we only want to apply this for one variable?".
The following is my proposal for how I think we could handle this:
Introduce a `method` argument to `diffdf` that receives a `comparison_control` object. The `comparison_control` object is responsible for defining the default comparison methods for each class. The default would be something like:
```r
comparison_control(
    numeric = compare_absolute(tolerance = 1, scale = NULL),
    character = compare_character(),
    factor = compare_factor(),
    other = compare_exact(),
    override = list()
)
```
The key bit is then the `override` argument, which would allow for variable-specific deviations, e.g. as an end user you could specify something like:
```r
diffdf(
    dat1,
    dat2,
    method = comparison_control(
        override = list(
            "var1" = compare_relative(tolerance = 0.1),
            "var2" = compare_relative(tolerance = 0.5)
        )
    )
)
```
In particular this approach would allow us to extend out a bunch of other features, for example allowing "case insensitive" comparisons for strings or "labels only" comparisons for factors. It would also allow us to implement the other comparison methods listed in the above wiki article.
I've only spent 5 minutes thinking about this but I imagine perhaps something like the following:
```r
compare_dynamic(tolerance, scale)    # equivalent to all.equal
compare_absolute(tolerance, scale)
compare_relative(tolerance)
compare_arithmetic_mean(tolerance)
compare_geometric_mean(tolerance)
compare_character(case_sensitive, trim)
compare_factor(labels_only)
```
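As a rough sketch of how one of these constructors could work (purely illustrative; none of this exists in diffdf yet and the exact contract for comparison functions is still open):

```r
# Returns a comparison function that diffdf would call with the two
# column vectors; TRUE marks elements that differ beyond the tolerance.
compare_absolute <- function(tolerance = sqrt(.Machine$double.eps), scale = 1) {
    function(target, current) {
        abs(target - current) / scale > tolerance
    }
}

cmp <- compare_absolute(tolerance = 0.1)
cmp(c(1, 2, 3), c(1.05, 2.5, 3))
#> FALSE  TRUE FALSE
```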
Final comment for now: given the potential complexity of this, and that it is likely to break backwards compatibility, I am going to push this as a feature for v2.0.0 and not v1.1.0.
Looks good to me. It would solve the issues we had in our project when comparing datasets.
Would the new feature also allow users to implement their own comparisons? I recently wanted to compare datasets where some of the columns contained lists.
I think I agree with this approach, but we might need to think about the syntax. I think this might also relate to #45 in terms of making what diffdf does much more customisable.
@bundfussr that's not a bad idea. If we make it so that each of those comparisons names a function, then the user could insert their own function, provided we had a clear definition of how comparison functions worked. This would obviously be part of a much larger change to the functioning of diffdf!
I am thinking it might look like this; passing a function name rather than a function I think would make it easier to program.
```r
diffdf(
    dat1,
    dat2,
    method = comparison_control(
        override = list(
            "var1" = list(fun = compare_relative, args = list(tolerance = 0.5))
        )
    )
)
```
As an FYI @gowerc, R Core actually updated that documentation a little while after that Stack Overflow post, so it should now be correct.
@kieranjmartin
I must say I feel strongly about the user providing functions / lambdas as opposed to a list. If you need to embed arguments, I feel lambdas are definitely the more idiomatic way to go, e.g.
```r
diffdf(
    dat1,
    dat2,
    method = comparison_control(
        override = list(
            "var1" = function(x) compare_relative(x, tolerance = 0.5)
        )
    )
)
```
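This would also cover @bundfussr's list-column case, since the user could drop in an arbitrary function. Something like the following, assuming the (still hypothetical) contract is "receive the two column vectors, return a logical vector flagging differences":

```r
# Purely illustrative: a user-written comparison for a list column
compare_lists <- function(target, current) {
    !mapply(function(t, c) isTRUE(all.equal(t, c)), target, current)
}

diffdf(
    dat1,
    dat2,
    method = comparison_control(
        override = list(
            "list_var" = compare_lists
        )
    )
)
```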
Let me know if you strongly disagree 😅
@bundfussr
Can I just double check what you mean by:

> Would the new feature also allow users to implement their own comparisons?

Do you mean just being able to provide different comparison methods on the fly? Or do you mean you want something like user profiles / config? Regarding lists, we do have #36, i.e. we hope to be able to do this natively in the future with dplyr (most likely in update 2.0), if that helps at all.
> Do you mean just being able to provide different comparison methods on the fly? Or do you mean you want something like user profiles / config?
In the use case I was referring to it was just a single `diffdf()` call, i.e. specifying a user-specific comparison method on the fly would have been sufficient. But I'm sure there are other use cases where user profiles/configs would be better, so ideally having both options would be nice.

Native list comparison would be great. However, lists are very flexible, so I would expect that users will want to use their own comparison methods for list columns.
For columns whose values span a wide range of magnitudes, e.g. ratios, the comparison does not work well if a single absolute `tolerance` is used. I would suggest implementing a relative criterion like the one in proc compare (https://documentation.sas.com/doc/en/pgmsascdc/9.4_3.5/proc/p0bbu58eqgufwzn16zafm1hvzfw2.htm#p0jwvxsipe3919n1n0areynrgdok).
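For reference, my understanding of the relative criterion described on that SAS page is roughly the following (the function name and delta handling here are just my sketch, not diffdf or SAS code):

```r
# Difference scaled by the mean of the absolute values; delta guards
# against division by zero when both values are zero.
relative_diff <- function(target, current, delta = 0) {
    abs(target - current) / ((abs(target) + abs(current)) / 2 + delta)
}

# Values of very different magnitudes are judged on the same scale:
relative_diff(c(0.001, 1000), c(0.0011, 1100))
#> 0.0952381 0.0952381
```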