gowerc / diffdf

DataFrame Comparison Tool
https://gowerc.github.io/diffdf/
Other
42 stars 5 forks source link

Granular Key Assertions #106

Closed gowerc closed 1 month ago

gowerc commented 2 months ago

Closes #76

github-actions[bot] commented 2 months ago

badge

Code Coverage Summary

Filename                Stmts    Miss  Cover    Missing
--------------------  -------  ------  -------  --------------------------
R/ascii_tables.R          105       8  92.38%   10, 148, 158, 163-166, 211
R/cast_variables.R         49       0  100.00%
R/diffdf.R                209      18  91.39%   373-390, 417
R/generate_keyname.R       10       1  90.00%   16
R/identify.R              152       8  94.74%   283-290
R/is_different.R           52       0  100.00%
R/issuerows.R              40       0  100.00%
R/issues.R                 17       1  94.12%   51
R/misc_functions.R         34       2  94.12%   9, 13
R/print.R                  20       0  100.00%
TOTAL                     688      38  94.48%

Results for commit: 1acb5051e6aa1e4919e3f3fe36c775c6c8a236b8

Minimum allowed coverage is 80%

:recycle: This comment has been updated with latest results

github-actions[bot] commented 2 months ago

Unit Tests Summary

  1 files   13 suites   6s :stopwatch:  52 tests  51 :white_check_mark: 1 :zzz: 0 :x: 578 runs  571 :white_check_mark: 7 :zzz: 0 :x:

Results for commit 1acb5051.

:recycle: This comment has been updated with latest results.

gowerc commented 2 months ago

I am wondering if we are changing this if we might also tell the user what the two different modes/classes are?

I think it would be much cleaner to have something like

"Error, key1 is numeric in BASE and character in COMPARE" rather than mentioning modes at all, which the user might not understand.

Similarly

"Error, key1 has class 'A' in BASE and class 'B' in COMPARE"

I'm not sure... like I get the appeal of this but in practice it's kinda messy. If there are several variables with mismatches or if its a variable with lots of classes it is likely difficult to print this in a nice way; ideally you'd probs then want to print a table but this then just feels like overkill. My gut feeling is what we have here is more than enough to inform the user on what to look into to fix the error.

kieranjmartin commented 2 months ago

Github doesn't have threaded replies :(

I think my main issue with the error message as it is is that mode may not mean anything to the user right now. Maybe even an explanation "A and B have different modes (different types of data, e.g. numeric and character)"

or maybe even just say they have different classes, as that will also be true I think, and classes is better understood, and we error if they dont match on class anyway?

gowerc commented 1 month ago

@kieranjmartin - Can you remember why we ended up using mode instead of typeof ? Doing a very quick service scan it almost feels like typeof is more granular and thus accurate. I also imagine messages of "are a different type" would be more meaningful that "different mode"

EDIT ---

Useful breakdown of where they differ to each other link some other commenters there seem to be inferring that typeof and class are the most important and that mode is pretty much value-less

Overall I'm not sure how I want to proceed with this, I really don't want to say "different types" or "different classes" if we are comparing modes because yes whilst many users won't know the difference that doesn't change the fact that it would be wrong / misleading. Potentially we could just add a "(see ?mode)" to point users in the right direction ?

kieranjmartin commented 1 month ago

Potentially we could just add a "(see ?mode)" to point users in the right direction ?

Yes maybe this is the best course forwards actually, just so they have some guidance as to what it means, as I don't think mode is a commonly used term (and honestly I would have to look it up to remember exactly what it means). I do not recall why we did not use type; I think mode is perhaps more pedantic than type, and we wanted to be that pedantic.? Although that list seems to say not

gowerc commented 1 month ago

Yer I just ran diffdf tests using type instead of mode and a load of the factor comparison code falls over. I think this is because factors store as ints that have type "integer" whilst reals have type "double" they both have a mode of "number" that is to say mode appears to be more fussy than type. I'm guessing at the time we thought this level of comparison doesn't matter though I feel inclined to disagree now (especially given that we have "strict_factor" to opt in and out of such pedantic comparisons which can be used to mask this).

I think I would propose switching to typeof instead though perhaps for v2 as its likely to introduce some backwards compatible breaks