paulfitz / daff

align and compare tables
https://paulfitz.github.io/daff
MIT License

question: Is it possible to specify a tolerance value for floating point comparisons? #155

Closed suyashb95 closed 3 years ago

suyashb95 commented 4 years ago

Is there a compare flag that can be used to specify a small tolerance value such that the diffing algorithm ignores changes where the delta is less than the tolerance?

paulfitz commented 4 years ago

Good question! There isn't. There was a discussion about this in #59. To summarize:

  1. Suppose daff is giving you diffs where rows are aligned correctly but some cells are shown as changed because of floating point issues. Fixing this is fairly easy.
  2. Suppose daff is giving you diffs where a row in the original table and a row in the current table are treated as different because of floating point issues. Fixing this is fairly hard.

What do you say if instead of tolerance there were quantization, meaning rounding to a certain number of decimal places? In that case, this would be a fairly easy fix. The difference is whether hashing can be used to find matches or you need to do an N-to-N comparison.
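To make the hashing versus N-to-N distinction concrete, here is a minimal sketch in plain Python. It is not daff's matching code; the function names, the sample rows, and the choice of two decimal places are all assumptions made for illustration. With quantization, each row is reduced to a rounded key that can be looked up in a dictionary. With a tolerance, there is no single key to hash, so every old row has to be checked against every new row.

```python
from collections import defaultdict


def match_by_quantized_hash(old_rows, new_rows, decimals=2):
    """Match rows that agree after rounding, using a dict (hash) lookup."""
    index = defaultdict(list)
    for i, row in enumerate(old_rows):
        key = tuple(round(v, decimals) for v in row)
        index[key].append(i)

    matches = []
    for j, row in enumerate(new_rows):
        key = tuple(round(v, decimals) for v in row)
        for i in index.get(key, []):
            matches.append((i, j))
    return matches


def match_by_tolerance(old_rows, new_rows, tol=0.01):
    """Match rows whose values all differ by at most tol.

    There is no key to hash on, so every pair of rows is compared.
    """
    matches = []
    for i, a in enumerate(old_rows):
        for j, b in enumerate(new_rows):
            if len(a) == len(b) and all(abs(x - y) <= tol for x, y in zip(a, b)):
                matches.append((i, j))
    return matches


# Hypothetical sample data: the same two rows, reordered and slightly perturbed.
old_rows = [(1.001, 2.5), (3.75, 4.0)]
new_rows = [(3.7501, 4.0), (1.0, 2.5)]

hash_matches = match_by_quantized_hash(old_rows, new_rows)
tolerance_matches = match_by_tolerance(old_rows, new_rows)
```

Both functions pair up the same rows on this input, but the first does one dictionary lookup per row while the second does work proportional to the product of the two table sizes.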

suyashb95 commented 4 years ago

@paulfitz thanks for summarizing!

For some more context, I'm using the Python bindings for daff, as shown in the example.

The problem I'm facing falls into the first category. There is a defined primary key and rows are aligned correctly, but tiny differences in numbers are highlighted as changes (like 123.45 -> 123.46). I'm guessing in this case we can post-process the diff somehow to remove these?

I've tried rounding the data to 2 decimal places before running daff and the diff is significantly cleaner, but there are a few cases where one value is rounded up and another is rounded down because of floating point precision differences.

Having quantization as a feature sounds good, I'm not sure if hashing is a good idea for numerical comparisons though. What do you think?
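The rounding problem described above can be reproduced with two values that sit on either side of a rounding boundary. This is a small illustration with made-up numbers, not data from the thread: the values differ by about 0.000002, yet rounding to 2 decimal places sends them to different results, while a direct tolerance check treats them as equal.

```python
import math

# Hypothetical values that straddle the 123.455 rounding boundary.
a = 123.454999
b = 123.455001

rounded_a = round(a, 2)  # rounds down
rounded_b = round(b, 2)  # rounds up

# After rounding, the two values look like a real change of 0.01.
rounded_differ = rounded_a != rounded_b

# Comparing the raw values against an absolute tolerance avoids the boundary effect.
within_tolerance = math.isclose(a, b, rel_tol=0.0, abs_tol=1e-5)
```

This is why quantizing before diffing cleans up most spurious changes but not all of them: any pair of near-equal values can still land on opposite sides of a boundary.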

paulfitz commented 4 years ago

@Suyash458 I added a daff --ignore-epsilon 0.1 flag for ignoring floating point differences up to a threshold (for non-primary-key comparisons). Hope this helps.
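As a rough picture of what an epsilon threshold on non-key cells means, here is a plain-Python sketch. It is not daff's implementation: the helper names `cells_equal` and `changed_cells`, the sample rows, and the choice of a strict `<` comparison are assumptions for illustration only. Key columns are compared exactly, and every other cell counts as unchanged when both values parse as numbers and differ by less than the threshold.

```python
def cells_equal(a, b, epsilon):
    """Return True if two cells are identical, or numeric and closer than epsilon."""
    if a == b:
        return True
    try:
        return abs(float(a) - float(b)) < epsilon
    except (TypeError, ValueError):
        # At least one cell is not numeric, so fall back to exact comparison.
        return False


def changed_cells(old_row, new_row, epsilon, key_columns=()):
    """List the columns whose values differ, applying epsilon to non-key columns only."""
    changed = []
    for col in old_row:
        if col in key_columns:
            same = old_row[col] == new_row[col]
        else:
            same = cells_equal(old_row[col], new_row[col], epsilon)
        if not same:
            changed.append(col)
    return changed


# Hypothetical rows that are already aligned on the primary key "id".
old = {"id": "7", "price": "123.45", "name": "widget"}
new = {"id": "7", "price": "123.46", "name": "gadget"}

loose = changed_cells(old, new, epsilon=0.1, key_columns=("id",))
strict = changed_cells(old, new, epsilon=0.001, key_columns=("id",))
```

With the looser threshold only the `name` change is reported; with the tighter one the `price` change shows up as well.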

suyashb95 commented 4 years ago

@paulfitz whew that was fast 🙂, thanks a lot for adding this feature! I'm guessing w.r.t. the Python API it's equivalent to the snippet below?

import daff

flags = daff.CompareFlags()
flags.ignore_epsilon = 0.01  # ignore numeric differences below this threshold

I will try it out and let you know.

suyashb95 commented 3 years ago

Apologies for being super late on this, but it works as expected! Tested with v1.3.46.