red6 / pdfcompare

A simple Java library to compare two PDF files
Apache License 2.0

Make large diff blocks more important #129

Closed: jsmucr closed this 1 year ago

jsmucr commented 1 year ago

Would it be possible to flag larger diff areas as more important than others?

Currently, a large number of single-pixel differences scattered all over the page can easily cross the diff percent threshold, whereas a single but significant difference, such as swapped words in a text block, passes the test unnoticed.

finsterwalder commented 1 year ago

I'm not sure I understand you correctly. You want to use "allowedDifferenceInPercentPerPage", but when a lot of pixels differ that are very close together, you still want to flag that as a difference? So a few pixels that differ scattered over the page is ok, but when all the difference occurs in one place, it's still flagged?

That is not so easy to achieve, since the software does not keep track of pixel location in relation to other pixels. I just handle each pixel individually, so to speak. The best you can do for now is: Try to keep your PDFs as identical as possible. Lower the "allowedDifferenceInPercentPerPage" value as much as you can without getting false positives. Use explicit exclusions where things change regularly.

If you want, you could of course implement the change yourself. It's open source after all. I will not implement that (at least it's very, very unlikely).
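To illustrate the per-pixel handling described above, here is a minimal sketch of a percentage-based check over a boolean diff mask. It is not PdfCompare's actual code: the class `PercentThreshold`, the mask representation, and the `<=` comparison are all assumptions made for illustration. The point is that only the count of differing pixels enters the result, so their position has no influence.

```java
/** Hypothetical illustration of a per-page percentage threshold; not PdfCompare code. */
final class PercentThreshold {

    /** Returns the percentage of cells in the mask that are marked as differing. */
    static double diffPercent(boolean[][] diff) {
        long differing = 0;
        long total = 0;
        for (boolean[] row : diff) {
            for (boolean cell : row) {
                total++;
                if (cell) {
                    differing++;
                }
            }
        }
        return total == 0 ? 0.0 : 100.0 * differing / total;
    }

    /** Assumed rule: the page passes when the percentage does not exceed the allowed value. */
    static boolean passes(boolean[][] diff, double allowedPercent) {
        return diffPercent(diff) <= allowedPercent;
    }
}
```

With this kind of check, four differing pixels spread over the corners of a page and four differing pixels forming one solid block produce exactly the same percentage.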

jsmucr commented 1 year ago

> So a few pixels that differ scattered over the page is ok, but when all the difference occurs in one place, it's still flagged?

Yes.

The issue I'm facing is that I compare PDFs created on Linux vs Windows, and the fonts differ slightly between the two systems. This leads to a large number of tiny, ignorable differences. However, if there's a large contiguous area with a significant difference (in my case a large text block with words swapped), that difference cannot be ignored, even though the total number of differing pixels is still below the threshold.
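One possible way to express "a large block of difference matters more" is to measure the largest connected group of differing pixels instead of the total count. The sketch below is an assumption about how such a check could look, not something PdfCompare provides: the class `DiffClusters`, the boolean mask input, and the choice of 4-connectivity are all hypothetical. It assumes a rectangular mask.

```java
import java.util.ArrayDeque;

/** Hypothetical sketch: size of the largest connected diff area; not part of PdfCompare. */
final class DiffClusters {

    /** Returns the size of the largest 4-connected group of true cells in a rectangular mask. */
    static int largestCluster(boolean[][] diff) {
        int h = diff.length;
        int w = h == 0 ? 0 : diff[0].length;
        boolean[][] seen = new boolean[h][w];
        int[] dy = {1, -1, 0, 0};
        int[] dx = {0, 0, 1, -1};
        int best = 0;
        for (int y = 0; y < h; y++) {
            for (int x = 0; x < w; x++) {
                if (!diff[y][x] || seen[y][x]) {
                    continue;
                }
                // Breadth-first flood fill starting from this unvisited differing pixel.
                int size = 0;
                ArrayDeque<int[]> queue = new ArrayDeque<>();
                queue.add(new int[] {y, x});
                seen[y][x] = true;
                while (!queue.isEmpty()) {
                    int[] p = queue.poll();
                    size++;
                    for (int d = 0; d < 4; d++) {
                        int ny = p[0] + dy[d];
                        int nx = p[1] + dx[d];
                        if (ny >= 0 && ny < h && nx >= 0 && nx < w
                                && diff[ny][nx] && !seen[ny][nx]) {
                            seen[ny][nx] = true;
                            queue.add(new int[] {ny, nx});
                        }
                    }
                }
                best = Math.max(best, size);
            }
        }
        return best;
    }
}
```

A page could then be flagged when `largestCluster` exceeds some chosen size, regardless of the overall percentage. Note that swapped words usually show up as many separate glyph-sized groups rather than one solid area, so a real implementation would probably need to merge nearby groups first or count differences per tile.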

As of now, the only workaround I can think of would be lowering both the resolution and the threshold, so that small, insignificant differences just disappear.
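The reasoning behind this workaround can be sketched as follows: averaging blocks of pixels dilutes an isolated differing pixel, while a solid differing area survives the averaging. This is only an illustration under stated assumptions. The class `Downsample`, the grayscale `int[][]` input (values 0..255), the block `factor`, and the per-pixel `tolerance` are hypothetical and do not correspond to PdfCompare settings.

```java
/** Hypothetical sketch of comparing images at reduced resolution; not part of PdfCompare. */
final class Downsample {

    /**
     * Averages non-overlapping factor x factor blocks of a rectangular grayscale image.
     * Trailing rows and columns that do not fill a whole block are dropped.
     */
    static double[][] shrink(int[][] gray, int factor) {
        int h = gray.length / factor;
        int w = h == 0 ? 0 : gray[0].length / factor;
        double[][] out = new double[h][w];
        for (int y = 0; y < h; y++) {
            for (int x = 0; x < w; x++) {
                long sum = 0;
                for (int j = 0; j < factor; j++) {
                    for (int i = 0; i < factor; i++) {
                        sum += gray[y * factor + j][x * factor + i];
                    }
                }
                out[y][x] = (double) sum / (factor * factor);
            }
        }
        return out;
    }

    /** Counts cells whose values differ by more than the tolerance; inputs must have equal size. */
    static int countDiffering(double[][] a, double[][] b, double tolerance) {
        int count = 0;
        for (int y = 0; y < a.length; y++) {
            for (int x = 0; x < a[y].length; x++) {
                if (Math.abs(a[y][x] - b[y][x]) > tolerance) {
                    count++;
                }
            }
        }
        return count;
    }
}
```

The trade-off is that the block size and tolerance decide what counts as insignificant, so a small but meaningful change, such as a single altered digit, could also be averaged away.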

finsterwalder commented 1 year ago

I totally understand your problem now. And your request would be a valuable addition to PdfCompare. I just don't have the capacity to implement that. Sorry.

jsmucr commented 1 year ago

@finsterwalder No problem. :-) Maybe I will at some point.