red6 / pdfcompare

A simple Java library to compare two PDF files
Apache License 2.0

inclusions and exclusions #30

Open Ebsy opened 6 years ago

Ebsy commented 6 years ago

Hi, it goes without saying that the exclusions array in the HOCON file is incredibly useful. Would it also be feasible to specify content areas that should be tested, as well as ones that shouldn't?

An array of pages that should be tested would also be great. Possibly also a random selection of pages.

Don't get me wrong, I will also look into the source and see whether these features are something I could contribute to the project. I just thought I'd ask whether any work has started on them.

finsterwalder commented 6 years ago

Do I understand you correctly? You want to specify inclusion regions for one page or all pages. When you do so, only the areas inside the inclusion regions are checked for differences. Exclusions inside the inclusion regions are still obeyed. I never thought about this... I will think about whether to put it in a different file or the same file... I will get back to you...
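The semantics described above (compare only inside inclusion regions, while still obeying exclusions) can be sketched roughly as follows. This is not pdfcompare's API: `RegionFilter`, `Region` and `shouldCompare` are hypothetical names made up for illustration, and the sketch assumes that an empty inclusion list means "compare everything" and that a region without a page applies to every page.

```java
import java.util.List;

// Hypothetical sketch, not part of pdfcompare.
class RegionFilter {
    /** Rectangle in page pixel coordinates; a null page means "every page". */
    static final class Region {
        final Integer page;
        final int x1, y1, x2, y2;

        Region(Integer page, int x1, int y1, int x2, int y2) {
            this.page = page;
            this.x1 = x1; this.y1 = y1; this.x2 = x2; this.y2 = y2;
        }

        boolean contains(int pageNo, int x, int y) {
            boolean pageMatches = (page == null) || (page == pageNo);
            return pageMatches && x >= x1 && x <= x2 && y >= y1 && y <= y2;
        }
    }

    private final List<Region> inclusions;
    private final List<Region> exclusions;

    RegionFilter(List<Region> inclusions, List<Region> exclusions) {
        this.inclusions = inclusions;
        this.exclusions = exclusions;
    }

    /** A pixel is compared if it lies in some inclusion (or none are defined) and in no exclusion. */
    boolean shouldCompare(int pageNo, int x, int y) {
        boolean included = inclusions.isEmpty() || anyContains(inclusions, pageNo, x, y);
        return included && !anyContains(exclusions, pageNo, x, y);
    }

    private static boolean anyContains(List<Region> regions, int pageNo, int x, int y) {
        for (Region r : regions) {
            if (r.contains(pageNo, x, y)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        RegionFilter filter = new RegionFilter(
                List.of(new Region(1, 100, 100, 500, 700)),       // content box on page 1
                List.of(new Region(null, 200, 200, 250, 250)));   // excluded box on every page
        System.out.println("inside inclusion:  " + filter.shouldCompare(1, 150, 150));
        System.out.println("inside exclusion:  " + filter.shouldCompare(1, 220, 220));
        System.out.println("outside inclusion: " + filter.shouldCompare(1, 50, 50));
    }
}
```

One consequence of this particular sketch: as soon as any inclusion is defined, a page with no matching inclusion region is not compared at all. Whether that is the desired behaviour is exactly the kind of design question raised above.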

Ebsy commented 6 years ago

Yes exactly.

For example, I often need to compare multi-page PDFs, but checking pages 1, 3 and 5 (out of 100 pages) is enough.

config.conf

    inclusions: [
        {
            page: [1, 3, 5] // or 'rand' would be great.
        }
    ]

As for the areas, many of the PDFs have variable data surrounding the page content (e.g. barcodes, service lines etc.). These don't need to be compared, just the content in the 'centre', so specifying one content box would be easier. Right now (since yesterday ;)) I simply add exclusion boxes for these variable elements.

finsterwalder commented 6 years ago

What do you mean by "rand"? The current workaround is to create enough exclusions, of course. But I can see how inclusions could make it easier in those situations.

Ebsy commented 6 years ago

'rand' being a random page, or 'rand(6)' to pick 6 random pages to compare.

finsterwalder commented 6 years ago

What sense does it make to compare random pages?

Ebsy commented 6 years ago

Well, wouldn't it be quicker to compare just a subset of pages rather than the entire document? At least in theory?

finsterwalder commented 6 years ago

Quicker, yes. But you are only comparing a subset, so you are losing confidence. When comparing only random pages, you also lose reproducibility. Your tests fail randomly as well. A very bad trade-off if you ask me...

Ebsy commented 6 years ago

When one is dealing with hundreds of thousands of pages spread over hundreds of documents, it's impracticable from a time/resources point of view to compare each and every one (outside of a dev/test environment). If a small subset is compared (and the specific pages are reported in the output), then the test is, of course, reproducible, and 'spot checks' could be inserted into the production workflow without delaying the process too much.
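A minimal sketch of how such a reproducible spot check might pick its pages, assuming a seed that is logged together with the chosen page numbers. `PageSampler` and `samplePages` are hypothetical names and not part of pdfcompare; the sketch only selects pages and leaves the actual comparison out.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

// Hypothetical sketch, not part of pdfcompare.
class PageSampler {
    /**
     * Picks up to sampleSize distinct 1-based page numbers, sorted ascending.
     * The same pageCount, sampleSize and seed always yield the same pages.
     */
    static List<Integer> samplePages(int pageCount, int sampleSize, long seed) {
        if (pageCount < 0 || sampleSize < 0) {
            throw new IllegalArgumentException("pageCount and sampleSize must not be negative");
        }
        List<Integer> pages = new ArrayList<>();
        for (int p = 1; p <= pageCount; p++) {
            pages.add(p);
        }
        // Seeded shuffle: the randomness is fully determined by the seed.
        Collections.shuffle(pages, new Random(seed));
        List<Integer> sample = new ArrayList<>(pages.subList(0, Math.min(sampleSize, pageCount)));
        Collections.sort(sample);
        return sample;
    }

    public static void main(String[] args) {
        long seed = System.currentTimeMillis();
        List<Integer> pages = samplePages(100, 6, seed);
        // Reporting the seed and pages lets a failed spot check be rerun on the same pages.
        System.out.println("seed=" + seed + " pages=" + pages);
    }
}
```

Rerunning with the logged seed (or passing the reported pages as an explicit page list) repeats the check on the same pages, which addresses the reproducibility concern. The loss of confidence from checking only a subset remains.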

At the end of the day, it's just a feature idea and not a deal-breaker!