Does not render PDF correctly

Needles404 commented 5 years ago

I've been trying to use BRISS to select the parts of a document to crop so that we can print only the required parts. On the whole this works well. The only issue I have is that when the file is loaded for selecting the areas to keep, it is not rendered correctly, which leaves us guessing where the selections need to go.

I've attached the original file.

Environment is Windows 10, Java Build 1.8.0_191-b12.

carton labels - FBA15BRWG0BD.pdf

cleydyr commented 5 years ago

I have verified this bug. The current selection boundaries are not clear when the document is loaded.

fatso83 commented 4 years ago

@Needles404 Have you tried using the original 0.9 version of Briss? It would be interesting to know if the bug is a regression or if it was present in the original source that was forked. The original Briss, awkward as it was to use, always did work perfectly, IMHO.

Needles404 commented 4 years ago

Hi, yeah, this issue first appeared when using 0.9. I subsequently searched for more recent versions/forks, but they exhibit the same issue.

cleydyr commented 4 years ago

I could verify that the images generated by PdfDecoder match the pages of the document. However, the BufferedImages that are being shown and which are used to produce the rectangles are already "broken". Even if I tune some parameters of the algorithms and the rectangle is shown, there's no point in cropping the document as the preview itself is "broken".

That's kinda the most I can do as I still barely understand the algorithms in all the steps that Briss uses.

mbaeuerle commented 4 years ago

@Needles404 I think I've tracked down the issue. Yeah, I know, a little bit late, but maybe it's helpful for you nevertheless.

As @cleydyr wrote, the PdfDecoder returns a perfect image. And even if you apply cropping, the resulting PDF looks perfectly fine.

It looks like the issue lies in the algorithm used to calculate the overlay image in ClusterImageData.calculateSdOfImages. I am guessing, but from the naming and from looking at the algorithm I suppose the "sd" stands for standard deviation, and it basically computes how far each pixel value is from the mean across the pages. Because every page in this particular PDF looks the same except for some parts of the barcode, the black parts basically cancel each other out, as the standard deviation is very small or zero.
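To make the cancellation concrete, here is a minimal Java sketch of a per-pixel standard deviation across pages. It is not the actual code of ClusterImageData.calculateSdOfImages; the class name `SdOverlaySketch`, the method `standardDeviation`, and the plain `int` arrays standing in for grayscale page images are assumptions made for illustration.

```java
import java.util.Arrays;

public class SdOverlaySketch {

    /**
     * Per-pixel population standard deviation across pages.
     * pages[p][y][x] is a grayscale value in 0..255 (0 = black, 255 = white).
     * Hypothetical stand-in for the overlay computation, not the Briss code.
     */
    public static double[][] standardDeviation(int[][][] pages) {
        int n = pages.length;
        int height = pages[0].length;
        int width = pages[0][0].length;
        double[][] sd = new double[height][width];
        for (int y = 0; y < height; y++) {
            for (int x = 0; x < width; x++) {
                double sum = 0;
                for (int[][] page : pages) {
                    sum += page[y][x];
                }
                double mean = sum / n;
                double squared = 0;
                for (int[][] page : pages) {
                    double d = page[y][x] - mean;
                    squared += d * d;
                }
                sd[y][x] = Math.sqrt(squared / n);
            }
        }
        return sd;
    }

    public static void main(String[] args) {
        // A tiny 2x2 "label" with two black pixels, repeated on four identical pages.
        int[][] label = { {0, 255}, {255, 0} };
        int[][][] identical = { label, label, label, label };
        // Every pixel equals its own mean, so the overlay is empty:
        // prints [[0.0, 0.0], [0.0, 0.0]]
        System.out.println(Arrays.deepToString(standardDeviation(identical)));
    }
}
```

Black pixels that are the same on every page end up with the same value (zero) as the white background, which would explain a blank preview for a document made of near-identical pages.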

I verified this with this prepared PDF, which has 4 identical pages, and the result is empty, as suspected.

To fix this issue, I think the algorithm needs to be replaced. I will check how this could be done.

mbaeuerle commented 4 years ago

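@Needles404 PR #26 (https://github.com/mbaeuerle/Briss-2.0/pull/26) fixes the issue for your PDF by using a new image overlay algorithm: [image: grafik] https://user-images.githubusercontent.com/1345394/80039167-77356d00-84f7-11ea-8f55-06261a22b5b5.png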

However, now that I've tinkered with this new algorithm, I begin to understand why the standard deviation was used in the first place. Say you have a PDF with multiple pages but the same styling, e.g. a PowerPoint presentation with a logo on each slide at the same position. The standard deviation will then strip away the recurring parts like the logo. This is most often what you want, as only the differing content is what's interesting.
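The slide-deck case can be illustrated with a few flattened "pages". This is a hedged sketch, not code from Briss or PR #26; the class `RecurringContentSketch`, the method `sdAt`, and the pixel values are invented for the example.

```java
public class RecurringContentSketch {

    /** Population standard deviation of one pixel position across all pages. */
    public static double sdAt(int[][] pages, int i) {
        double mean = 0;
        for (int[] page : pages) {
            mean += page[i];
        }
        mean /= pages.length;
        double variance = 0;
        for (int[] page : pages) {
            variance += (page[i] - mean) * (page[i] - mean);
        }
        return Math.sqrt(variance / pages.length);
    }

    public static void main(String[] args) {
        // Each row is one flattened "slide"; 0 = black, 255 = white.
        // Index 0 is a logo pixel (black on every slide),
        // index 3 is margin (white on every slide),
        // indexes 1 and 2 hold content that differs between slides.
        int[][] slides = {
            {0, 0,   255, 255},
            {0, 255, 0,   255},
            {0, 255, 255, 255},
        };
        for (int i = 0; i < slides[0].length; i++) {
            System.out.println("pixel " + i + ": sd = " + sdAt(slides, i));
        }
    }
}
```

The logo and the margin both come out as zero, while the two content positions get a clearly non-zero value, so only the area that actually changes between slides stands out in the overlay.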

In your case, however, this approach doesn't work very well because the pages are almost identical on purpose. It may therefore make sense to offer the new algorithm as a fallback when the other one does not work.

Needles404 commented 4 years ago

Hey,

I really appreciate your work on this; it's going to save me a massive headache. Interesting to note the reasons for the initial issue.

Many thanks

mbaeuerle commented 4 years ago

@Needles404 I finally had time to finish this. You can find the new version here: https://github.com/mbaeuerle/Briss-2.0/releases/tag/v2.0-alpha-3 Give it a shot and let me know if this works for you :)

With this new version, the preview falls back to another algorithm if a certain amount of the content is similar on all pages.
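The thread does not say how "a certain amount" is measured or which overlay the fallback uses, so the following Java sketch is purely hypothetical. The class `OverlayFallbackSketch`, the 0.5 limit, the `staticShare` measure, and the darkest-pixel overlay are all assumptions chosen to show one way such a fallback could be wired up.

```java
import java.util.Arrays;

public class OverlayFallbackSketch {

    /** Hypothetical limit: fall back when more than half of the inked pixels never change. */
    public static final double STATIC_SHARE_LIMIT = 0.5;

    /** Share of inked pixels (non-white on at least one page) that are identical on all pages. */
    public static double staticShare(int[][] pages) {
        int inked = 0;
        int unchanged = 0;
        for (int i = 0; i < pages[0].length; i++) {
            int min = 255;
            int max = 0;
            for (int[] page : pages) {
                min = Math.min(min, page[i]);
                max = Math.max(max, page[i]);
            }
            if (min == 255) {
                continue; // white on every page: carries no content
            }
            inked++;
            if (min == max) {
                unchanged++; // same value on every page
            }
        }
        return inked == 0 ? 0.0 : (double) unchanged / inked;
    }

    /** Decide whether the standard-deviation overlay would hide too much. */
    public static boolean useFallback(int[][] pages) {
        return staticShare(pages) > STATIC_SHARE_LIMIT;
    }

    /** Assumed fallback overlay: darkest value per pixel, so static content stays visible. */
    public static int[] darkestOverlay(int[][] pages) {
        int[] overlay = new int[pages[0].length];
        Arrays.fill(overlay, 255);
        for (int[] page : pages) {
            for (int i = 0; i < overlay.length; i++) {
                overlay[i] = Math.min(overlay[i], page[i]);
            }
        }
        return overlay;
    }

    public static void main(String[] args) {
        // Two nearly identical flattened "labels"; only index 2 differs.
        int[][] labels = { {0, 0, 0, 255}, {0, 0, 255, 255} };
        System.out.println(useFallback(labels));                     // true
        System.out.println(Arrays.toString(darkestOverlay(labels))); // [0, 0, 0, 255]
    }
}
```

With a check like this, a slide deck where most inked pixels differ between pages would keep the standard-deviation overlay, while a stack of near-identical labels would switch to the overlay that preserves static content.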

Needles404 commented 4 years ago

Thanks again, it is now showing a usable preview, which is excellent. The preview appears to be missing some text. I can only attribute this to the use of language packs, since the missing text is mainly Asian characters, so this would probably be a different issue.

Package - FBA15CWS6NN8.pdf

mbaeuerle / Briss-2.0

Does not render PDF correctly #18