red6 / pdfcompare

A simple Java library to compare two PDF files
Apache License 2.0

Pdf to PNG issue? Difference found based on JVM version #76

Status: Closed (v3g3t4x closed this issue 4 years ago)

v3g3t4x commented 4 years ago

Hi guys, I love this project. I have a strange issue: the comparison reports differences between two PDFs that are equal. During the pixel-by-pixel comparison I see some pixels with different values, but only on Linux; if I run it on Windows everything seems to work fine. When I open the two PDFs, they look the same. Is it possible that, for some reason, the same PDF sometimes produces a slightly different PNG?

finsterwalder commented 4 years ago

I have never observed the same PDF generating different images on the same machine. I have observed that two PDFs that look alike to a human observer can have very small differences. I think it's possible that there are small differences between Linux and Windows, but I have never observed that myself. Different Java versions may produce slightly different results as well. Do you use the identical Java version on Linux and Windows? Other than that, I don't know whether I can help. The rendering to images is done by the PdfBox library, so I can't help there. I could only help if the problem were in the comparison itself. If you can't solve your problem, the allowedDifferenceInPercentPerPage setting may be a pragmatic solution. Not perfect, I know...

v3g3t4x commented 4 years ago

OK, thanks. I will do some more tests and come back with more info.

Lonzak commented 4 years ago

Does your PDF contain text? We see the same behaviour depending on the fonts used. If those fonts are not embedded in the PDF, PDFBox relies on the system default fonts / fallback fonts, and those differ between Windows and Linux. Thus you get a result with minor differences. In the end we implemented our own calculateDifferences method (below a certain percentage we consider everything fine) because we didn't know about 'allowedDifferenceInPercentPerPage' back then...
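The percentage-threshold idea described here can be sketched with plain JDK classes. This is only an illustration of the approach, not the calculateDifferences method mentioned in the comment and not pdfcompare's own implementation; the class and method names (PixelDiffSketch, percentDifferent, withinTolerance) are hypothetical, and it assumes both pages have already been rendered to BufferedImage objects of the same size.

```java
import java.awt.image.BufferedImage;

// Hypothetical helper for illustration; not part of pdfcompare or PDFBox.
final class PixelDiffSketch {

    private PixelDiffSketch() {
    }

    /** Returns the percentage (0 to 100) of pixels whose ARGB values differ. */
    static double percentDifferent(BufferedImage expected, BufferedImage actual) {
        if (expected.getWidth() != actual.getWidth()
                || expected.getHeight() != actual.getHeight()) {
            throw new IllegalArgumentException("Images must have the same dimensions");
        }
        long differing = 0;
        for (int y = 0; y < expected.getHeight(); y++) {
            for (int x = 0; x < expected.getWidth(); x++) {
                // getRGB returns the pixel in the default ARGB color model.
                if (expected.getRGB(x, y) != actual.getRGB(x, y)) {
                    differing++;
                }
            }
        }
        long total = (long) expected.getWidth() * expected.getHeight();
        return 100.0 * differing / total;
    }

    /** Treats two pages as equal when at most allowedPercent of the pixels differ. */
    static boolean withinTolerance(BufferedImage expected, BufferedImage actual,
                                   double allowedPercent) {
        return percentDifferent(expected, actual) <= allowedPercent;
    }
}
```

A call such as PixelDiffSketch.withinTolerance(pageA, pageB, 0.2) would then accept pages where no more than 0.2 percent of the pixels differ; the threshold value is an arbitrary example and would need tuning per document type.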

finsterwalder commented 4 years ago

allowedDifferenceInPercentPerPage was added later, so depending on when you did that, it may not have been present yet...

v3g3t4x commented 4 years ago

Hi, I have some interesting news.

Here is some info:
1) My PDF contains text and pictures.
2) I found that the difference is not only between Windows and Linux; it also appears between Linux and a container.
And I found the reason (I think).

I wrote a simple jar that uses pdfcompare and takes two PDF files as input. If I run it from a Linux shell on pdf1 and pdf2, it finds no differences. If I run it in a Docker container on the same machine, PdfCompare finds differences between the same PDF files.

After some checks I noticed that the Linux machine uses:
java version "1.8.0_121"
Java(TM) SE Runtime Environment (build 1.8.0_121-b13)
Java HotSpot(TM) 64-Bit Server VM (build 25.121-b13, mixed mode)

But in the Docker container we have:
openjdk version "1.8.0_181"
OpenJDK Runtime Environment (build 1.8.0_181-8u181-b13-1~deb9u1-b13)
OpenJDK 64-Bit Server VM (build 25.181-b13, mixed mode)

Same major version (Java 8), but OpenJDK. On the server where everything works fine, I tried running the jar with an OpenJDK and differences were found. This means that the issue is 100% related to the different JVM.

Why does OpenJDK find differences that don't exist? This isn't a good situation. Any ideas?
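Since the result depends on which JVM runs the comparison, it can help to log the runtime details next to every comparison result. The following is a minimal sketch that reads only standard system properties; the class name JvmInfoSketch and the output format are made up for illustration.

```java
// Hypothetical helper for illustration; prints which JVM is actually running.
final class JvmInfoSketch {

    private JvmInfoSketch() {
    }

    /** Builds a one-line description of the running JVM and operating system. */
    static String describeJvm() {
        return String.join(" | ",
                "java.version=" + System.getProperty("java.version"),
                "java.vendor=" + System.getProperty("java.vendor"),
                "java.vm.name=" + System.getProperty("java.vm.name"),
                "java.vm.version=" + System.getProperty("java.vm.version"),
                "os.name=" + System.getProperty("os.name"));
    }
}
```

Printing JvmInfoSketch.describeJvm() at the start of a run makes it visible whether a result came from the Oracle HotSpot build or the OpenJDK build, which is the distinction this thread turns on.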

Lonzak commented 4 years ago

One is the Oracle JDK and the other is OpenJDK. This can of course lead to different results, especially in the rendering area. OpenJDK 8 has a few differences: it had other 2D, font rendering, serviceability/management, and crypto libraries, and that could cause performance differences and rendering issues. Only starting from Java 11 are both builds equal (except for minor differences).

v3g3t4x commented 4 years ago

OK, I understand, but this means that pdfCompare doesn't work with this version of OpenJDK. Correct?

Lonzak commented 4 years ago

No, it works fine, but it depends on what you compare your results with: if you compare two PDFs (rendered as images) created on the same platform, you should be fine. Only the inter-platform or inter-JDK comparison might produce slightly different results... But as Malte indicated, I would just add an allowedDifferenceInPercentPerPage factor and you should be fine...

finsterwalder commented 4 years ago

I have also noticed differences between different JDK versions, or between Oracle JDK and OpenJDK, before. In some details they are just not the same. What I find interesting is that both PDFs are of course rendered in the same JVM and still produce different results for some reason. That is surprising to me. Did you try an exact copy of the PDF, or do you use two PDFs?

Lonzak commented 4 years ago

Really? Same JVM, same platform and same PDF, and you got different results? I would find that strange, too. Except for different timestamps etc., this would be highly irregular...

v3g3t4x commented 4 years ago

Jar compiled with Oracle JVM 1.8
Same platform: Linux Red Hat
Same PDF files (PDF1.pdf and PDF2.pdf, both with one identical page)

If I run it with java -jar using the Oracle JVM, the PDFs compare without differences (which is correct). If I run it with java -jar using the OpenJDK JVM, the PDFs compare with some differences (but the files have the same content).

Do I need to build the jar with OpenJDK?

Lonzak commented 4 years ago

The bytecode of the classes in the jar should normally not be affected. So in your second scenario, using OpenJDK, you get two different results for the same PDF? What happens if you run it twice? Do you get 4 slightly different files, or 2x2 identical ones?

v3g3t4x commented 4 years ago

No, always the same result: with OpenJDK I get that PDF1 and PDF2 differ in a lot of pixels. In both scenarios I use the same set of PDF files: PDF1 and PDF2. PDF1 and PDF2 aren't the same file. They are two different files generated by two different systems, but they contain the same one-page invoice. With Oracle JDK, pdfcompare returns that result.equals is true; with OpenJDK it is false.

Lonzak commented 4 years ago

Ah OK, so there are two different PDF files (even though they contain the same content). As mentioned before, in Java 8 there were a lot of differences between OpenJDK and Oracle JDK. Can you try Java 11 and see how it behaves there? I bet the result is the same then...

v3g3t4x commented 4 years ago

OK, I will try if possible. Where can I find a list of the differences between these two JVMs? I am curious to understand where this test case differs between the two JVMs. Now I have built the project with OpenJDK and run it with OpenJDK, and the two PDFs are reported as different, in the same way. Why? The PDFs are equal... I don't understand.

Lonzak commented 4 years ago

There was a good site, but it seems to have gone offline... If you search the web you'll find lots of pages, e.g. https://jaxenter.com/oracle-jdk-builds-openjdk-builds-difference-149318.html

Lonzak commented 4 years ago

Did you look at the differences? What are they? As mentioned before, OpenJDK behaves differently especially in the rendering and font rendering area, and this seems to be the case for your PDF(s). Please try JDK 11 (Oracle and OpenJDK) and report the results...

v3g3t4x commented 4 years ago

OK, I will keep you updated. Now I am trying to check the differences between the data buffers generated from the PDFs with the two JVMs.

v3g3t4x commented 4 years ago

In our environment I can't install OpenJDK 11 but I have some additional info.

Oracle JVM 1.8 scenario: I said that PDF1 and PDF2 always compare as equal, but now I notice that if I use 1200 DPI instead of the default 300, pdfcompare finds some differences between PDF1 and PDF2 in this scenario too.

With OpenJDK this always occurs, at 300, 600, 1200 DPI, etc.

finsterwalder commented 4 years ago

Please make a copy of PDF1 and compare those two files. And do the same with PDF2. I'm curious whether you also get a difference when you compare two files that are really identical inside.

From what you describe I get the impression that it's really a slight variation in font rendering. That's the most likely explanation so far.

And regarding Java 11, there is an OpenJdk Docker container easily available on dockerhub.

v3g3t4x commented 4 years ago

Done. Comparing PDF1.pdf and PDF1_copy.pdf at 300, 600 and 1200 DPI, the result is always equal. The same goes for PDF2.pdf and PDF2_copy.pdf: no differences are ever found.

v3g3t4x commented 4 years ago

Additional info: in the cases where differences are found, I see that the differing pixels in the BufferedImage seem to be around some letters... One or two pixels are different. It seems that when converting a PDF to a BufferedImage, starting from different files with the same content, the rendering has tiny differences. Could that be? With Oracle JDK 1.8 this occurs only for DPI >= 600.
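One way to check where such stray pixels sit is to compute the bounding box of all differing pixels between the two rendered pages. The sketch below uses only JDK classes and assumes two BufferedImage objects of equal size; DiffBoundsSketch and differenceBounds are hypothetical names, not part of pdfcompare or PDFBox.

```java
import java.awt.Rectangle;
import java.awt.image.BufferedImage;

// Hypothetical helper for illustration; locates the region where two renderings differ.
final class DiffBoundsSketch {

    private DiffBoundsSketch() {
    }

    /** Returns the smallest rectangle containing every differing pixel, or null if none differ. */
    static Rectangle differenceBounds(BufferedImage first, BufferedImage second) {
        if (first.getWidth() != second.getWidth()
                || first.getHeight() != second.getHeight()) {
            throw new IllegalArgumentException("Images must have the same dimensions");
        }
        int minX = Integer.MAX_VALUE;
        int minY = Integer.MAX_VALUE;
        int maxX = -1;
        int maxY = -1;
        for (int y = 0; y < first.getHeight(); y++) {
            for (int x = 0; x < first.getWidth(); x++) {
                if (first.getRGB(x, y) != second.getRGB(x, y)) {
                    // Grow the box so that it includes this differing pixel.
                    minX = Math.min(minX, x);
                    minY = Math.min(minY, y);
                    maxX = Math.max(maxX, x);
                    maxY = Math.max(maxY, y);
                }
            }
        }
        if (maxX < 0) {
            return null; // no differing pixel was found
        }
        return new Rectangle(minX, minY, maxX - minX + 1, maxY - minY + 1);
    }
}
```

If the returned rectangle consistently hugs glyph edges, that would support the font-rendering explanation discussed in this thread; a box covering the whole page would point to something else.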

finsterwalder commented 4 years ago

Thanks. So the rendering of the two PDFs differs very slightly and this is detected. There is nothing I can do about that, since rendering is part of PdfBox. I will close this issue. Thanks to all for the thorough investigation!

Lonzak commented 4 years ago

To wrap this up: since the results differ between DPI values, it could be different rounding of numbers... But in the end, as mentioned before: try Java 11, or stay with Oracle JDK, or accept minimal differences, or ... There are a lot of potential solutions for this.