internetarchive / archive-pdf-tools

Fast PDF generation and compression. Deals with millions of pages daily.
https://archive-pdf-tools.readthedocs.io/en/latest/
GNU Affero General Public License v3.0

Small difference in compression ratio #31

Open rmast opened 2 years ago

rmast commented 2 years ago

See my post on the closed issue https://github.com/internetarchive/archive-pdf-tools/issues/30

There is a small difference in compression ratio between your setup and mine. Could that signal a memory leak, or some other difference in setup?

MerlijnWajer commented 2 years ago

Thanks for the detailed explanation in the other issue. I hope it wasn't too painful to get it all going (on Ubuntu I got jbig2enc from https://notesalexp.org/packages/en/bionic/amd64/jbig2enc/download.html; on Gentoo, my main system, it's just in the package manager). Going forward I will try to make a self-contained PyInstaller binary so that the program is easier to use on Linux (or even Windows/OS X).

Differences in compression ratio might not signal a problem. There could be a few reasons:

I just upgraded my jbig2enc to 0.29 and it doesn't make a difference. If you can share your output PDF with me I can compare. I used the files from your Documents.zip.

MerlijnWajer commented 2 years ago

It is very unlikely that the problem is a memory leak, for what it's worth.

MerlijnWajer commented 2 years ago

outfa.pdf

Here is the PDF I get, for what it is worth.

rmast commented 2 years ago

The exact download of Kakadu I used is specified in my commands. I'll compare the contained jp2-images when I get the chance.

rmast commented 2 years ago

The relevant part of Kakadu consists of kdu_compress, kdu_extract and one shared library, which you can find with:

ldd `which kdu_extract`

MerlijnWajer commented 2 years ago

I tried Kakadu 8.0.5 instead of the 8.0.3 that I had, and the result is the same: I get the same PDF. The only differences are the Kakadu version encoded in the JPEG2000, the XMP metadata and the PDF IDs. The compression ratio is still 39.780225.

Maybe it's just a floating point thing. If you can share the PDF I can diff, or look at the differences between the one I shared and yours. diff -a foo.pdf bar.pdf might help. You can also use the pdfimagesmrc tool that comes with this software, but by default it rounds off the image sizes to two digits.
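As a complement to diff -a, a byte-level comparison could be scripted roughly as follows. This is only a sketch: `compare_bytes` is a hypothetical helper, not part of archive-pdf-tools. It compares bytes position by position, so once one file has an inserted or removed byte every later offset shifts, and the count is only meaningful for files (or extracted streams) of the same length.

```python
import hashlib


def compare_bytes(a: bytes, b: bytes) -> dict:
    """Summarise how two byte strings differ (hypothetical helper)."""
    common = min(len(a), len(b))
    # Offsets within the shared length where the bytes differ.
    differing = [i for i in range(common) if a[i] != b[i]]
    if differing:
        first_diff = differing[0]
    elif len(a) != len(b):
        # Identical up to the end of the shorter input.
        first_diff = common
    else:
        first_diff = None
    return {
        "same": a == b,
        "size_a": len(a),
        "size_b": len(b),
        "first_diff": first_diff,
        # Mismatched positions plus the bytes only one input has.
        "differing_bytes": len(differing) + abs(len(a) - len(b)),
        "sha256_a": hashlib.sha256(a).hexdigest(),
        "sha256_b": hashlib.sha256(b).hexdigest(),
    }
```

To use it on two PDFs, read them with `pathlib.Path("foo.pdf").read_bytes()` and `pathlib.Path("bar.pdf").read_bytes()` and pass both results in; the file names are placeholders.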

rmast commented 2 years ago

This is the pdf from the build I made: outfa.pdf

rmast commented 2 years ago

There are differences between your file and mine in the size of the pictures:

oem@Robert:~/vergelijk$ pdfimages -all ../outfa-Merlijn.pdf Merlijn
oem@Robert:~/vergelijk$ pdfimages -all ../outfa.pdf mijn
oem@Robert:~/vergelijk$ ls -al
totaal 58736
drwxrwxr-x  2 oem oem     4096 nov 30 19:33 .
drwxr-xr-x 22 oem oem     4096 nov 30 19:30 ..
-rw-rw-r--  1 oem oem      457 nov 30 19:32 Merlijn-000.jp2
-rw-rw-r--  1 oem oem     6248 nov 30 19:32 Merlijn-001.jp2
-rw-rw-r--  1 oem oem     6077 nov 30 19:32 Merlijn-002.jb2e
-rw-rw-r--  1 oem oem      538 nov 30 19:32 mijn-000.jp2
-rw-rw-r--  1 oem oem     6241 nov 30 19:33 mijn-001.jp2
-rw-rw-r--  1 oem oem     6077 nov 30 19:33 mijn-002.jb2e

MerlijnWajer commented 2 years ago

Well, it looks like they are clearly different. It is possible that the sauvola binarisation code results in slightly different masks (since it's compiled with -Ofast), but then I would have expected the jb2e to be of a different size, which it is not. The other Cython code uses ints only.
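To make the per-image differences explicit, the sizes from the listing above can be paired up by image index. The sketch below assumes the `<prefix>-NNN.<ext>` naming that pdfimages produced in that listing; `size_deltas` is an illustrative helper, and the sizes are copied by hand from the listing rather than read from disk.

```python
import re
from collections import defaultdict

# Sizes in bytes, copied from the `ls -al` listing above.
listing = {
    "Merlijn-000.jp2": 457,
    "Merlijn-001.jp2": 6248,
    "Merlijn-002.jb2e": 6077,
    "mijn-000.jp2": 538,
    "mijn-001.jp2": 6241,
    "mijn-002.jb2e": 6077,
}


def size_deltas(sizes, prefix_a, prefix_b):
    """Pair '<prefix>-NNN.<ext>' files by index and extension.

    Returns rows of (image, size_a, size_b, size_b - size_a).
    """
    pattern = re.compile(r"^(?P<prefix>.+)-(?P<index>\d{3})\.(?P<ext>\w+)$")
    by_key = defaultdict(dict)
    for name, size in sizes.items():
        match = pattern.match(name)
        if match:
            by_key[(match["index"], match["ext"])][match["prefix"]] = size
    rows = []
    for index, ext in sorted(by_key):
        entry = by_key[(index, ext)]
        if prefix_a in entry and prefix_b in entry:
            a, b = entry[prefix_a], entry[prefix_b]
            rows.append((f"{index}.{ext}", a, b, b - a))
    return rows


for image, a, b, delta in size_deltas(listing, "Merlijn", "mijn"):
    print(f"{image}: {a} vs {b} ({delta:+d} bytes)")
```

With the numbers from the listing, this shows the two JPEG2000 images differing by +81 and -7 bytes, while the JBIG2 mask has a delta of 0.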

The current code also has an option to dump these items to a directory when creating the PDF (--out-dir), but it only stores the compressed JPEG2000 files, so that's not useful, since we can just get those from the PDF. I will have to change the code to also dump the files as png (or similar) so that we can see if the files are different before being encoded to JPEG2000.
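Once the pre-encoding images can be dumped, checking whether they differ could look roughly like the sketch below. It assumes both images have already been decoded into raw 8-bit sample buffers of equal size by some other tool; that decoding step is not shown, and `sample_diff_stats` and the tiny tiles are made up for illustration.

```python
def sample_diff_stats(a: bytes, b: bytes) -> dict:
    """Compare two equally sized buffers of 8-bit samples (hypothetical helper)."""
    if len(a) != len(b):
        raise ValueError("buffers must have the same length")
    # Absolute difference for every pair of samples.
    diffs = [abs(x - y) for x, y in zip(a, b)]
    return {
        "samples": len(diffs),
        "changed": sum(1 for d in diffs if d),
        "max_abs_diff": max(diffs, default=0),
    }


# Hypothetical 2x2 grayscale tiles, purely for illustration.
tile_a = bytes([10, 20, 30, 40])
tile_b = bytes([10, 21, 30, 37])
print(sample_diff_stats(tile_a, tile_b))
```

If `changed` is 0 for the dumped images while the encoded JPEG2000 files still differ in size, the difference would have to come from the encoding step rather than from the mask or image generation.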