Small difference in compression ratio

rmast commented 2 years ago

See my post after the closed https://github.com/internetarchive/archive-pdf-tools/issues/30

There is a small difference in compression ratio between your setup and mine. Could that signal a memory leak, or some other difference in setup?

MerlijnWajer commented 2 years ago

Thanks for the detailed explanation in the other issue. I hope it wasn't too painful to get it all going. On Ubuntu I got jbig2enc from https://notesalexp.org/packages/en/bionic/amd64/jbig2enc/download.html; on Gentoo, my main system, it's just in the package manager. Going forward I will try to make a self-contained PyInstaller binary so that the program is easier to use on Linux (or even Windows/OS X).

Differences in compression ratio might not signal a problem. There could be a few reasons:

Different Kakadu versions compress slightly differently (maybe)
Different jbig2enc versions compress slightly differently (I currently use 0.28, not the newer 0.29; I just haven't upgraded yet)
Different compression parameters (unlikely)
Different PDF metadata (again unlikely, since you don't specify any)
Slightly different hOCR input / text data (again unlikely)

I just upgraded my jbig2enc to 0.29 and it doesn't make a difference. If you can share your output PDF with me I can compare. I used the files from your Documents.zip.

MerlijnWajer commented 2 years ago

It is very unlikely that the problem is a memory leak, for what it's worth.

MerlijnWajer commented 2 years ago

outfa.pdf

Here is the PDF I get, for what it is worth.

rmast commented 2 years ago

The exact download of Kakadu I used is specified in my commands. I'll compare the contained jp2-images when I get the chance.

rmast commented 2 years ago

The relevant part of Kakadu consists of kdu_compress, kdu_extract and one shared library, which you can find with

ldd `which kdu_extract`

MerlijnWajer commented 2 years ago

I tried Kakadu 8.0.5 instead of the 8.0.3 I had, and the result is the same: I get the same PDF. The only difference is the Kakadu version encoded in the JPEG2000, XMP metadata and PDF IDs; the compression ratio is still 39.780225.

Maybe it's just a floating point thing. If you can share the PDF I can diff, or look at the differences between the one I shared and yours. diff -a foo.pdf bar.pdf might help. You can also use the pdfimagesmrc tool that comes with this software, but by default it rounds off the image sizes to two digits.
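A byte-level comparison of this kind could also be sketched in Python. This is only a minimal sketch and not part of archive-pdf-tools; `diff_ranges` is a hypothetical helper name. It compares offsets one to one, so an inserted or removed byte makes everything after it look different; it is most informative when the two inputs have the same length.

```python
def diff_ranges(a: bytes, b: bytes):
    """Return (start, end) offset pairs, end exclusive, where a and b differ."""
    n = min(len(a), len(b))
    ranges = []
    start = None
    for i in range(n):
        if a[i] != b[i]:
            if start is None:
                start = i  # a differing run begins here
        elif start is not None:
            ranges.append((start, i))  # the differing run ended before i
            start = None
    if start is not None:
        ranges.append((start, n))
    if len(a) != len(b):
        # The tail that exists in only one of the inputs counts as a difference.
        ranges.append((n, max(len(a), len(b))))
    return ranges

# Hypothetical usage on two files:
#   from pathlib import Path
#   diff_ranges(Path("foo.pdf").read_bytes(), Path("bar.pdf").read_bytes())
```

A handful of short ranges would be consistent with differing IDs or version strings, while long ranges point at differing image streams.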

rmast commented 2 years ago

This is the pdf from the build I made: outfa.pdf

rmast commented 2 years ago

There are differences between your file and mine in the size of the pictures:

oem@Robert:~/vergelijk$ pdfimages -all ../outfa-Merlijn.pdf Merlijn
oem@Robert:~/vergelijk$ pdfimages -all ../outfa.pdf mijn
oem@Robert:~/vergelijk$ ls -al
totaal 58736
drwxrwxr-x  2 oem oem     4096 nov 30 19:33 .
drwxr-xr-x 22 oem oem     4096 nov 30 19:30 ..
-rw-rw-r--  1 oem oem      457 nov 30 19:32 Merlijn-000.jp2
-rw-rw-r--  1 oem oem     6248 nov 30 19:32 Merlijn-001.jp2
-rw-rw-r--  1 oem oem     6077 nov 30 19:32 Merlijn-002.jb2e
-rw-rw-r--  1 oem oem      538 nov 30 19:32 mijn-000.jp2
-rw-rw-r--  1 oem oem     6241 nov 30 19:33 mijn-001.jp2
-rw-rw-r--  1 oem oem     6077 nov 30 19:33 mijn-002.jb2e
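In the listing above, the 000 images differ by 81 bytes (457 vs 538), the 001 images by 7 bytes (6248 vs 6241), and the two jb2e files have the same size (6077). Equal size does not prove equal content, so a byte-for-byte check is a useful follow-up. The following is a minimal sketch, assuming the `prefix-NNN.ext` naming that `pdfimages -all` produced here; `compare_extracted` is a hypothetical helper name.

```python
import re
from pathlib import Path

def compare_extracted(directory, prefix_a, prefix_b):
    """Pair files such as Merlijn-000.jp2 / mijn-000.jp2 and compare them.

    Returns a list of (suffix, size_a, size_b, identical) tuples.
    """
    directory = Path(directory)
    pattern = re.compile(r"-(\d+)\.(\w+)$")

    def index(prefix):
        # Map (image number, extension) to the file path for one prefix.
        found = {}
        for path in directory.glob(prefix + "-*"):
            m = pattern.search(path.name)
            if m:
                found[(m.group(1), m.group(2))] = path
        return found

    a, b = index(prefix_a), index(prefix_b)
    rows = []
    for number, ext in sorted(a.keys() & b.keys()):
        data_a = a[(number, ext)].read_bytes()
        data_b = b[(number, ext)].read_bytes()
        rows.append((f"{number}.{ext}", len(data_a), len(data_b), data_a == data_b))
    return rows

# Hypothetical usage in the directory shown above:
#   for row in compare_extracted(".", "Merlijn", "mijn"):
#       print(row)
```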

MerlijnWajer commented 2 years ago

Well, it looks like they are clearly different. It is possible that the Sauvola binarisation code produces slightly different masks (since it's compiled with -Ofast), but then I would have expected the jb2e to be a different size, which it is not. The other Cython code uses ints only.

The current code also has an option to dump these items to a directory when creating the PDF (--out-dir), but it only stores the compressed JPEG2000 files, which is not useful, since we can just get those from the PDF. I will have to change the code to also dump the files as PNG (or similar) so that we can see whether the files differ before being encoded to JPEG2000.
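Once such pre-encoding dumps exist, comparing them comes down to counting differing pixels. The following is a minimal sketch, assuming the two images have already been decoded into flat sequences of integer pixel values of equal length (for example with an image library of your choice); `pixel_diff_stats` is a hypothetical helper name, not something archive-pdf-tools provides.

```python
def pixel_diff_stats(pixels_a, pixels_b):
    """Summarise how two equally sized flat pixel sequences differ."""
    if len(pixels_a) != len(pixels_b):
        raise ValueError("images must have the same number of pixels")
    # Absolute differences, kept only for the pixels that actually differ.
    diffs = [abs(x - y) for x, y in zip(pixels_a, pixels_b) if x != y]
    return {
        "total": len(pixels_a),
        "differing": len(diffs),
        "max_abs_diff": max(diffs, default=0),
    }
```

A result with zero differing pixels would point at the encoder rather than the binarisation or image-splitting step as the source of the size difference.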

internetarchive / archive-pdf-tools

Small difference in compressionratio #31