Open rmast opened 2 years ago
Thanks for the detailed explanation in the other issue, hope it wasn't too painful to get it all going (on Ubuntu I got jbig2enc from https://notesalexp.org/packages/en/bionic/amd64/jbig2enc/download.html; on Gentoo, my main system, it's just in the package manager). Going forward I will try to make a self-contained PyInstaller binary so that it will be easier to use the program on Linux (or even Windows/OS X).
Differences in compression ratio might not signal a problem. There could be a few reasons:
I just upgraded my jbig2enc to 0.29 and it doesn't make a difference. If you can share your output PDF with me I can compare. I used the files from your Documents.zip.
It is very unlikely that the problem is a memory leak, for what it's worth.
Here is the PDF I get, for what it is worth.
The exact download of Kakadu I used is specified in my commands. I'll compare the contained jp2-images when I get the chance.
The relevant part of Kakadu consists of kdu_compress, kdu_extract and one shared library you can find with
ldd `which kdu_extract`
I tried Kakadu 8.0.5 as opposed to the 8.0.3 that I had and the result is the same: I get the same PDF. The only differences are the Kakadu version encoded in the JPEG2000, the XMP metadata and the PDF IDs. The compression ratio is still 39.780225.
Maybe it's just a floating point thing. If you can share the PDF I can diff, or look at the differences between the one I shared and yours. diff -a foo.pdf bar.pdf might help. You can also use the pdfimagesmrc tool that comes with this software, but by default it rounds off the image sizes to two digits.
There are differences between your file and mine in the size of the pictures:
oem@Robert:~/vergelijk$ pdfimages -all ../outfa-Merlijn.pdf Merlijn
oem@Robert:~/vergelijk$ pdfimages -all ../outfa.pdf mijn
oem@Robert:~/vergelijk$ ls -al
totaal 58736
drwxrwxr-x 2 oem oem 4096 nov 30 19:33 .
drwxr-xr-x 22 oem oem 4096 nov 30 19:30 ..
-rw-rw-r-- 1 oem oem 457 nov 30 19:32 Merlijn-000.jp2
-rw-rw-r-- 1 oem oem 6248 nov 30 19:32 Merlijn-001.jp2
-rw-rw-r-- 1 oem oem 6077 nov 30 19:32 Merlijn-002.jb2e
-rw-rw-r-- 1 oem oem 538 nov 30 19:32 mijn-000.jp2
-rw-rw-r-- 1 oem oem 6241 nov 30 19:33 mijn-001.jp2
-rw-rw-r-- 1 oem oem 6077 nov 30 19:33 mijn-002.jb2e
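The size comparison in the listing above can also be done per file pair, with a hash check so that equal-sized files (such as the two .jb2e masks) are confirmed to be byte-identical rather than only the same length. This is a minimal sketch and not part of archive-pdf-tools; the function names `sha256_of` and `compare_extracted` are hypothetical, and it assumes the files were extracted with `pdfimages -all` into one directory using the two prefixes shown above.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Return the SHA-256 hex digest of a file's contents."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def compare_extracted(directory, prefix_a, prefix_b):
    """Pair files like 'Merlijn-000.jp2' with 'mijn-000.jp2' and compare them.

    Returns a list of (suffix, size_a, size_b, identical) tuples, where
    size_b is None when the counterpart file is missing.
    """
    directory = Path(directory)
    rows = []
    for file_a in sorted(directory.glob(f"{prefix_a}-*")):
        suffix = file_a.name[len(prefix_a):]  # e.g. "-000.jp2"
        file_b = directory / f"{prefix_b}{suffix}"
        size_a = file_a.stat().st_size
        if not file_b.exists():
            rows.append((suffix, size_a, None, False))
            continue
        identical = sha256_of(file_a) == sha256_of(file_b)
        rows.append((suffix, size_a, file_b.stat().st_size, identical))
    return rows


if __name__ == "__main__":
    # Prefixes taken from the pdfimages commands above.
    for suffix, size_a, size_b, identical in compare_extracted(".", "Merlijn", "mijn"):
        status = "identical" if identical else "DIFFERENT"
        print(f"{suffix}: {size_a} vs {size_b} bytes, {status}")
```

Run from the directory holding the extracted files, it prints one line per pair, which makes it easy to see whether only the JPEG2000 layers differ.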
Well, it looks like they are clearly different. It is possible that the sauvola binarisation code results in slightly different masks (since it's compiled with -Ofast), but then I would have expected the jb2e to be of a different size, which it is not. The other Cython code uses ints only.
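To illustrate why -Ofast is a plausible suspect in general: in GCC it enables -ffast-math, which lets the compiler reorder floating point operations, and floating point addition is not associative. The sketch below is not the sauvola code from this project; the helper names and the toy threshold are made up purely to show how a reordered sum can flip a single mask pixel.

```python
def sum_left_to_right(values):
    """Accumulate floats in the order given."""
    total = 0.0
    for value in values:
        total += value
    return total


def sum_right_to_left(values):
    """Accumulate the same floats in the opposite order."""
    return sum_left_to_right(list(reversed(values)))


def is_foreground(pixel, threshold):
    """Toy binarisation rule (hypothetical): a pixel at or above the threshold is foreground."""
    return pixel >= threshold


terms = [0.1, 0.2, 0.3]
threshold_a = sum_left_to_right(terms)   # (0.1 + 0.2) + 0.3
threshold_b = sum_right_to_left(terms)   # (0.3 + 0.2) + 0.1

print(threshold_a == threshold_b)        # the two orderings disagree in the last bit
print(is_foreground(0.6, threshold_a), is_foreground(0.6, threshold_b))
```

A pixel sitting exactly on the threshold ends up on different sides depending on the summation order. In this case the identical jb2e sizes suggest the masks did not actually change, so this only shows the mechanism, not a diagnosis.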
The current code also has an option to dump these items to a directory when creating the PDF (--out-dir), but it only stores the compressed JPEG2000 files, so that's not useful, since we can just get those from the PDF. I will have to change the code to also dump the files as png (or similar) so that we can see if the files are different before being encoded to JPEG2000.
See my post after the closed https://github.com/internetarchive/archive-pdf-tools/issues/30
There is a small compression-ratio difference between your setup and mine. Could that signal a memory leak, or some other difference in setup?