lcorbasson / pdfsizeopt

Automatically exported from code.google.com/p/pdfsizeopt

Corrupt jbig2 pages in output PDF #59

Open GoogleCodeExporter opened 8 years ago

GoogleCodeExporter commented 8 years ago
What steps will reproduce the problem?
1. Run pdfsizeopt.py Pages1-7.pdf on Windows with the default options and
you'll get the problem.

What is the expected output? What do you see instead?
I expect the pages to be viewable and compressed.  The attached PDF is what I 
see: blank pages with an error stating "Insufficient data for an image".

What version of the product are you using? On what operating system?
Latest from svn.  Windows7 32-bit.

Please provide any additional information below.
The attached log is the output of the run.  I'm also attaching the PDF file 
before compression and the PDF file after compression.  I also found another 
viewer (STDU Viewer) that partially decodes the output PDF file, so I'm 
attaching a screenshot of what it looks like.  Finally, I'm attaching my 
jbig2.exe, statically compiled with VS2010 from Adam Langley's source on GitHub.

Thanks,
Darren

Original issue reported on code.google.com by fdnc...@gmail.com on 26 Jun 2012 at 1:03

GoogleCodeExporter commented 8 years ago
Thank you for the detailed bug report. Based on the files 
image_optimze_error.txt and Pages1-7.psom.pdf you have uploaded I could figure 
out what's going wrong. I'm almost sure that I've identified an easy-to-fix bug 
in your jbig2.exe. Once you fix the bug, recompile jbig2.exe, and rerun 
pdfsizeopt, it will be fine.

On Windows it's possible to open files in either ASCII (text) or binary mode. 
ASCII is the default; you can get binary mode by passing ...|O_BINARY as the 
2nd argument of open(), by passing a string containing "b" (e.g. "rb" instead 
of "r"; "wb" instead of "w") as the 2nd argument of fopen(), or by calling 
setmode(1, O_BINARY) to put stdout into binary mode. If a file is opened in 
ASCII mode, then every write (e.g. write(...), putchar(...), fwrite(...), 
fprintf(...)) of "\n" (10) actually writes "\r\n" (13, 10) to the file.
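
The effect can be reproduced on any platform with a minimal Python sketch. It 
assumes that Python's newline="\r\n" text mode is a fair stand-in for the 
text-mode translation done by the Windows C runtime; the payload bytes, file 
names and function names are made up for illustration.

```python
import os
import tempfile

# Made-up binary payload (9 bytes) containing three LF (0x0A) bytes.
PAYLOAD = b"\x00\x0a\x97JB2\x0a\xff\x0a"


def write_text_mode(path, payload):
    # newline="\r\n" makes Python translate every LF to CR LF on write,
    # emulating Windows text mode on any OS. latin-1 maps each byte
    # 0..255 to exactly one character and back, so nothing else changes.
    with open(path, "w", encoding="latin-1", newline="\r\n") as f:
        f.write(payload.decode("latin-1"))


def write_binary_mode(path, payload):
    # "wb" writes the bytes unchanged (the equivalent of O_BINARY).
    with open(path, "wb") as f:
        f.write(payload)


def written_sizes(payload=PAYLOAD):
    # Return (text_mode_size, binary_mode_size) for the same payload.
    with tempfile.TemporaryDirectory() as tmp:
        text_path = os.path.join(tmp, "text.bin")
        binary_path = os.path.join(tmp, "binary.bin")
        write_text_mode(text_path, payload)
        write_binary_mode(binary_path, payload)
        return os.path.getsize(text_path), os.path.getsize(binary_path)


if __name__ == "__main__":
    text_size, binary_size = written_sizes()
    print("text mode: %d bytes, binary mode: %d bytes"
          % (text_size, binary_size))
```

The text-mode file comes out one byte longer per LF in the payload, which is 
the kind of growth that corrupts a compressed image stream.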

In our case, jbig2.exe writes the JBIG2-compressed image to its stdout, e.g. 
see the line

info: executing image optimizer jbig2: jbig2 -p pso.conv-3.sam2p-pr.png 
>pso.conv-3.jbig2

in the image_optimze_error.txt you have uploaded. The bug is that jbig2.exe 
writes to stdout in ASCII mode, but binary mode would be correct. It's easy to 
fix: please add setmode(1, O_BINARY) to the beginning of the main() function of 
jbig2.exe, recompile jbig2.exe, and rerun the optimization like this:

$ pdfsizeopt.py --use-pngout=no Pages1-7.pdf

Now Pages1-7.psom.pdf should be correct, and the JBIG2 file should be a few 
bytes shorter, as indicated on the console output. Old, incorrect:

info: optimized image XObject 3 file_name=pso.conv-3.jbig2 size=2109 (58%) 
methods=jbig2:2109,#orig:3637,pngout:6793,sam2p_np:7011,sam2p_pr:8586,gs:11056

New, correct:

info: optimized image XObject 3 file_name=pso.conv-3.jbig2 size=2102 (58%) 
methods=jbig2:2102,#orig:3637,sam2p_np:7011,sam2p_pr:8586,gs:11050

(Please note the 7-byte difference between 2109 and 2102 bytes.)

If this O_BINARY change doesn't fix the problem, then please upload the entire 
directory (containing the pso.* temporary files) ZIPped as an attachment to 
this issue. Also include the recompiled jbig2.exe you use, and the console 
output of pdfsizeopt.

To illustrate my point, I've modified a few bytes of Pages1-7.psom.pdf: I've 
removed the 7 extra \r characters (and added some padding after the obj to 
make the file size the same). This effectively fixed the image on page 2. So if 
you make jbig2.exe not emit the \r characters, most probably the whole PDF 
will be fixed.
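
The byte-level part of that manual repair can be sketched in Python as 
follows. This is only an illustration: the function name and the sample bytes 
are hypothetical, the space used as padding is an assumption, and the sketch 
is valid only if the whole stream really went through the ASCII-mode 
translation (applied to a correct stream it would corrupt it).

```python
def undo_text_mode(damaged, pad_byte=b" "):
    # Every LF in ASCII-mode output is preceded by an inserted CR, so
    # collapsing each CR LF pair back to LF reverses the translation.
    repaired = damaged.replace(b"\r\n", b"\n")
    removed = len(damaged) - len(repaired)
    # Padding of the same length as the removed bytes, to be placed
    # after the object so that the total file size stays unchanged.
    padding = pad_byte * removed
    return repaired, padding


if __name__ == "__main__":
    sample = b"\x01\r\n\x02\x03\r\n\x04"  # made-up damaged stream
    repaired, padding = undo_text_mode(sample)
    print(len(sample), len(repaired), len(padding))
```

Updating anything else in the PDF that depends on the stream length is outside 
the scope of this sketch.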

If you manage to fix jbig2.exe, please upload it as an attachment to this 
issue, so others would also benefit.

Original comment by pts...@gmail.com on 26 Jun 2012 at 9:10

GoogleCodeExporter commented 8 years ago
That fixed it.  Thanks for all your help!!!

Attached is my vs2010 compiled jbig2.exe and all the source code in case 
someone else wants to compile it.

Original comment by fdnc...@gmail.com on 27 Jun 2012 at 1:05

GoogleCodeExporter commented 8 years ago
Thank you for sharing your jbig2.exe and your source tree.

jbig2.exe was one of the missing dependencies of pdfsizeopt on Windows. Today I 
compiled the remaining few dependencies, so now pdfsizeopt is officially 
available on Windows, and it's easier to install than ever. If you're 
interested, please check out the new installation page at 
http://code.google.com/p/pdfsizeopt/wiki/InstallationInstructionWindows .

It would be very useful if you could upload all the library dependencies of 
jbig2enc_20120627.zip, including the URLs where you downloaded them from, and 
a .cmd file which compiles all the dependencies from scratch. Then we could 
tell a future developer to install Visual Studio, download and extract a .zip 
file, run a .cmd file, and wait for jbig2.exe to be built automatically.

Original comment by pts...@gmail.com on 28 Jun 2012 at 2:20

GoogleCodeExporter commented 8 years ago
Hey, glad I could help.

I followed the instructions at
http://tpgit.github.com/UnOfficialLeptDocs/leptonica/README.html#building-on-windows
to compile Leptonica (http://leptonica.com/) and download the dependencies.

I think you can just download the dependencies
(http://leptonica.org/source/leptonica-1.68-win32-lib-include-dirs.zip) and
put everything in the right place to compile the jbig2 encoder.  I may have
done that.  I can't remember. ;)

Darren

Original comment by fdnc...@gmail.com on 9 Jul 2012 at 6:47

GoogleCodeExporter commented 8 years ago
This is what I get when I run your new Windows version.

C:\Users\x991808\Desktop\pdfsizeopt_win32bin>pdfsizeopt.exe 000000.PDF
info: This is pdfsizeopt.py rUNKNOWN size=309327.
info: loading PDF from: 000000.PDF
info: loaded PDF of 515655 bytes
info: separated to 26 objs + xref + trailer
info: found 0 Type1 fonts loaded
info: found 0 Type1C fonts loaded
info: eliminated 2 unused objs in 2 classes
info: saving PDF with 24 objs with Multivalent to: 000000.psom.pdf
info: writing Multivalent input PDF: pso.conv.mi.tmp.pdf
info: generated object stream of 529 bytes in 21 objects (14%)
info: written 513629 bytes to Multivalent input PDF: pso.conv.mi.tmp.pdf
error: Multivalent.jar not found. Make sure it is on the $PATH, or it is
one of the files on the $CLASSPATH.
Traceback (most recent call last):
  File ".\pdfsizeopt.py", line 7698, in <module>
    main(sys.argv)
  File ".\pdfsizeopt.py", line 7694, in main
    may_obj_heads_contain_comments=may_obj_heads_contain_comments)
  File ".\pdfsizeopt.py", line 7425, in Save
    may_obj_heads_contain_comments=may_obj_heads_contain_comments)
  File ".\pdfsizeopt.py", line 7322, in _RunMultivalent
    assert 0, 'Multivalent.jar not found, see above'
AssertionError: Multivalent.jar not found, see above

Original comment by fdnc...@gmail.com on 9 Jul 2012 at 6:56

GoogleCodeExporter commented 8 years ago
AssertionError: Multivalent.jar not found, see above

Did you follow the installation instructions? Did you download the newest 
pdfsizeopt.py (its size is 313571)? If that still doesn't fix the problem, 
please copy-paste the output of

  dir /s C:\Users\x991808\Desktop\pdfsizeopt_win32bin
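
A quick check along the following lines may also help narrow things down. It 
is only a Python sketch of the lookup that the error message describes (a file 
in one of the PATH directories, or a CLASSPATH entry that is the file itself), 
not pdfsizeopt's actual search code; the function name find_jar is 
hypothetical.

```python
import os


def find_jar(name="Multivalent.jar", environ=None):
    environ = os.environ if environ is None else environ
    # 1. Look for the file inside each directory listed on PATH.
    for directory in environ.get("PATH", "").split(os.pathsep):
        candidate = os.path.join(directory, name)
        if directory and os.path.isfile(candidate):
            return candidate
    # 2. Look for a CLASSPATH entry that is the jar file itself.
    #    normcase makes the comparison case-insensitive on Windows.
    wanted = os.path.normcase(name)
    for entry in environ.get("CLASSPATH", "").split(os.pathsep):
        if (os.path.normcase(os.path.basename(entry)) == wanted
                and os.path.isfile(entry)):
            return entry
    return None


if __name__ == "__main__":
    print(find_jar() or "Multivalent.jar not found on PATH or CLASSPATH")
```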

Original comment by pts...@gmail.com on 9 Jul 2012 at 8:14

GoogleCodeExporter commented 8 years ago
Yes, I followed the instructions but I tried again this morning (re-doing all 
the instructions) and everything is working fine now.  Running a massive PDF to 
test at the moment.  So far so good.  I just wish there was a way to speed up 
pngout.  That thing takes forever.

Original comment by fdnc...@gmail.com on 10 Jul 2012 at 2:29

GoogleCodeExporter commented 8 years ago
One last thing you should add is the msvcr100.dll since I compiled jbig2.exe 
with vs2010.  Here's mine.

Original comment by fdnc...@gmail.com on 10 Jul 2012 at 2:57

GoogleCodeExporter commented 8 years ago
About pngout: you can use --use-pngout=no . There is a speed vs size tradeoff 
here. pngout is slow, but its output is small.

Original comment by pts...@gmail.com on 10 Jul 2012 at 3:08

GoogleCodeExporter commented 8 years ago
Based on the information you have provided, I managed to compile a jbig2.exe 
(see it attached) suitable for use with pdfsizeopt. I compiled it using MinGW 
(cross-compiling on Linux), so it doesn't need msvcr100.dll . (I also removed 
the attached msvcr100.dll to avoid copyright issues in the future.)

In the near future, I'll release this new jbig2.exe so it will be used by 
default with pdfsizeopt on Windows.

FYI My jbig2.exe is noticeably smaller than yours, because I removed many 
unnecessary functions from the leptonica library (editing .c files by hand), 
and I also removed a few command-line flags which pdfsizeopt doesn't need.

Thank you very much for your help providing patches and compilation 
instructions; it helped me a lot in understanding jbig2 on Windows and 
preparing my own version.

Original comment by pts...@gmail.com on 11 Jul 2012 at 12:51

GoogleCodeExporter commented 8 years ago
Excellent!  Glad to hear you were able to get it compiled.  It wasn't trivial 
in VS2010 for me, but MinGW is probably the easier choice, especially if you're 
used to Linux/gcc.  Sorry I wasn't able to provide the batch file you 
requested.  Just too much going on right now to mess with it.

You might want to try out this alternate version of JBIG2Enc 
https://github.com/zdenop/jbig2enc/tree/R.Hatlapatka.  It's supposed to have 
better autothresholding which I interpret to mean better compression on some 
images assuming the thresholding works.  I haven't tried it yet.

BTW - I tried --use-pngout=no on my 146MB PDF file.  It took 20 minutes 
instead of 2.5 hours and the file sizes were identical.  So pngout doesn't seem 
to help unless you have color images.  My test file was all CCITTFaxDecode, so 
maybe if you see that filter (which is always bitonal) you shouldn't call 
pngout?  Just an idea to save time.
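
The idea could be sketched in Python roughly as follows. The function name and 
the filter set are hypothetical and do not refer to pdfsizeopt's actual 
internals; the sketch assumes the image's /Filter name is available as a 
string, and it also lists /JBIG2Decode on the assumption that it should be 
treated the same way, since JBIG2 is a bitonal format too.

```python
# PDF image filters that only encode 1-bit (bitonal) images.
BITONAL_FILTERS = frozenset(["/CCITTFaxDecode", "/JBIG2Decode"])


def should_try_pngout(filter_name, use_pngout=True):
    # Respect an explicit --use-pngout=no style setting first.
    if not use_pngout:
        return False
    # Skip the slow optimizer for images stored with a bitonal filter.
    return filter_name not in BITONAL_FILTERS


if __name__ == "__main__":
    for name in ("/CCITTFaxDecode", "/FlateDecode"):
        print(name, should_try_pngout(name))
```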

Original comment by fdnc...@gmail.com on 11 Jul 2012 at 5:23