pts / pdfsizeopt

PDF file size optimizer
GNU General Public License v2.0

Fails with out-of-memory for a very-very large pdf file #125

Open LudeeD opened 5 years ago

LudeeD commented 5 years ago

I have a PDF file that is 1.3 GB in size (it's a master's thesis, which is why I am not attaching it here). Okular handles it pretty well, but it crashes Adobe. pdfsizeopt crashes on it too, with a memory error:

info: This is pdfsizeopt ZIP rUNKNOWN size=69734.
info: prepending to PATH: /home/ludee/Programs/pdfsizeopt/pdfsizeopt_libexec
info: loading PDF from: /home/ludee/Desktop/Dissertação_Ana_Antunes_201405897.pdf
info: loaded PDF of 1322590721 bytes
info: separated to 2269032 objs + xref + trailer
Traceback (most recent call last):
  File "/proc/self/exe/runpy.py", line 162, in _run_module_as_main
  File "/proc/self/exe/runpy.py", line 72, in _run_code
  File "./pdfsizeopt.single/__main__.py", line 1, in <module>
  File "./pdfsizeopt.single/m.py", line 6, in <module>
  File "./pdfsizeopt.single/pdfsizeopt/main.py", line 5622, in main
  File "./pdfsizeopt.single/pdfsizeopt/main.py", line 2664, in Load
  File "./pdfsizeopt.single/pdfsizeopt/main.py", line 689, in __init__
  File "./pdfsizeopt.single/pdfsizeopt/main.py", line 942, in Get
  File "./pdfsizeopt.single/pdfsizeopt/main.py", line 1217, in ParseDict
  File "./pdfsizeopt.single/pdfsizeopt/main.py", line 1148, in ParseSimpleValue
MemoryError

zvezdochiot commented 5 years ago

@LudeeD say> I have a pdf file that is 1.3 Gb in size

More information please:

pdfinfo /home/ludee/Desktop/Dissertação_Ana_Antunes_201405897.pdf

And see https://github.com/pts/pdfsizeopt/issues/119

LudeeD commented 5 years ago

More info

Title:          
Subject:        
Keywords:       
Author:         
Creator:        LaTeX with hyperref
Producer:       pdfTeX-1.40.19
CreationDate:   Sun Jun 30 21:11:45 2019 WEST
ModDate:        Sun Jun 30 21:11:45 2019 WEST
Tagged:         no
UserProperties: no
Suspects:       no
Form:           none
JavaScript:     no
Pages:          308
Encrypted:      no
Page size:      595.276 x 841.89 pts (A4)
Page rot:       0
File size:      1322590721 bytes
Optimized:      no
PDF version:    1.5

Following the instructions in #119, cpdf also failed with an out-of-memory error:

Initial file size is 1322590721 bytes
Beginning squeeze: 2269033 objects
Fatal error: out of memory.

zvezdochiot commented 5 years ago

@LudeeD say> Pages: 308, File size: 1322590721 bytes

1322590721/308 ≈ 4294126 bytes/page (about 4.1 MiB per page). That is very large!
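As a rough sanity check, the arithmetic can be sketched in Python. It uses only the figures posted in this thread (file size and page count from pdfinfo, object count from the pdfsizeopt log); the variable names are illustrative. The averages say nothing about how the data is actually distributed across pages.

```python
# Figures reported earlier in this thread (pdfinfo output and pdfsizeopt log).
file_size_bytes = 1322590721
page_count = 308
object_count = 2269032

# Simple averages; real pages and objects can deviate a lot from these.
bytes_per_page = file_size_bytes / page_count
bytes_per_object = file_size_bytes / object_count
objects_per_page = object_count / page_count

print(f"{bytes_per_page:,.0f} bytes/page ({bytes_per_page / 2**20:.1f} MiB/page)")
print(f"{bytes_per_object:,.0f} bytes/object on average")
print(f"{objects_per_page:,.0f} objects/page on average")
```

The large number of objects per page suggests that the object count, and not only the raw file size, may contribute to the memory pressure, though that is a guess from the averages alone.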

You can convert the images from /FlateDecode (lossless, similar to PNG) to /DCTDecode (lossy, similar to JPEG) using Ghostscript:

ps2pdf /home/ludee/Desktop/Dissertação_Ana_Antunes_201405897.pdf /home/ludee/Desktop/Dissertação_Ana_Antunes_201405897.gs.pdf

LudeeD commented 5 years ago

After it had been running for 3 hours, I gave up on this. I rebuilt the PDF with compressed versions of the images, and now it's a more reasonable size.

Feel free to close this issue if handling files larger than 1 GB is not really a priority.

Thanks for the help

rbrito commented 5 years ago

Can you share this file? It sure sounds interesting and I would like to have a look at it.

Thanks,

Rogério Brito.

zvezdochiot commented 5 years ago

@rbrito say> It sure sounds interesting and I would like to have a look at it.

Use pdftk to process the file in parts.

pts commented 1 year ago

pdfsizeopt indeed uses a lot of memory for large PDF files, because it keeps the parsed version of the entire PDF file in memory. It also keeps multiple versions of compressed image data in memory for the current image being optimized.

Throwing more memory at it should make it work. Unfortunately there is no easy estimate for the total required memory for a given input file.

In the meantime, splitting the PDF file on some page boundary (with pdftk or qpdf), running pdfsizeopt on the split PDF files individually, and joining the results may work for some PDFs.
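The split, optimize, and join workaround can be sketched as a small Python script. This is a minimal sketch, not a tested recipe: it assumes `qpdf` and `pdfsizeopt` are both on `PATH`, and the helper names, the `chunks` directory, and the chunk size of 20 pages are all illustrative choices. It relies on qpdf's `--split-pages=N` option (which replaces `%d` in the output name with a zero-padded page range) and on `qpdf --empty --pages ... --` for joining.

```python
import glob
import os
import subprocess


def split_cmd(src, pages_per_chunk, out_dir):
    # qpdf substitutes %d with the zero-padded page range of each chunk,
    # so the resulting file names sort in page order.
    pattern = os.path.join(out_dir, "chunk-%d.pdf")
    return ["qpdf", f"--split-pages={pages_per_chunk}", src, pattern]


def optimize_cmd(chunk):
    # pdfsizeopt takes the input file followed by the output file.
    out = chunk[: -len(".pdf")] + ".opt.pdf"
    return ["pdfsizeopt", chunk, out], out


def join_cmd(parts, dst):
    # Start from an empty document and append all pages of each part.
    return ["qpdf", "--empty", "--pages", *parts, "--", dst]


def main(src, dst, pages_per_chunk=20, out_dir="chunks"):
    os.makedirs(out_dir, exist_ok=True)
    subprocess.run(split_cmd(src, pages_per_chunk, out_dir), check=True)
    # Skip leftovers from an earlier run so they are not optimized twice.
    chunks = sorted(
        p for p in glob.glob(os.path.join(out_dir, "chunk-*.pdf"))
        if not p.endswith(".opt.pdf")
    )
    optimized = []
    for chunk in chunks:
        cmd, out = optimize_cmd(chunk)
        subprocess.run(cmd, check=True)
        optimized.append(out)
    subprocess.run(join_cmd(optimized, dst), check=True)
```

It would be called as, for example, `main("thesis.pdf", "thesis.opt.pdf")`, where both file names are placeholders. Resources shared between pages (such as fonts) end up in several chunks, so the joined result may be larger than a whole-file optimization would be, and document-level features such as bookmarks may not survive the split and join.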

I'm keeping this issue open as a reminder to add memory optimizations.