gkovacs / pdfocr

Adds text to PDF files using the cuneiform OCR software
MIT License
325 stars 49 forks source link

compress pdf file #28

Open cbjcbj opened 7 years ago

cbjcbj commented 7 years ago

Hi, my input file to pdfocr is ~9M and my output file is about 390M. I tried to use pdftk to compress it, but the compression rate is less than 0.1%. So I wonder whether it is possible to compress the PDF file. Thank you.

wilsotc commented 7 years ago

This problem is being caused by pdftoppm. I worked around it by bypassing this utility.

cbjcbj commented 7 years ago

So is it possible to convert ppm to jpg or something else and make the pdf file smaller?

wilsotc commented 7 years ago

Yes. You can often skip the ppm format step though. PDF allows you to encode an image using other image formats including jpeg. The poppler utility pdfimages extracts the PDF encoded image in its native format, resolution, and color depth. When the image format isn't supported by your OCR software, you can fall back to conversion. When it is supported, there's no loss of PDF compression efficiency.
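As a concrete illustration (a sketch, not part of pdfocr): `pdfimages -list` reports the encoding of every embedded image, which a script could use to decide whether the native format can go straight to the OCR step or needs a conversion fallback. The column layout below follows poppler's output; the sample listing and the set of formats assumed OCR-readable are made up for illustration.

```ruby
# Decide, per embedded image, whether to extract it natively or convert,
# based on the "enc" column of `pdfimages -list` output.

# Formats we assume the OCR engine can read directly (illustrative only).
OCR_SUPPORTED = %w[jpeg ccitt].freeze

# Parse poppler's `pdfimages -list` output into [{page:, enc:}, ...].
def parse_image_list(listing)
  listing.lines.filter_map do |line|
    cols = line.split
    # Data rows start with a numeric page number; skip the two header lines.
    next unless cols.first =~ /\A\d+\z/
    { page: cols[0].to_i, enc: cols[8] }
  end
end

# Tag each image with :native (keep as-is) or :convert (fallback).
def extraction_plan(listing)
  parse_image_list(listing).map do |img|
    img.merge(action: OCR_SUPPORTED.include?(img[:enc]) ? :native : :convert)
  end
end

# Sample listing in poppler's column format (values invented for this example):
SAMPLE = <<~LIST
  page   num  type   width height color comp bpc  enc interp  object ID x-ppi y-ppi size ratio
  --------------------------------------------------------------------------------------------
     1     0 image    2550  3300  gray    1   8  jpeg   no         9  0   300   300  212K  3.4%
     2     0 image    2550  3300  mono    1   1  jbig2  no        12  0   300   300   18K  1.7%
LIST

extraction_plan(SAMPLE).each { |img| puts "page #{img[:page]}: #{img[:enc]} -> #{img[:action]}" }
```

In a real converter you would feed `pdfimages -list input.pdf` output into this instead of the embedded sample.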


cbjcbj commented 7 years ago

Thank you. You said to skip the ppm step. Do you mean I should change some lines of the Ruby code, or can I add some command-line parameters to pdfocr -i input.pdf -o output.pdf instead?

wilsotc commented 7 years ago

You would need to change the Ruby code to bypass the pdftoppm utility. The pdfimages utility is the more ideal route, but the imagemagick convert utility could also work. The bottom line is that pdftoppm is nearly worthless for monochrome scanned PDF documents.
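A minimal sketch of that change (hypothetical; pdfocr's actual internals differ) would swap the pdftoppm call for `pdfimages -all`, which recent poppler releases provide to write each embedded image in its native format. The helpers below only build the command lines, so the shape of the replacement is visible without running either tool; the output-root naming is an assumption, and unsupported formats would still need the conversion fallback discussed above.

```ruby
# Sketch: commands that could replace pdfocr's pdftoppm step.
# Names here (page flags, image roots) are illustrative, not pdfocr's API.

# Native extraction of every embedded image on one page, via poppler.
# -all keeps JPEG/CCITT/etc. data as-is instead of rasterizing to ppm;
# -f/-l restrict extraction to a page range.
def pdfimages_cmd(pdf, page, image_root)
  ["pdfimages", "-all", "-f", page.to_s, "-l", page.to_s, pdf, image_root]
end

# Fallback: convert with imagemagick when the OCR engine rejects the format.
def convert_cmd(image_file, target)
  ["convert", image_file, target]
end

puts pdfimages_cmd("input.pdf", 3, "page-3").join(" ")
# To actually run it: system(*pdfimages_cmd("input.pdf", 3, "page-3"))
```

Building the argument array and passing it to `system(*args)` avoids shell quoting problems with filenames containing spaces.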


cbjcbj commented 7 years ago

OK, thank you. I don't know Ruby, but I will give it a try.

wilsotc commented 7 years ago

fix.txt

wilsotc commented 7 years ago

This Perl script extracts all page images in their native format if they're JPG, PNG, or TIFF, using the pdfimages utility. You could use it as the basis for a more capable converter. test.zip

cbjcbj commented 7 years ago

Thank you, I will give it a try :)

wilsotc commented 7 years ago

The syntax for the Perl script is: test2.pl -i

wodin commented 7 years ago

A PDF I am trying this on actually has multiple images making up each page of the PDF. I don't know what was used to scan the PDF, but most of these images are actually tiny PBMs corresponding to various small marks on the page. The text is stored in one or a few PBMs or JPEGs per page.

For PDFs like this, I'm not sure whether it would be best to run the OCR engine on each image individually, or to convert each page to a complete image (as is done currently) before running it through the OCR engine. Either way, I would prefer it if the text could be incorporated into a copy of the original PDF, rather than using the exported images and the recognized text to build the output from scratch.

I don't know how feasible this would be, but if possible, that seems like it would be a good way to do it.