gkovacs / pdfocr

Adds text to PDF files using the cuneiform OCR software
MIT License

Support for black and white, and grayscale pdf files #23

Open snowboard975 opened 9 years ago

snowboard975 commented 9 years ago

Currently, pdfocr converts b/w and grayscale PDFs to color ppm images and runs Tesseract on them. As a result, for b/w PDF files the output of pdfocr is about 10 to 100 times bigger than the input file. There is a way to reduce the file size.

Line 331 of the newest version of pdfocr reads:
sh "pdftoppm -r 300 #{shell_escape(basefn)}.pdf >#{shell_escape(basefn)}.ppm"

If this line is replaced with the following for b/w files,
sh "gs -dSAFER -dBATCH -dNOPAUSE -sDEVICE=pbmraw -r300 -sOutputFile=#{shell_escape(basefn)}.ppm #{shell_escape(basefn)}.pdf"

or with the following for grayscale files,
sh "gs -dSAFER -dBATCH -dNOPAUSE -sDEVICE=pgmraw -r300 -sOutputFile=#{shell_escape(basefn)}.ppm #{shell_escape(basefn)}.pdf"

Then the ppm file is b/w or grayscale, so the output file of pdfocr is much smaller than it is now. The drawback is that this change always renders the pages as b/w or grayscale. So it would be nice if pdfocr had an additional option such as -gray or -mono that selects the matching command for the ppm conversion.
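A rough sketch of how such an option could pick the conversion command, reusing the three commands quoted above. The helper name rasterize_command and the :mono / :gray / :color symbols are made up for illustration, and Shellwords.escape from the Ruby standard library stands in for pdfocr's own shell_escape helper; this is not how pdfocr is actually structured.

```ruby
require "shellwords"

# Hypothetical helper (not part of pdfocr): returns the shell command that
# rasterizes <basefn>.pdf into <basefn>.ppm for the requested color mode.
# Shellwords.escape is used here as a stand-in for pdfocr's shell_escape.
def rasterize_command(basefn, mode = :color)
  pdf = "#{Shellwords.escape(basefn)}.pdf"
  ppm = "#{Shellwords.escape(basefn)}.ppm"

  case mode
  when :mono
    # 1-bit output via Ghostscript's pbmraw device
    "gs -dSAFER -dBATCH -dNOPAUSE -sDEVICE=pbmraw -r300 -sOutputFile=#{ppm} #{pdf}"
  when :gray
    # 8-bit grayscale output via Ghostscript's pgmraw device
    "gs -dSAFER -dBATCH -dNOPAUSE -sDEVICE=pgmraw -r300 -sOutputFile=#{ppm} #{pdf}"
  when :color
    # current behaviour: full-color output via pdftoppm
    "pdftoppm -r 300 #{pdf} >#{ppm}"
  else
    raise ArgumentError, "unknown mode: #{mode.inspect}"
  end
end
```

With something like this, line 331 could become sh rasterize_command(basefn, mode), where mode is set from a -mono or -gray command-line flag and defaults to :color.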

P.S.: pdftoppm also supports the -mono and -gray options, but for some reason its -mono option reduces the image quality. So I avoided -mono with pdftoppm and used the gs command instead.

wilsotc commented 7 years ago

Why not extract the raw image as it's encoded in the PDF? The same file format at the same color depth and the same resolution would be ideal. identify -verbose reports the image type, resolution and geometry of each PDF page.
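As a sketch of how that information could be used, the snippet below maps the Type: field from identify -verbose output to a color mode. The helper name and the mode symbols are hypothetical, and the assumption that the field reads Bilevel or Grayscale... for b/w and grayscale pages should be checked against the ImageMagick version in use.

```ruby
# Hypothetical helper: picks a color mode from the text printed by
# `identify -verbose`. Assumes the output has a line like "Type: Bilevel";
# anything unrecognized (or a missing Type line) falls back to :color.
def mode_from_identify(output)
  type = output[/^\s*Type:\s*(\S+)/, 1]
  case type
  when /\ABilevel/i   then :mono
  when /\AGrayscale/i then :gray
  else :color
  end
end
```

The caller would pass in the captured output of identify -verbose for one page and feed the result to whatever chooses the conversion command.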

As of version 0.50, pdftoppm produces broken PDFs when -mono is used. Maybe it should be abandoned?