Open snowboard975 opened 9 years ago
Why not extract the raw image as it's encoded in the PDF? The same file format at the same color depth at the same resolution would be ideal. identify
As of ver 0.50, pdftoppm produces broken PDFs when -mono is used. Maybe it should be abandoned?
Currently, pdfocr converts b/w and grayscale PDFs to color ppm files and runs Tesseract on them. As a result, for b/w PDF files the output of pdfocr is about 10 to 100 times larger than the input. There is a way to reduce the file size, though.
Line 331 of the newest version of pdfocr reads:
    sh "pdftoppm -r 300 #{shell_escape(basefn)}.pdf >#{shell_escape(basefn)}.ppm"
For b/w input, this line can be replaced with:
    sh "gs -dSAFER -dBATCH -dNOPAUSE -sDEVICE=pbmraw -r300 -sOutputFile=#{shell_escape(basefn)}.ppm #{shell_escape(basefn)}.pdf"
For grayscale input, it can be replaced with:
    sh "gs -dSAFER -dBATCH -dNOPAUSE -sDEVICE=pgmraw -r300 -sOutputFile=#{shell_escape(basefn)}.ppm #{shell_escape(basefn)}.pdf"
With either replacement, the ppm file is b/w or grayscale, so the output file of pdfocr is much smaller than it is now. The drawback is that the conversion is then always b/w or grayscale, whatever the input is. So it would be nice if pdfocr had an additional option such as -gray or -mono that selects the ppm conversion command to use.
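To make the request concrete, here is a rough sketch of how such an option could pick the conversion command. This is not pdfocr's actual code: the `ppm_command` helper, the `GS_DEVICES` table, and the `:mono` / `:gray` / `:color` mode names are all made up for illustration, and `Shellwords.escape` from the Ruby standard library stands in for pdfocr's own `shell_escape`.

```ruby
require "shellwords"

# Hypothetical mapping from a color-mode option to a Ghostscript device.
# pbmraw and pgmraw are the devices used in the gs commands in this issue.
GS_DEVICES = { mono: "pbmraw", gray: "pgmraw" }.freeze

# Hypothetical helper: returns the shell command that converts
# basefn.pdf to basefn.ppm for the requested mode.
def ppm_command(basefn, mode = :color)
  # Stand-in for pdfocr's shell_escape (assumption).
  pdf = "#{Shellwords.escape(basefn)}.pdf"
  ppm = "#{Shellwords.escape(basefn)}.ppm"

  if GS_DEVICES.key?(mode)
    # b/w or grayscale: use Ghostscript with the matching device.
    "gs -dSAFER -dBATCH -dNOPAUSE -sDEVICE=#{GS_DEVICES[mode]} -r300 " \
      "-sOutputFile=#{ppm} #{pdf}"
  elsif mode == :color
    # Default: the pdftoppm command pdfocr uses today.
    "pdftoppm -r 300 #{pdf} >#{ppm}"
  else
    raise ArgumentError, "unknown mode: #{mode.inspect}"
  end
end

# Usage: print the command that a -mono option would run.
puts ppm_command("page", :mono)
```

In pdfocr itself the returned string would presumably be passed to `sh`, with the mode set from the new command-line option and defaulting to the current color behavior.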
P.S. pdftoppm also supports the -mono and -gray options, but its -mono option reduces the image quality for some reason. To avoid that problem, I used the gs command instead.