ocrmypdf / OCRmyPDF

OCRmyPDF adds an OCR text layer to scanned PDF files, allowing them to be searched
http://ocrmypdf.readthedocs.io/
Mozilla Public License 2.0

ColorSpace-Indexed-ICCBased-DeviceGray converted to RGB #357

Open drboone opened 5 years ago

drboone commented 5 years ago

Describe the issue: I'm reporting this much larger output file, as requested by the program. If I extract all of the scanned page images from the attached PDF using pdfimages, they come out as .pbm files. However, if I do the same to the PDF produced by ocrmypdf, they come out as .ppm files. Hopefully the attached PDF helps you track down whatever bizarre case I've managed to create.

   INFO -    4: [tesseract] Image too small to scale!! (2x36 vs min width of 3)
   INFO -    4: [tesseract] Line cannot be recognized!!
   INFO -    4: [tesseract] Image too small to scale!! (2x36 vs min width of 3)
   INFO -    4: [tesseract] Line cannot be recognized!!
WARNING -    3: [tesseract] lots of diacritics - possibly poor OCR
   INFO - Optimize ratio: 1.00 savings: 0.0%
   INFO - Output file is a PDF/A-2B (as expected)
WARNING - The output file size is 2.25× larger than the input file.
No reason for this increase is known.  Please report this issue.

To Reproduce: ocrmypdf "1st Solutions July 1985.pdf" out/"1st Solutions July 1985.pdf"

Example file: The culprit PDF is attached.

Additional context: 1st Solutions July 1985.pdf

jbarlow83 commented 5 years ago

Thank you.

The issue is that the images are marked as having a complex colorspace that ocrmypdf does not recognize, so it takes the precaution of assuming the colorspace is RGB and upgrades all of the images from monochrome to RGB.
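For readers unfamiliar with nested PDF colorspaces, here is a minimal sketch of how a colorspace such as the one in the issue title (Indexed over ICCBased over DeviceGray) could be resolved to its underlying device family. This is not OCRmyPDF's code. The function name `base_colorspace` and the list-and-dict model of the colorspace array are assumptions made for illustration only. The structure follows the PDF convention that an Indexed space names a base space and an ICCBased stream carries a component count `/N` and an optional `/Alternate`.

```python
def base_colorspace(cs):
    """Resolve a modeled PDF colorspace to its underlying device family.

    `cs` is either a name string such as "/DeviceGray" or a list whose
    first element is the colorspace family (a simplified, hypothetical model).
    """
    # Simple names such as "/DeviceGray" resolve to themselves.
    if isinstance(cs, str):
        return cs

    family = cs[0]
    if family == "/Indexed":
        # [/Indexed base hival lookup]: the palette's base space decides.
        return base_colorspace(cs[1])
    if family == "/ICCBased":
        # [/ICCBased stream]: the stream dictionary is modeled as a dict.
        profile = cs[1]
        if "/Alternate" in profile:
            return base_colorspace(profile["/Alternate"])
        # Without /Alternate, fall back on the component count /N.
        return {1: "/DeviceGray", 3: "/DeviceRGB", 4: "/DeviceCMYK"}[profile["/N"]]
    raise ValueError(f"unhandled colorspace family: {family!r}")


# A two-entry palette over a one-component ICC profile, as in the issue title.
page_image_cs = [
    "/Indexed",
    ["/ICCBased", {"/N": 1, "/Alternate": "/DeviceGray"}],
    1,
    b"\x00\xff",
]
print(base_colorspace(page_image_cs))
```

A resolver like this only identifies the base family. Whether a given image can safely stay monochrome also depends on the palette contents and bit depth, which this sketch does not inspect.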

You could work around this with pdfimages by outputting to monochrome and then repacking as a PDF.
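One way the workaround might be scripted is sketched below. It assumes the `pdfimages` tool (from poppler) and the separate `img2pdf` tool are both installed, that each page holds exactly one scanned image, and that `pdfimages -png` writes those images as monochrome PNGs for this file. None of these assumptions is confirmed in this thread, and the helper names are hypothetical.

```python
import subprocess
from pathlib import Path


def extract_command(pdf, prefix):
    """Build the pdfimages call that writes <prefix>-NNN.png files."""
    return ["pdfimages", "-png", str(pdf), str(prefix)]


def repack_command(images, out_pdf):
    """Build the img2pdf call that wraps the images into a new PDF."""
    return ["img2pdf", *map(str, images), "-o", str(out_pdf)]


def rebuild(pdf, workdir, out_pdf):
    """Extract page images and repack them (assumes one image per page)."""
    workdir = Path(workdir)
    workdir.mkdir(parents=True, exist_ok=True)
    subprocess.run(extract_command(pdf, workdir / "page"), check=True)
    # pdfimages zero-pads its numbering, so a plain sort keeps page order.
    images = sorted(workdir.glob("page-*.png"))
    subprocess.run(repack_command(images, out_pdf), check=True)


# Example (not run here):
# rebuild("1st Solutions July 1985.pdf", "extracted", "rebuilt.pdf")
```

Repacking this way discards anything in the original PDF other than the images, and page dimensions may change if the extracted files do not carry resolution metadata, so the result is worth checking before running ocrmypdf on it.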

Not sure when I'll be able to address this.

drboone commented 5 years ago

Yes, I rebuilt the PDF trivially. Thanks for looking!