OCR problem: "cannot write mode P as JPEG" exception

AEgit commented 7 years ago

Another PDF file, which proves a bit difficult to OCR: https://app.box.com/s/ffraogy4ayco5gc87t8kj406ww3o731v

Using

    ocrmypdf -l por --force myfile.pdf myfile_ocr.pdf

it was possible to OCR the file. However, many pages are OCRed very poorly (most sentences are missing, and the parts that were recognized are completely wrong). The following exception is thrown:

Original exception:

    Exception #1
      'builtins.OSError(cannot write mode P as JPEG)' raised in ...
       Task = def ocrmypdf.pipeline.select_visible_page_image(...):
       Job  = [[.../000004.page.png, .../000004.pp-background.png, .../000004.pp-clean.png, .../000004.pp-deskew.png] -> .../000004.image, <LoggingProxy>, <ocrmypdf.pipeline.JobContext>]

    Traceback (most recent call last):
      File "/usr/local/lib/python3.4/dist-packages/PIL/JpegImagePlugin.py", line 599, in _save
        rawmode = RAWMODE[im.mode]
    KeyError: 'P'

    During handling of the above exception, another exception occurred:

    Traceback (most recent call last):
      File "/usr/local/lib/python3.4/dist-packages/ruffus/task.py", line 751, in run_pooled_job_without_exceptions
        register_cleanup, touch_files_only)
      File "/usr/local/lib/python3.4/dist-packages/ruffus/task.py", line 567, in job_wrapper_io_files
        ret_val = user_defined_work_func(*params)
      File "/usr/local/lib/python3.4/dist-packages/ocrmypdf/pipeline.py", line 534, in select_visible_page_image
        im.save(output_file, format='JPEG', dpi=dpi)
      File "/usr/local/lib/python3.4/dist-packages/PIL/Image.py", line 1826, in save
        save_handler(self, fp, filename)
      File "/usr/local/lib/python3.4/dist-packages/PIL/JpegImagePlugin.py", line 601, in _save
        raise IOError("cannot write mode %s as JPEG" % im.mode)
    OSError: cannot write mode P as JPEG

I'm wondering whether the poor OCR quality for that file is due to the image quality of the document itself or whether it is related to the above exception.

I'm using the current ocrmypdf version 4.5.3.

jbarlow83 commented 7 years ago

I couldn't reproduce it, but I added a likely fix for 4.5.4 anyway. The error came up because page 4 is blank (possibly due to file corruption) and the logic for handling a blank page under `--force` was incomplete.
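
For anyone who hits the same Pillow error in their own code: the sketch below shows the general workaround, and is not the change that went into 4.5.4. JPEG has no palette mode, so Pillow refuses to write a mode `P` image; converting to `RGB` first avoids the exception. The helper name `save_as_jpeg` and the blank test image are hypothetical, introduced only for illustration.

```python
import io

from PIL import Image


def save_as_jpeg(im, fp, **kwargs):
    """Save a Pillow image as JPEG, converting modes the JPEG writer may reject.

    Hypothetical helper for illustration; not part of ocrmypdf.
    """
    # Palette (P) and RGBA images raise "cannot write mode ... as JPEG".
    # Converting everything except L, RGB and CMYK is a conservative choice.
    if im.mode not in ("L", "RGB", "CMYK"):
        im = im.convert("RGB")
    im.save(fp, format="JPEG", **kwargs)


# Stand-in for a blank palette-mode page image (assumed, not the real page 4).
page = Image.new("P", (64, 64))
buf = io.BytesIO()
save_as_jpeg(page, buf, dpi=(300, 300))
```

Converting to `RGB` discards the palette but keeps the visible colors, which is usually what is wanted for a page image.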

To improve the OCR, I suggest trying Tesseract 4 (alpha version) and consulting the documentation on the recommended ocrmypdf arguments for Tess4 (`--pdf-renderer tess4`). Tess 3 performs poorly on fonts it has not been trained with; perhaps it has not been trained with this one. It may also be that this is a technical paper that uses words outside the typical Portuguese dictionary Tesseract has, so it "corrects" ambiguous words to the wrong ones. The documentation also has instructions for disabling the Tesseract dictionary.
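
A rough sketch of what both suggestions could look like when combined, under some assumptions: that the installed ocrmypdf accepts `--tesseract-config`, and that setting the Tesseract variables `load_system_dawg` and `load_freq_dawg` to 0 is an acceptable way to switch off the dictionaries. The file name `no-dict.cfg` and both helper functions are hypothetical; check the documentation for the settings recommended for your version.

```python
import tempfile
from pathlib import Path


def write_no_dict_config(path):
    """Write a Tesseract config that disables the word lists (assumed settings)."""
    path = Path(path)
    path.write_text("load_system_dawg 0\nload_freq_dawg 0\n")
    return path


def build_ocrmypdf_args(input_pdf, output_pdf, config_path, language="por"):
    """Build the command line only; nothing is executed here."""
    return [
        "ocrmypdf",
        "-l", language,
        "--pdf-renderer", "tess4",
        "--tesseract-config", str(config_path),
        str(input_pdf),
        str(output_pdf),
    ]


workdir = Path(tempfile.mkdtemp())
cfg = write_no_dict_config(workdir / "no-dict.cfg")
args = build_ocrmypdf_args("myfile.pdf", "myfile_ocr.pdf", cfg)
# To actually run it: subprocess.run(args, check=True)
```

Building the argument list separately from running it makes it easy to inspect the command before committing to a long OCR run.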

AEgit commented 7 years ago

Thanks for the quick reply. I can confirm that the error messages no longer appear with version 4.5.4.

As you said, the fix didn't change the OCR results, so I will have to play around a bit with Tesseract 4 and see whether it gives better results.

Thanks again for your help!

jbarlow83 commented 7 years ago

You noticed a change in file size. Because I regularly run ocrmypdf on batches of >10k files, I watch any such reports closely.

With `--force-ocr`, ocrmypdf must rasterize every page and save the rasterized output to a new file. It so happens that the input file is optimized in a way that is lost when the whole page is rasterized, so the output images are 55% larger by pixel count to preserve the original resolution. The average compression ratio is better in the output file, but not enough to compensate for so many more pixels.
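
To put a number on "not enough to compensate", here is a back-of-the-envelope sketch. It assumes image size is roughly pixel count times average bits per pixel; only the 55% figure comes from this file, and the 20% compression gain is a made-up value for illustration.

```python
def output_size_ratio(pixel_growth, bpp_change):
    """Output/input size ratio, assuming size = pixels * average bits per pixel.

    pixel_growth: fractional increase in pixel count (0.55 means 55% more).
    bpp_change: fractional change in bits per pixel (negative means better compression).
    """
    return (1 + pixel_growth) * (1 + bpp_change)


# Bits per pixel would have to drop by more than this fraction
# for the output to end up smaller than the input.
break_even = 1 - 1 / (1 + 0.55)

# Hypothetical: compression improves by 20%, which is below break-even.
ratio = output_size_ratio(0.55, -0.20)
```

With 55% more pixels, the break-even improvement works out to about 35%, so a modest gain in compression still leaves a larger file.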

This file looks like it would work without --force-ocr.

AEgit commented 7 years ago

Yes, sorry about that - I initially reported the increase in file size, but then realised that the old file had been OCRed without the `--force` option. Indeed, when I reran the OCR process without `--force`, I got a similar file size. That's why I decided to edit my comment.

OCR problem: "cannot write mode P as JPEG" exception #151