AEgit closed 7 years ago
I couldn't reproduce it, but I added a likely fix anyway for 4.5.4. The error came up because page 4 is blank (possibly due to file corruption) and the logic for a blank PDF given `--force` was incomplete.
To improve the OCR I suggest trying Tesseract 4 (alpha version) and consulting the documentation on the recommended ocrmypdf arguments for Tess4 (`--pdf-renderer tess4`). Tess 3 performs poorly on fonts it has not been trained with, and it may not have been trained with this one. It may also be that this is a technical paper whose vocabulary falls outside the typical Portuguese dictionary Tesseract has, so it "corrects" ambiguous words to the wrong ones. The documentation also has instructions for disabling the Tesseract dictionary.
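For anyone scripting this, here is a minimal sketch of how the suggested invocation could be assembled. It only uses options already mentioned in this thread (`-l por`, `--pdf-renderer tess4`, `--force-ocr`); the function name and its parameters are hypothetical, and whether `tess4` is accepted depends on the installed ocrmypdf and Tesseract versions, so check the documentation for your version first.

```python
import shlex


def build_ocrmypdf_command(input_pdf, output_pdf, language="por",
                           use_tess4=True, force_ocr=False):
    """Assemble an ocrmypdf argument list (hypothetical helper, not part of ocrmypdf)."""
    cmd = ["ocrmypdf", "-l", language]
    if use_tess4:
        # Renderer suggested above for use with Tesseract 4; version dependent.
        cmd += ["--pdf-renderer", "tess4"]
    if force_ocr:
        # Rasterizes every page, so only add this when it is really needed.
        cmd.append("--force-ocr")
    cmd += [input_pdf, output_pdf]
    return cmd


command = build_ocrmypdf_command("myfile.pdf", "myfile_ocr.pdf")
# Print the command for inspection instead of running it.
print(shlex.join(command))
```

The list can be passed to `subprocess.run(command, check=True)` once the printed command looks right.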
Thanks for the quick reply. I can confirm that the error messages no longer appear with version 4.5.4.
As you said, the fix didn't change the OCR results, so I will have to play around a bit with Tesseract 4 and see whether it gives better results.
Thanks again for your help!
You noticed a change in file size. Because I regularly run ocrmypdf on batches of >10k files, I watch any such reports closely.
With `--force-ocr`, ocrmypdf must rasterize every page and save the rasterized output to a new file. It so happens that the input file is optimized in a way that is lost when the whole page is rasterized, so the output images are 55% larger by pixel count to preserve the original resolution. The average compression ratio is better in the output file, but not enough to compensate for so many more pixels.
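As a rough illustration of that trade-off, here is a back-of-the-envelope sketch. It assumes a simplified model in which file size is proportional to pixel count divided by compression ratio, with the same bit depth before and after. The 55% figure comes from the file above; the compression gains are made-up values for illustration only, not measurements.

```python
def relative_size(pixel_growth, compression_gain):
    """Output size divided by input size under the simplified model.

    pixel_growth: fractional increase in pixel count (0.55 means 55% more pixels).
    compression_gain: fractional improvement in compression ratio (assumed values).
    """
    return (1.0 + pixel_growth) / (1.0 + compression_gain)


PIXEL_GROWTH = 0.55  # reported for this file

# Hypothetical compression improvements, chosen only to show the trend.
for gain in (0.10, 0.25, 0.55):
    ratio = relative_size(PIXEL_GROWTH, gain)
    print(f"compression {gain:.0%} better -> output is {ratio:.2f}x the input size")
```

Under this model the output only stops growing once the compression ratio improves by as much as the pixel count does.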
This file looks like it would work without `--force-ocr`.
Yes, sorry about that - I initially reported the increase in file size, but then realised that the old file had been OCRed without the `--force` option. Indeed, when rerunning the OCR process without `--force`, I got a similar file size. That's why I decided to edit my comment.
Another PDF file, which proves a bit difficult to OCR: https://app.box.com/s/ffraogy4ayco5gc87t8kj406ww3o731v
Using
ocrmypdf -l por --force myfile.pdf myfile_ocr.pdf
it was possible to OCR the file. However, many pages are OCRed very poorly (most sentences are missing, and the recognized parts are completely wrong). The following exceptions are thrown:
I'm just wondering whether the poor OCR quality for that file is due to the image quality of the document itself or to the above exceptions.
I'm using the current ocrmypdf version 4.5.3.