ocrmypdf / OCRmyPDF

OCRmyPDF adds an OCR text layer to scanned PDF files, allowing them to be searched
http://ocrmypdf.readthedocs.io/
Mozilla Public License 2.0

--threshold-final #550

Open femifrak opened 4 years ago

femifrak commented 4 years ago

I often have PDFs that contain only text but were scanned in grayscale, and I would like to binarize them to black and white in the output for better contrast on an e-reader. Is there a way to apply the "--threshold" parameter to the final output, as with --clean and --clean-final? That would be perfect!

Or do you know another good way to binarize? A single threshold for a whole page often removes light lines and changes the letters' morphology.

jbarlow83 commented 4 years ago

I agree it would be, but I've found that most threshold functions are not reliable enough to trust without manual inspection of the results. It could ruin a good scanned document if the threshold is wrong.

See http://www.leptonica.org/binarization.html for some discussion on thresholding algorithms if you are interested. Otsu is good enough for the typical case and Sauvola is sometimes better, but none are perfect. I'm not up to speed on any newer methods.
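For readers unfamiliar with Otsu's method: it picks one global threshold per image by maximizing the between-class variance of the gray-level histogram. The following is only a minimal pure-Python sketch of that idea, assuming 8-bit integer gray values in a flat list. The function names `otsu_threshold` and `binarize` are hypothetical, and this is not the code that OCRmyPDF or Leptonica actually runs.

```python
def otsu_threshold(pixels):
    """Return a global threshold t for 8-bit gray values (ints in 0..255).

    Pixels with value <= t are treated as dark (ink), the rest as light (paper).
    """
    hist = [0] * 256
    for p in pixels:
        hist[p] += 1

    total = len(pixels)
    sum_all = sum(i * h for i, h in enumerate(hist))

    best_t, best_var = 0, -1.0
    w0 = 0    # number of pixels at or below the candidate threshold
    sum0 = 0  # sum of the gray values of those pixels
    for t in range(256):
        w0 += hist[t]
        sum0 += t * hist[t]
        if w0 == 0:
            continue  # no dark class yet
        w1 = total - w0
        if w1 == 0:
            break     # no light class left
        mean0 = sum0 / w0
        mean1 = (sum_all - sum0) / w1
        # Between-class variance, up to a constant factor of 1 / total**2.
        var_between = w0 * w1 * (mean0 - mean1) ** 2
        if var_between > best_var:
            best_var, best_t = var_between, t
    return best_t


def binarize(pixels, t):
    """Map each gray value to 0 (ink) or 255 (paper) using threshold t."""
    return [0 if p <= t else 255 for p in pixels]
```

Because the threshold is derived from the whole-page histogram, a page whose faded text overlaps in gray level with background noise has no clean split to find, which is why the result needs manual inspection.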

The worst case is when the background is very noisy and has a wide dynamic range. Some older paper seems to have a lot of grain that ends up scanning in exactly this way, especially if the text has faded too.
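Sauvola's method addresses part of this by computing a separate threshold for each pixel from the mean m and standard deviation s of its neighborhood, T = m * (1 + k * (s / R - 1)). Below is a rough, unoptimized pure-Python sketch under stated assumptions: the image is a list of rows of 8-bit gray values, and the defaults for `radius`, `k`, and `R` are illustrative guesses rather than recommended settings. The function name `sauvola_binarize` is hypothetical.

```python
from math import sqrt


def sauvola_binarize(img, radius=7, k=0.34, R=128.0):
    """Binarize img (list of rows of 8-bit gray values) with a local threshold.

    Returns rows of 0 (ink) / 255 (paper). radius, k and R are tunables;
    the defaults here are assumptions for illustration only.
    """
    h, w = len(img), len(img[0])
    out = []
    for y in range(h):
        row = []
        for x in range(w):
            # Gather the window around (x, y), clipped at the image border.
            window = [
                img[j][i]
                for j in range(max(0, y - radius), min(h, y + radius + 1))
                for i in range(max(0, x - radius), min(w, x + radius + 1))
            ]
            n = len(window)
            mean = sum(window) / n
            var = sum(v * v for v in window) / n - mean * mean
            std = sqrt(max(var, 0.0))  # guard against tiny negative rounding
            threshold = mean * (1.0 + k * (std / R - 1.0))
            row.append(0 if img[y][x] <= threshold else 255)
        out.append(row)
    return out
```

A local threshold can keep faint strokes that a single global cut would drop, but it also tends to mark the dark side of any sharp background step as ink, and heavy grain still raises the local deviation, so it does not remove the need to check the output.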

femifrak commented 4 years ago

Oops, my edit overlapped with your message. I just thought that, because "--threshold" already exists, there might also be a simple way to output the already binarized pages ...

erd82 commented 1 year ago

@femifrak / @jbarlow83: Did you find a solution to this problem? I currently have a similar issue: text in light grey is not recognized at all. Example: [image]

Cheers, erd

femifrak commented 1 year ago

Sorry for the late reply. Unfortunately I have no solution. Your case is really hard, as some of the noise seems to have a similar "darkness" to your text.