femifrak opened 4 years ago
I agree it would be, but I've found that most threshold functions are not reliable enough to trust without manual inspection of the results. It could ruin a good scanned document if the threshold is wrong.
See http://www.leptonica.org/binarization.html for some discussion on thresholding algorithms if you are interested. Otsu is good enough for the typical case, Sauvola is sometimes better, none are perfect. I'm not up to speed on any newer methods.
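For illustration, here is a minimal sketch comparing a global Otsu threshold with a local Sauvola threshold. This uses scikit-image rather than Leptonica, and the file names are placeholders; parameter values like `window_size` usually need tuning per document.

```python
import numpy as np
from skimage import io
from skimage.filters import threshold_otsu, threshold_sauvola

gray = io.imread("scan_page.png", as_gray=True)  # hypothetical input file

# Otsu: one global threshold for the whole page.
binary_otsu = gray > threshold_otsu(gray)

# Sauvola: a local threshold computed over a sliding window, which tends
# to cope better with uneven illumination and noisy backgrounds.
binary_sauvola = gray > threshold_sauvola(gray, window_size=25, k=0.2)

io.imsave("otsu.png", (binary_otsu * 255).astype(np.uint8))
io.imsave("sauvola.png", (binary_sauvola * 255).astype(np.uint8))
```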
The worst case is when the background is very noisy and has a wide dynamic range. Some older paper seems to have a lot of grain that ends up scanning in exactly this way, especially if the text has faded too.
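One common mitigation for that failure mode is to estimate the uneven background (e.g. with a large Gaussian blur) and divide it out before thresholding, which flattens the dynamic range. A hedged sketch using OpenCV, with placeholder file names and an assumed blur sigma:

```python
import cv2

gray = cv2.imread("grainy_scan.png", cv2.IMREAD_GRAYSCALE)

# Estimate the slowly varying background with a large Gaussian blur
# (sigma of 25 is an assumption; tune to the scan resolution).
background = cv2.GaussianBlur(gray, (0, 0), 25)

# Divide out the background to flatten illumination, then apply Otsu.
norm = cv2.divide(gray, background, scale=255)
_, binary = cv2.threshold(norm, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)
cv2.imwrite("flattened.png", binary)
```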
Oops, my edit overlapped with your message. I just thought that, since "--threshold" already exists, there might also be a simple way to output the already-binarized pages ...
@femifrak / @jbarlow83: Did you ever find a solution for this problem? I currently have a similar issue where text in light grey is not recognized at all. Example: Cheers, erd
Sorry for the late reply. Unfortunately I have no solution. Your case is really hard, as some of the noise seems to have a similar "darkness" to your text.
I often have PDFs containing only text but scanned in grayscale, and I would like to binarize them to black and white in the output for better contrast on an e-reader. Is there a way to apply the "--threshold" parameter to the final output, like with --clean and --clean-final? That would be perfect!
Or do you know another good way to binarize? A single global threshold per page often removes light lines and changes the letters' morphology.
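As an illustration of one possible workaround (not an OCRmyPDF feature): rasterize each page, apply a local Sauvola threshold, and write the result back as a bitonal PDF. This is a sketch assuming pdf2image (which needs poppler), scikit-image, and Pillow are installed; file names and parameters are placeholders.

```python
import numpy as np
from pdf2image import convert_from_path
from PIL import Image
from skimage.filters import threshold_sauvola

# Rasterize the grayscale PDF at a resolution high enough for text.
pages = convert_from_path("input.pdf", dpi=300, grayscale=True)

out = []
for page in pages:
    gray = np.asarray(page, dtype=np.float64) / 255.0
    # Local threshold preserves light strokes better than one global value.
    mask = gray > threshold_sauvola(gray, window_size=25, k=0.2)
    out.append(Image.fromarray((mask * 255).astype(np.uint8)).convert("1"))

# Reassemble the bitonal pages into a single PDF.
out[0].save("binarized.pdf", save_all=True, append_images=out[1:])
```

Note that this rasterizes the PDF, so any existing text layer is lost; you would need to re-run OCR on the result if you want searchable text.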