tesseract-ocr / tesseract

Tesseract Open Source OCR Engine (main repository)
https://tesseract-ocr.github.io/
Apache License 2.0
62.28k stars 9.51k forks

How to control words break (space) and paragraphs break (new line) bias? #2779

Open spajak opened 4 years ago

spajak commented 4 years ago

Environment

tesseract v5.0.0-alpha.20191030 Windows 64bit

Current Behavior:

Sometimes words are unnecessarily split by a space (proj ect). Sometimes words are joined together (aproject). Sometimes separate paragraphs are merged; other times a paragraph is broken internally by a new line. This depends on the source, of course, and it is quite rare. But how can I control this behavior? How can I bias the engine toward separating words or paragraphs more often or less often?

Suggested Fix:

Parameters such as word_space_bias -n:0:n and paragraph_break_bias -n:0:n, where n is a number, would be nice.

There are of course other possibilities, such as factors or limits.

edi33416 commented 4 years ago

Cheers,

I think you'll have to play with the control parameters, and find what's best suited for your scenario.

I'm pretty happy with the results I get using the textord_space_size_is_variable=1 parameter. You can probably also experiment with the dictionary penalties (e.g. the language_model_penalty_non_dict_word parameter).
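For anyone scripting this, here is a minimal sketch of passing those parameters on the command line with the -c NAME=VALUE option. The helper name build_tesseract_command, the file names page.png and page, and the penalty value 0.3 are all made up for illustration; only the two parameter names come from this thread.

```python
import os
import shutil
import subprocess


def build_tesseract_command(image_path, output_base, params, executable="tesseract"):
    """Return the argument list for one Tesseract run with -c overrides.

    Hypothetical helper: each entry in `params` becomes `-c name=value`.
    """
    command = [executable, image_path, output_base]
    for name, value in params.items():
        command += ["-c", f"{name}={value}"]
    return command


params = {
    "textord_space_size_is_variable": 1,
    # Illustrative value only; tune it against your own images.
    "language_model_penalty_non_dict_word": 0.3,
}
command = build_tesseract_command("page.png", "page", params)
print(" ".join(command))

# Only run OCR if Tesseract is on PATH and the (hypothetical) input exists.
if shutil.which("tesseract") and os.path.exists("page.png"):
    subprocess.run(command, check=True)
```

Keeping the parameters in a dictionary makes it easy to try several values and compare the output files side by side.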

Use tesseract.exe --print-parameters to get the full list of available parameters.
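The full list is long, so filtering it by keyword helps. The sketch below assumes each line of the listing starts with the parameter name followed by a tab; that layout is an assumption, so check it against your build's output. The helper name filter_parameters and the sample listing (including its placeholder values and descriptions) are invented for illustration.

```python
def filter_parameters(listing, keywords):
    """Return the lines whose parameter name contains any of the keywords.

    Assumes one parameter per line, with the name before the first tab.
    """
    matches = []
    for line in listing.splitlines():
        name = line.split("\t", 1)[0].strip()
        if any(keyword in name for keyword in keywords):
            matches.append(line)
    return matches


# Made-up sample; values and descriptions are placeholders, not real output.
sample = "\n".join([
    "textord_space_size_is_variable\t<value>\t<description>",
    "language_model_penalty_non_dict_word\t<value>\t<description>",
    "tessedit_pageseg_mode\t<value>\t<description>",
])

for line in filter_parameters(sample, ["space", "penalty"]):
    print(line)
```

To use it on a real installation, the listing could be captured with subprocess.run(["tesseract", "--print-parameters"], capture_output=True, text=True).stdout and passed in place of the sample.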

Hope it helps.