How to control word break (space) and paragraph break (new line) bias?

tesseract-ocr / tesseract

Tesseract Open Source OCR Engine (main repository)

Apache License 2.0

Environment

tesseract v5.0.0-alpha.20191030 Windows 64bit

Current Behavior:

Sometimes words are unnecessarily split by a space (proj ect), and sometimes they are run together (aproject). Sometimes separate paragraphs are merged, and other times a paragraph is broken in the middle by a new line. This depends on the source, of course, and it is quite rare. But how can I control this behavior? How can I bias the engine toward separating words and paragraphs more often or less often?

Suggested Fix:

Parameters such as word_space_bias and paragraph_break_bias, each taking a value in the range -n:0:n (where n is a number), would be nice.

There are of course other possibilities, such as factors or limits.
