Open spajak opened 4 years ago
Cheers,
I think you'll have to play with the control parameters, and find what's best suited for your scenario.
I'm pretty happy with the results I get by using the textord_space_size_is_variable=1 param.
You can probably also play with the dictionary penalties (e.g. the language_model_penalty_non_dict_word param).
Use tesseract.exe --print-parameters to get the full list of available parameters.
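In case it is useful, here is a minimal Python sketch of how such parameters can be passed on the command line with tesseract's -c VAR=VALUE option. The helper name build_tesseract_cmd, the file names page.png and page_out, and the penalty value 0.2 are placeholders of mine for illustration, not recommended settings; the sketch only builds and prints the command, it does not run tesseract.

```python
def build_tesseract_cmd(image, output_base, params, exe="tesseract"):
    """Build a tesseract command line that sets control parameters via -c.

    `build_tesseract_cmd` is a hypothetical helper written for this example.
    """
    cmd = [exe, str(image), str(output_base)]
    for name, value in params.items():
        # Each control parameter is passed as its own "-c NAME=VALUE" pair.
        cmd += ["-c", f"{name}={value}"]
    return cmd


# Illustrative values only; tune them against your own scans.
params = {
    "textord_space_size_is_variable": 1,
    "language_model_penalty_non_dict_word": 0.2,
}

# "page.png" and "page_out" are placeholder names.
cmd = build_tesseract_cmd("page.png", "page_out", params)
print(" ".join(cmd))
```

To actually run it, the list can be handed to subprocess.run(cmd, check=True) once tesseract is on the PATH (on Windows, pass exe="tesseract.exe" or a full path). Changing one parameter at a time and comparing the output text makes it easier to see which setting affects word spacing.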
Hope it helps.
Environment
tesseract v5.0.0-alpha.20191030 Windows 64bit
Current Behavior:
Sometimes words are unnecessarily split by a space (proj ect). Sometimes words are concatenated (aproject). Sometimes separate paragraphs are concatenated; other times they are broken in the middle by a new line. This depends on the source, of course, and it is quite rare. But how can I control this behavior? How can I bias the engine toward separating words/paragraphs more often or less often?
Suggested Fix:
Parameters like word_space_bias (-n:0:n) and paragraph_break_bias (-n:0:n), where n is a number, would be nice. There are of course other possibilities: some factors, some limits, etc.