Trouble with "separator lines" made of **** or ----- or =======

tesseract-ocr / langdata

Source training data for Tesseract for lots of languages

Apache License 2.0

I have noticed that Tesseract performs poorly on scanned documents that follow the old practice (going back to typewriters) of using rows of asterisks or equal signs as text separators.

Example

          Some document 

Line one

**************************

Line two

===  ===  === === ===  ===  ===

Line three

On input like this, Tesseract tries to match the line of asterisks or equal signs to text, producing output such as EERKKRKKERKKREAKREKKAKRKKKAK or RRR RRR NETT RRR RRR, which is typically not the desired outcome.

My understanding is that the issue likely comes from the training data rather than the engine itself. If so, I wonder whether the training sets could be augmented to cover these cases.
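To make the suggestion concrete, here is a minimal sketch of what such augmentation might look like, assuming the training text is a plain list of lines. The names `SEPARATOR_CHARS`, `make_separator`, and `augment_lines` are hypothetical and are not part of the langdata repository or any Tesseract tooling; the character set and length ranges are guesses based on the example above.

```python
import random

# Characters used for typewriter-style separators (an assumption based on
# the examples in this report, not an official list).
SEPARATOR_CHARS = "*-="


def make_separator(rng, min_len=10, max_len=40):
    """Return one separator line, either a solid run or short spaced groups."""
    char = rng.choice(SEPARATOR_CHARS)
    length = rng.randint(min_len, max_len)
    if rng.random() < 0.5:
        # Solid run, e.g. "**************"
        return char * length
    # Grouped run, e.g. "=== === ==="
    group = char * 3
    return " ".join([group] * max(1, length // 4))


def augment_lines(lines, rng, rate=0.2):
    """Yield the input lines, inserting a separator after some of them."""
    for line in lines:
        yield line
        if rng.random() < rate:
            yield make_separator(rng)


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so runs are repeatable
    sample = ["Line one", "Line two", "Line three"]
    for line in augment_lines(sample, rng, rate=0.5):
        print(line)
```

Whether adding lines like these to the training text would actually improve recognition is untested here; the sketch only shows how the extra separator lines could be generated.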

Trouble with "separator lines" made of **** or ----- or ======= #301