Platform: Linux jk-XPS-13 5.0.0-25-generic #26-Ubuntu SMP Thu Aug 1 12:04:58 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux
Current Behavior:
This page of a book is recognized perfectly except for „die Sache selbst“ (end of the third line), which becomes „die Sache selbst‘ (with a single closing quote). The other single quote mark ends up in a separate block containing only the very small character "C".
I'm sorry I could not provide a cropped test image, but for smaller regions the problem disappears.
I'm calling tesseract with default parameters:
tesseract testcase.png - -l deu
When called with psm 6 (single uniform block of text) it works, but I don't want to lose the layout information.
tesseract testcase.png - -l deu --psm 6
This is of course a minor bug, but maybe it's also easy to fix. It happens about once in a hundred pages. Sometimes footnote numbers get lost the same way.
The problem appears at least with tesseract 4.0 / 4.1 / master and in all oem modes.
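Since this only shows up about once in a hundred pages, a small script can help find the affected pages. The following is a rough sketch, assuming the page was also run with TSV output (for example `tesseract testcase.png - -l deu tsv`) and that a stray block shows up there as a block whose whole text is a single character. The function name `find_single_char_blocks` is made up for this illustration.

```python
import csv
import io
from collections import defaultdict


def find_single_char_blocks(tsv_text):
    """Return (page_num, block_num) pairs whose entire text is one character."""
    reader = csv.DictReader(
        io.StringIO(tsv_text), delimiter="\t", quoting=csv.QUOTE_NONE
    )
    words_per_block = defaultdict(list)
    for row in reader:
        # Level 5 rows are words; lower levels are page/block/paragraph/line boxes.
        if row["level"] != "5":
            continue
        text = (row["text"] or "").strip()
        if text:
            key = (int(row["page_num"]), int(row["block_num"]))
            words_per_block[key].append(text)
    # A block is suspicious if all of its words together are a single character.
    return sorted(
        key for key, words in words_per_block.items()
        if len("".join(words)) == 1
    )
```

Blocks reported this way are only candidates: a page can legitimately contain a one-character block (a page number, for instance), so the result still needs a manual look.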
Expected Behavior:
Tesseract should not split off single characters into separate regions.
Hi, I would like to work on this issue. I believe I can help resolve it. Please let me know if there are any specific guidelines or considerations I should keep in mind while working on this.
Suggested Fix:
Maybe padding the recognized blocks a bit?
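To make the padding idea concrete, here is a minimal sketch of growing a block's bounding box by a fixed margin and clamping it to the page. This is not how Tesseract's layout analysis works internally; `pad_box` and its parameters are hypothetical names, and whether such padding would actually pull the stray quote mark back into the text block is untested.

```python
def pad_box(left, top, width, height, pad, page_width, page_height):
    """Grow a (left, top, width, height) box by `pad` pixels on every side.

    The result is clamped so it never extends beyond the page.
    """
    x0 = max(0, left - pad)
    y0 = max(0, top - pad)
    x1 = min(page_width, left + width + pad)
    y1 = min(page_height, top + height + pad)
    return x0, y0, x1 - x0, y1 - y0
```

For example, `pad_box(10, 10, 100, 20, 5, 200, 200)` grows the box by 5 pixels in each direction, while a box that already touches the page edge is only extended on the sides where there is room.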