Open rafaelgc80 opened 3 years ago
First of all, you are using an unsupported Tesseract version. Next: please provide the original image for testing.
Thank you for your reply, @zdenop
First of all, you are using an unsupported Tesseract version.
I am aware of that, but I can't upgrade to a newer version of Tesseract for the moment.
Next: please provide the original image for testing.
I just edited the post with the pre-processed image.
If you cannot update to the latest version, then your report breaks the basic rule for creating an issue.
Environment
Current Behavior:
I am trying to convert subtitles to text. The lines are very close together and sometimes even overlap, and Tesseract fails to recognize the text correctly.
Sometimes it joins the upper and lower letter to produce a single letter. Other times it splits a diacritic from its letter and interprets it as a point or comma. The following example illustrates both cases:
[Image: preprocessed image] [Image: processed image]
Tesseract output: Saeratu. . . Uz laUSĪSI manus notelkumus?
The upper p and lower k are recognized together as an 'e', and the i dots of the lower line are interpreted as points in the upper line.
In the following example, the lines do not touch or overlap, but their closeness still misleads Tesseract:
[Image: preprocessed image] [Image: processed image]
Tesseract output: Nestāvļe; kā tāpls payiāns un pasutlet "MIIIer the" alu.
As before, diacritics in the lower line are taken as part of upper line letters.
I tried, as suggested in the documentation, adjusting the parameter textord_min_linesize. I found that with an optimal value of 1.75, many errors were corrected, but a lot of them, including the two above, still remain. Greater values of this parameter produced garbage in the output.
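For reference, a minimal sketch of how such a parameter sweep could be scripted. It only builds the command lines and assumes the installed `tesseract` CLI accepts `-c VAR=VALUE` after the output base. The helper names, the file name `subtitle.png`, and the list of values are hypothetical, not part of my actual setup.

```python
import shlex


def build_tesseract_command(image_path, output_base, min_linesize, lang=None):
    """Return the argv list for one tesseract run with textord_min_linesize set.

    Hypothetical helper: assumes the CLI form
    `tesseract IMAGE OUTPUTBASE [-l LANG] [-c VAR=VALUE]`.
    """
    cmd = ["tesseract", image_path, output_base]
    if lang:
        cmd += ["-l", lang]
    # -c overrides a single config variable for this run only.
    cmd += ["-c", f"textord_min_linesize={min_linesize}"]
    return cmd


def sweep_commands(image_path, values, lang=None):
    """Build one command per candidate value, each writing to its own output base."""
    return [
        build_tesseract_command(image_path, f"out_{value}", value, lang)
        for value in values
    ]


if __name__ == "__main__":
    # Print the commands instead of running them, so they can be inspected first.
    for cmd in sweep_commands("subtitle.png", [1.25, 1.5, 1.75, 2.0]):
        print(shlex.join(cmd))
```

Each printed line can be pasted into a shell, or passed to `subprocess.run(cmd, check=True)`, and the resulting text files compared by hand.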
I am using a training set with several variants of the Arial and Helvetica fonts. Modifying the set didn't lead to better results.
I would really prefer not to develop a preprocessing step to cut the images, because a clean cut is very hard to find and, in some cases, non-existent.
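To illustrate why cutting is unreliable, here is a rough sketch of the usual approach: a horizontal projection profile that looks for the row with the least ink between the two text lines. It assumes a binarized image given as a list of rows (1 = ink, 0 = background). The function names and the toy image are hypothetical.

```python
def row_ink_profile(image):
    """Count ink pixels in each row of a binarized image (1 = ink, 0 = background)."""
    return [sum(row) for row in image]


def find_cut_row(image, search_top, search_bottom):
    """Return (row_index, ink_count) for the row with the least ink
    in the half-open range [search_top, search_bottom).

    An ink_count of 0 means a clean cut exists. Anything above 0 means
    the cut would slice through a glyph or a diacritic.
    """
    profile = row_ink_profile(image)
    best = min(range(search_top, search_bottom), key=lambda r: profile[r])
    return best, profile[best]


if __name__ == "__main__":
    # Toy 6x6 image: two "text lines" whose strokes touch around rows 2 and 3.
    toy = [
        [0, 1, 1, 0, 1, 0],
        [1, 1, 1, 1, 1, 0],
        [0, 1, 0, 0, 1, 0],
        [0, 0, 0, 1, 0, 0],
        [1, 1, 0, 1, 1, 1],
        [0, 1, 1, 1, 0, 1],
    ]
    row, ink = find_cut_row(toy, 1, 5)
    print(f"best cut at row {row}, ink crossing the cut: {ink}")
```

In the toy image the best candidate row still contains ink, which is exactly the situation with overlapping subtitle lines: any straight cut removes part of a letter or a diacritic.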
Expected Behavior:
Tesseract should successfully convert the images without needing them to be pre-cut.
Any suggestions?
Thanks in advance.