Multiple conversion errors when analyzing lines that overlap or are too close to each other

rafaelgc80 commented 3 years ago

Environment

Tesseract Version: 3.05.02
Platform: Debian 9.12 64-bit

Current Behavior:

I am trying to convert subtitles to text. The lines are very close together and sometimes even overlap, and Tesseract fails to recognize the text correctly.

Sometimes it merges a letter from the upper line with one from the lower line into a single letter. Other times it splits a diacritic from its letter and interprets it as a period or comma. The following example illustrates both cases:

Preprocessed image: sub1_pre
Processed image: sub1
Tesseract output: Saeratu. . . Uz laUSĪSI manus notelkumus?

The p in the upper line and the k in the lower line are recognized together as an 'e', and the dots of the i's in the lower line are interpreted as periods in the upper line.

In the following example, the lines do not touch or overlap, but their closeness still misleads Tesseract:

Preprocessed image: sub2_pre
Processed image: sub2
Tesseract output: Nestāvļe; kā tāpls payiāns un pasutlet "MIIIer the" alu.

As before, diacritics in the lower line are taken as part of upper line letters.

As suggested in the documentation, I tried adjusting the textord_min_linesize parameter. The best value I found was 1.75, which corrected many errors, but a lot of them, including the two above, still remain. Larger values of this parameter produced garbage in the output.
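
For anyone trying to reproduce this, here is a minimal sketch of how the parameter can be passed on the command line through Tesseract's `-c VAR=VALUE` option. The file names `sub1.png` and `sub1_out`, the language code `lav`, and the helper name `build_tesseract_command` are assumptions for illustration only; substitute your own image, output base, and traineddata name.

```python
import shlex


def build_tesseract_command(image_path, output_base, lang, min_linesize=1.75):
    """Build the argument list for a Tesseract run that overrides
    textord_min_linesize via the -c VAR=VALUE option."""
    return [
        "tesseract",
        image_path,       # input image (hypothetical file name below)
        output_base,      # output base name; Tesseract appends .txt
        "-l", lang,       # language / traineddata name (assumed)
        "-c", f"textord_min_linesize={min_linesize}",
    ]


# Hypothetical inputs, used only to show the resulting command line.
cmd = build_tesseract_command("sub1.png", "sub1_out", "lav")
print(shlex.join(cmd))
```

The list can be handed to `subprocess.run(cmd, check=True)` on a machine where Tesseract is installed; the sketch only prints the command so that it runs without it.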

I am using a training set with several variants of the Arial and Helvetica fonts. Modifying the set didn't lead to better results.

I would really prefer not to develop a preprocessing step that cuts the images into separate lines, because a clean cut is very hard to find and, in some cases, does not exist.
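
To illustrate what I mean by a clean cut, here is a rough sketch of the usual row-projection approach, assuming the image is already binarized into rows of 0 (background) and 1 (ink). The function names and the tiny sample images are made up for illustration and are not my actual subtitles. When a descender or diacritic bridges the two lines, no blank row exists and the search returns nothing.

```python
def row_ink_profile(image):
    """Count ink pixels (value 1) in each row of a binary image."""
    return [sum(row) for row in image]


def find_clean_cut(image, top_margin=1, bottom_margin=1):
    """Return the index of the first fully blank interior row,
    or None if every interior row contains some ink."""
    profile = row_ink_profile(image)
    for y in range(top_margin, len(profile) - bottom_margin):
        if profile[y] == 0:
            return y
    return None


# Two text lines separated by one blank row: a clean cut exists.
separated = [
    [0, 1, 1, 0],
    [0, 1, 1, 0],
    [0, 0, 0, 0],
    [1, 1, 0, 0],
    [1, 1, 0, 0],
]

# A descender from the upper line reaches into the lower line: no blank row.
touching = [
    [0, 1, 1, 0],
    [0, 1, 1, 0],
    [0, 0, 1, 0],
    [1, 1, 1, 0],
    [1, 1, 0, 0],
]

print(find_clean_cut(separated))
print(find_clean_cut(touching))
```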

Expected Behavior:

Tesseract should convert the images successfully, with no need to pre-cut them.

Any suggestions?

Thanks in advance.

zdenop commented 3 years ago

First of all, you are using an unsupported Tesseract version. Next, please provide the original image for testing.

rafaelgc80 commented 3 years ago

Thank you for your reply, @zdenop

> First of all, you are using an unsupported Tesseract version.

I am aware of that, but I can't upgrade to a newer version of Tesseract at the moment.

> Next, please provide the original image for testing.

I just edited the post with the pre-processed image.

zdenop commented 3 years ago

If you cannot update to the latest version, then your report breaks the basic rule for creating an issue.
