Devanagari and Tamil - Recognition different for tam+san vs san+tam

gnewtothis101 / tesseract-ocr

Automatically exported from code.google.com/p/tesseract-ocr

What steps will reproduce the problem?
1. Run tesseract on the attached image with both the Devanagari (san) and Tamil (tam) traineddata, once as san+tam and once as tam+san.

What is the expected output? What do you see instead?
The recognized text differs depending on whether the languages are given as san+tam or as tam+san.
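
A minimal sketch of how the two runs could be reproduced and compared from Python. It assumes the `tesseract` executable is on the PATH and that the `san` and `tam` traineddata are installed; the helper names `build_command` and `compare_orders` are hypothetical and not part of Tesseract.

```python
import difflib
import subprocess
from pathlib import Path
from typing import List


def build_command(image: str, out_base: str, langs: List[str]) -> List[str]:
    """Build a tesseract command line; languages are joined with '+' for -l."""
    return ["tesseract", image, out_base, "-l", "+".join(langs)]


def compare_orders(image: str, langs: List[str]) -> List[str]:
    """Run tesseract with the languages in the given and reversed order,
    then return a unified diff of the two recognized texts."""
    texts = []
    for order in (langs, list(reversed(langs))):
        # Hypothetical output naming: <image>-<lang1>+<lang2>
        out_base = f"{image}-{'+'.join(order)}"
        subprocess.run(build_command(image, out_base, order), check=True)
        # tesseract writes the recognized text to <out_base>.txt
        text = Path(out_base + ".txt").read_text(encoding="utf-8")
        texts.append(text.splitlines())
    return list(difflib.unified_diff(texts[0], texts[1], lineterm=""))


# Example (requires tesseract and the attached image):
# for line in compare_orders("page-007.tif", ["san", "tam"]):
#     print(line)
```

An empty diff would mean the language order made no difference for that image.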

What version of the product are you using? On what operating system?
Latest version from git, under MSYS2 on Windows 8.

Please provide any additional information below.
The TIFF input and the recognized text for both language orders are attached.

Original issue reported on code.google.com by shreeshrii on 15 Oct 2014 at 12:34

Attachments:

page-007.tif
page-007.tif-san+tam-3.txt
page-007.tif-tam+san-3.txt

This is intended behavior. The first language specified takes priority until the text changes to another language; after that there is hysteresis, so the result depends on the order in which the languages are given. It is highly imperfect, but reasonably efficient that way. It should do better as overall recognition accuracy improves.
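
To illustrate why order matters under such a scheme, here is a toy model of "first language has priority, then hysteresis". It is only a sketch under stated assumptions, not Tesseract's actual implementation: the per-word confidence scores, the `accept` threshold, the `margin` value, and the function name `pick_languages` are all made up for the example.

```python
from typing import Dict, List


def pick_languages(word_scores: List[Dict[str, float]], order: List[str],
                   accept: float = 0.6, margin: float = 0.15) -> List[str]:
    """Toy model: keep the current language unless it scores below `accept`
    and another language beats it by at least `margin` (the hysteresis)."""
    current = order[0]  # the first specified language starts with priority
    chosen = []
    for scores in word_scores:
        if scores[current] < accept:
            # Current language is doing poorly; consider switching.
            best = max(order, key=lambda lang: scores[lang])
            if scores[best] - scores[current] >= margin:
                current = best
        chosen.append(current)
    return chosen


# Hypothetical per-word confidences for five words.
words = [
    {"san": 0.65, "tam": 0.70},  # ambiguous: whichever language is current wins
    {"san": 0.30, "tam": 0.90},  # clearly Tamil
    {"san": 0.62, "tam": 0.61},  # ambiguous again
    {"san": 0.90, "tam": 0.50},  # clearly Sanskrit
    {"san": 0.55, "tam": 0.50},  # weak for both, so no switch
]

print(pick_languages(words, ["san", "tam"]))
print(pick_languages(words, ["tam", "san"]))
```

In this toy run the two orders disagree only on the first, ambiguous word, which is the kind of order-dependent difference described above.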

Devanagari and Tamil - Recognition different for tam+san vs san+tam #1344