michaelethompson / tesseract-ocr

Automatically exported from code.google.com/p/tesseract-ocr

Unicharambigs Doesn't Work if Boxfile has Multiple Characters Defined in a Box #906

Open GoogleCodeExporter opened 9 years ago

GoogleCodeExporter commented 9 years ago
What steps will reproduce the problem?
1. Create a boxfile for a page image that contains the long-s+h ligature. This 
ligature is not defined in Unicode as a single character. (It is defined in the 
MUFI encoding of Unicode as U+EAB1.) Create a single box for that ligature, and 
then give it the value "U+017F U+0068" (that is, ſ followed by h).

2. Add some lines to the unicharambigs file for that language that try to 
force it to change the long-s+h into "sh". I tried saying that long-s+h should 
be changed into "sh" and that any other long-s should be changed into "s". 

3. Then try changing the boxfile so that instead of having the ligature 
defined as two characters, it's defined as one. In my case I used U+EAB1. Then 
add this to the unicharambigs file and say that it should always be turned into 
"sh".

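For illustration, the entries described in steps 2 and 3 might look like the output of the sketch below. It assumes the tab-separated unicharambigs layout (source count, space-separated source unichars, target count, space-separated target unichars, and a final flag where 1 means a mandatory replacement). The helper name `ambig_line` is hypothetical, and the lines it produces are a guess at the intent, not a copy of the attached file.

```python
# Sketch only: builds unicharambigs-style lines for the cases described above.
# Assumed layout: <n_src>\t<src unichars>\t<n_tgt>\t<tgt unichars>\t<flag>

LONG_S = "\u017f"          # LATIN SMALL LETTER LONG S
MUFI_LONG_S_H = "\ueab1"   # MUFI private-use code point for the long-s+h ligature


def ambig_line(source, target, mandatory=True):
    """Format one entry; source and target are lists of unichar strings."""
    fields = [
        str(len(source)),
        " ".join(source),
        str(len(target)),
        " ".join(target),
        "1" if mandatory else "0",
    ]
    return "\t".join(fields)


# Case 2: the box label is one unichar made of two code points (ſ + h).
case2 = [
    ambig_line([LONG_S + "h"], ["s", "h"]),
    ambig_line([LONG_S], ["s"]),
]

# Case 3: the box label is the single MUFI code point.
case3 = ambig_line([MUFI_LONG_S_H], ["s", "h"])

for line in case2 + [case3]:
    print(line)
```

The only difference between the first case 2 entry and the case 3 entry is the source unichar: two code points in one box versus a single code point.
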
What is the expected output? What do you see instead?
For case 2 above, it did not turn the long-s+h into an "sh" in my output. It 
did turn every other long-s or long-s based ligature into an "s" or an "s"+some 
other character, depending. All the other ligatures were defined as single 
characters in my boxfiles.

For case 3, the unicharambigs successfully turned the long-s+h ligature into an 
"sh".

What version of the product are you using? On what operating system?
Version 3.02 on Mac OS X 10.8.3.

Please provide any additional information below.
I'm attaching a tif image and boxfile for a page image that contains two 
long-s+h ligatures as well as other long-s and long-s based ligatures. I'm also 
attaching my unicharambigs file.

Original issue reported on code.google.com by matt.chr...@gmail.com on 10 May 2013 at 9:01

GoogleCodeExporter commented 9 years ago
Can you give me an example unicharambigs line for case 2? Was it something like 
this:

1   ſh 2   s h 1

Am I correct in thinking that case 3 worked as you expected? So using one 
codepoint works fine, the issue is when you're using two. Just want to check 
I'm understanding correctly.
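
To help narrow this down, here is a rough sketch for listing the box labels that consist of more than one code point. It assumes each boxfile line has six space-separated fields (label, left, bottom, right, top, page); the function name `multi_codepoint_labels` and the sample coordinates are made up for the example.

```python
# Sketch only: flag boxfile labels made of more than one Unicode code point.
# Assumed line layout: <label> <left> <bottom> <right> <top> <page>


def multi_codepoint_labels(box_lines):
    """Return (label, [code points]) for every label longer than one code point."""
    flagged = []
    for line in box_lines:
        parts = line.rstrip("\n").rsplit(" ", 5)
        if len(parts) != 6:
            continue  # skip lines that do not match the assumed layout
        label = parts[0]
        if len(label) > 1:
            flagged.append((label, [f"U+{ord(c):04X}" for c in label]))
    return flagged


# Hypothetical sample lines; the coordinates are invented.
sample = [
    "\u017fh 100 200 140 260 0",   # case 2: ſ + h in one box
    "\ueab1 300 200 340 260 0",    # case 3: single MUFI code point
    "s 400 200 420 240 0",
]

flagged = multi_codepoint_labels(sample)
print(flagged)
```

Only the first sample line is reported, since its label is the two-code-point sequence U+017F U+0068.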

Original comment by nick.wh...@durham.ac.uk on 11 Dec 2013 at 2:50