jacklicn / tesseract-ocr

Automatically exported from code.google.com/p/tesseract-ocr

unicharambigs seems to not work. #542

Open GoogleCodeExporter opened 9 years ago

GoogleCodeExporter commented 9 years ago
I created a trained language for Tesseract (3.01) for the specific purpose of 
having the highest accuracy possible for a single font. The accuracy is 
certainly better (~75% accuracy vs ~25% with standard training, dictionary words) 
but still not good enough. The biggest issue seems to be that the rules in 
the unicharambigs file have no effect unless you tell Tesseract to always 
make the replacement (a 1 in the last column). 

What is the expected output? What do you see instead?

Expected output for this example is the string "tactless" which is one of the 
words defined in the .traineddata's dictionary. What I am seeing is "tedleSS" 
which is not a dictionary word. I would expect that Tesseract would correctly 
use unicharambigs to change the first 'e' to an 'a' and 'd' to 'ct' since the 
following rules are in the config file.

2   c t 1   d   0
1   e   1   a   0

What version of the product are you using? On what operating system?
3.01, Windows and Linux

Please provide any additional information below.
Attached image of OCR text. Note that standard training also gets this wrong, 
with the result "taciless".

Original issue reported on code.google.com by baraba...@gmail.com on 30 Aug 2011 at 9:05


GoogleCodeExporter commented 9 years ago
I made a typo in the example. The actual file does have the correct 
line:

1   d   2   c t 0
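
For readers checking their own rules, here is a minimal Python sketch of how a line like the one above breaks down. It assumes the layout described in the Tesseract training documentation: a source character count, the source characters, a target character count, the target characters, then a type flag (0 = optional, 1 = mandatory). The names `AmbigRule` and `parse_ambig_line` are made up for this illustration and are not part of Tesseract.

```python
from typing import NamedTuple, Tuple


class AmbigRule(NamedTuple):
    """One unicharambigs rule (hypothetical helper type, not a Tesseract API)."""
    source: Tuple[str, ...]   # characters as recognized
    target: Tuple[str, ...]   # characters to substitute
    mandatory: bool           # True when the last column is 1


def parse_ambig_line(line: str) -> AmbigRule:
    """Split one rule line into its parts, raising ValueError if the counts do not match."""
    fields = line.split()  # accepts tabs or runs of spaces between fields
    if len(fields) < 5:
        raise ValueError(f"too few fields: {line!r}")
    n_src = int(fields[0])
    if n_src < 1 or len(fields) < n_src + 4:
        raise ValueError(f"bad source count: {line!r}")
    source = tuple(fields[1:1 + n_src])
    n_dst = int(fields[1 + n_src])
    start = 2 + n_src
    target = tuple(fields[start:start + n_dst])
    rest = fields[start + n_dst:]
    # Exactly one field (the type flag) must remain after the target characters.
    if n_dst < 1 or len(target) != n_dst or len(rest) != 1 or rest[0] not in ("0", "1"):
        raise ValueError(f"counts do not match fields: {line!r}")
    return AmbigRule(source, target, rest[0] == "1")


rule = parse_ambig_line("1   d   2   c t 0")
print(rule)
```

With this reading, the corrected line says: replace a recognized 'd' with 'c' followed by 't', as an optional (type 0) substitution. The mistyped line from the first post parses too, but in the opposite direction.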

Original comment by baraba...@gmail.com on 31 Aug 2011 at 6:33

GoogleCodeExporter commented 9 years ago
bump o.o

Original comment by kopanda0...@gmail.com on 23 Jun 2012 at 1:45

GoogleCodeExporter commented 9 years ago
Issue 719 has been merged into this issue.

Original comment by zde...@gmail.com on 21 Jul 2012 at 3:49

GoogleCodeExporter commented 9 years ago
Have the same problem. With the addition: I am sure that I have included the 
unicharambigs correctly, because replacements work when set to mandatory (type 
1) - but that's not the desired solution, of course.

Original comment by martin.s...@illusion-factory.de on 23 Dec 2012 at 9:24

GoogleCodeExporter commented 9 years ago
I am still having the same issues as before. Tesseract should compare the 
output to the dawg files to get rid of extra spacing in the middle of words and 
put a space when two words run together. It should look for optional 
substitutions as well. Dawg files do not have the desired effect, please fix.

Original comment by mattt...@gmail.com on 3 Jan 2013 at 7:54

GoogleCodeExporter commented 9 years ago
I am also having the same issue.
Has anyone fixed it?

Original comment by dharmend...@gmail.com on 5 Mar 2013 at 11:08

GoogleCodeExporter commented 9 years ago
Same here. I added a couple of rules to eng.unicharambigs and made sure it's 
combined correctly. It still works only if I force the substitution by setting the 
last column to "1".

Original comment by remon.sh...@gmail.com on 22 Oct 2013 at 11:34

GoogleCodeExporter commented 9 years ago
The optional substitutions only work well when the supporting files are 
present, such as the dictionary, the frequently used words list, etc.
I had to use the 1 because I didn't have these files prepared.

Original comment by boydtw...@gmail.com on 22 Oct 2013 at 9:01

GoogleCodeExporter commented 9 years ago
There seem to be two issues at hand here.

First, there's the issue of "type 0" (optional) ambigs, which seem to be ignored.
But as has been pointed out, these are likely working as intended and are 
simply not being selected because they're deemed unlikely.

Second, there currently appears to be a bug involving multi-char ambigs.
I'll leave out the messier details, but the gist is that such rules 
silently fail to parse and are therefore ignored at runtime.

I've created a patch that should take care of this and make both types of 
multi-char ambig rule parse and load successfully.

If you want to check and verify this bug and its patch, try running tesseract with 
a config file that includes "ambigs_debug_level 3".
You should see which lines load and which don't, the latter with a message 
along the lines of "Illegal unichar ...".
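
As a rough sketch of that check, the Python below produces the config text and the matching command line. It assumes the usual `tesseract imagename outputbase [-l lang] [configfile...]` invocation. The file names, the language name `myfont`, and both helper functions are placeholders for illustration only.

```python
from typing import List


def ambigs_debug_config(level: int = 3) -> str:
    """Return the contents of a config file that turns on ambigs debug output."""
    return f"ambigs_debug_level {level}\n"


def build_tesseract_command(image: str, output_base: str,
                            lang: str, config_path: str) -> List[str]:
    """Build (but do not run) a tesseract command that loads the config file."""
    return ["tesseract", image, output_base, "-l", lang, config_path]


# Placeholder names; substitute your own image, language and config path.
print(ambigs_debug_config(), end="")
print(" ".join(build_tesseract_command("word.png", "out", "myfont", "ambigs_debug.cfg")))
```

Save the first printed line to the config file, then run the printed command. Rules that fail to load should show up in the debug output as described above.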

Original comment by clements...@gmail.com on 4 Jan 2014 at 3:16


GoogleCodeExporter commented 9 years ago
Is this patch included in the latest source in git?

Original comment by shreeshrii on 16 Oct 2014 at 2:48