Trying to recognize a language with just real numbers and the four basic operators ( + - * / )

What steps will reproduce the problem?
1. Use Tesseract 3.01 (3.00 should behave the same)
2. Run mftraining and cntraining on the two attached files
3. Try to recognize any image with a small expression like: 32.453 - 67.3266

What is the expected output? What do you see instead?
The expected output is the correct recognition of the expression.
Instead, Tesseract systematically confuses the . with - or 0. And since we are 
reading math expressions (even simple ones), a dictionary does not help much: 
32.4 + 5.6 and 32-4 + 506 are both valid expressions.
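
To illustrate why a dictionary or grammar check cannot resolve this confusion, here is a minimal sketch of a validator for this kind of expression. The grammar (unsigned real numbers joined by + - * /) and the name `is_valid_expression` are assumptions made for illustration; they are not part of Tesseract.

```python
import re

# Assumed grammar: unsigned real numbers separated by one of + - * /
NUMBER = r"\d+(?:\.\d+)?"
EXPRESSION = re.compile(rf"\s*{NUMBER}(?:\s*[-+*/]\s*{NUMBER})*\s*")


def is_valid_expression(text: str) -> bool:
    """Return True if the whole string matches the assumed grammar."""
    return EXPRESSION.fullmatch(text) is not None


# The correct reading and the misread one are both accepted,
# so validity alone cannot tell them apart.
for candidate in ["32.4 + 5.6", "32-4 + 506"]:
    print(candidate, is_valid_expression(candidate))
```

Both strings pass, so the distinction between . and - or 0 has to come from the character classifier itself.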

What version of the product are you using? On what operating system?
Tesseract 3.01
Ubuntu 11.04 64bit

Please provide any additional information below.

This problem shows up even with simple expressions (only + - / *). However, if 
we include operators like log() or sin(), the problem becomes even more evident 
because of the letters involved (although in this case a dictionary would help 
to recognize log() or sin()). 
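
As a rough sketch of how a dictionary could help with function names, a misread token can be snapped to the closest known name. This uses Python's standard `difflib` as a post-processing step outside Tesseract; the list `KNOWN_FUNCTIONS` and the helper `snap_function_name` are hypothetical names chosen for illustration.

```python
import difflib

# Function names mentioned above; extend as needed (assumed list).
KNOWN_FUNCTIONS = ["log", "sin"]


def snap_function_name(token: str) -> str:
    """Replace a token with the closest known function name, if any is close enough."""
    matches = difflib.get_close_matches(token, KNOWN_FUNCTIONS, n=1, cutoff=0.6)
    return matches[0] if matches else token


# "1og" and "s1n" are typical letter/digit confusions.
for token in ["1og", "s1n", "42"]:
    print(token, "->", snap_function_name(token))
```

This kind of correction works for a closed set of names, but it does nothing for the numbers themselves, where any digit sequence is legitimate.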

Moreover, we would eventually like to include currency symbols (like $) and 
measurement units (like km, m, kg, ").

I know that Tesseract is optimized for English, and after that for languages 
whose structure differs from that of math expressions (leaving aside multi-line 
operators and radicals, which I know are even more complex; there are few 
commercial OCRs that deal with them). 

But could you give me a methodology for building the sample images for the 
expressions? I followed the guidelines in the Tesseract documentation, but for 
math expressions I'm definitely missing something.
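
For illustration, the text for such sample images could be generated along these lines. This is only a sketch of one possible approach, not a method prescribed by the Tesseract documentation; the function names, the number ranges, and the 70% share of decimal numbers are all assumptions.

```python
import random

OPERATORS = "+-*/"
DIGITS = "0123456789"


def random_number(rng: random.Random) -> str:
    """Build an unsigned number, usually with a fractional part so '.' is well represented."""
    integer_part = str(rng.randint(0, 9999))
    if rng.random() < 0.7:  # assumed share of decimal numbers
        fraction = "".join(rng.choice(DIGITS) for _ in range(rng.randint(1, 4)))
        return f"{integer_part}.{fraction}"
    return integer_part


def random_expression(rng: random.Random, max_terms: int = 3) -> str:
    """Join two or more numbers with randomly chosen operators."""
    terms = [random_number(rng)]
    for _ in range(rng.randint(1, max_terms - 1)):
        terms.append(rng.choice(OPERATORS))
        terms.append(random_number(rng))
    return " ".join(terms)


def training_lines(count: int, seed: int = 0) -> list:
    """Return `count` expression lines; a fixed seed keeps the output reproducible."""
    rng = random.Random(seed)
    return [random_expression(rng) for _ in range(count)]


for line in training_lines(5):
    print(line)
```

The generated lines would then be rendered to images in the fonts of interest, so that the characters appear in the same context as in the real input.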

Thanks a lot in advance for your help,

Original issue reported on code.google.com by luis...@gmail.com on 25 Sep 2011 at 10:06

Attachments:
