patcharats / tesseract-ocr

Automatically exported from code.google.com/p/tesseract-ocr

Unable to load unicharset file #49

Closed GoogleCodeExporter closed 9 years ago

GoogleCodeExporter commented 9 years ago
What version of the product are you using? On what operating system?

Tesseract 2.00 on Windows.

Please provide any additional information below.

Run: tesseract.exe tithe.tif tithe.hope -l enm
The log file reads: Unable to load unicharset file C:/tesseract-2.00/tessdata/enm.unicharset

I looked at the file eng.unicharset with a hex editor. The file does not contain 
the MS-DOS carriage return byte 0d, only the line feed 0a (as on Unix?). Nor does it 
contain the 3-byte marker that I think indicates the UTF-8 file format.

So I used the hex editor to make my enm.unicharset look like eng.unicharset.
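
The same check can be done without a hex editor. The following is a minimal Python sketch, not part of Tesseract; the function name `describe_encoding` and the sample bytes are made up for illustration. It assumes the "3-byte marker" mentioned above is the UTF-8 byte order mark (EF BB BF).

```python
UTF8_BOM = b"\xef\xbb\xbf"  # byte order mark some editors prepend to UTF-8 files


def describe_encoding(data: bytes) -> dict:
    """Report BOM presence and line-ending counts for raw file bytes."""
    crlf = data.count(b"\r\n")           # MS-DOS style endings: 0d 0a
    lf_only = data.count(b"\n") - crlf   # Unix style endings: 0a alone
    return {
        "has_utf8_bom": data.startswith(UTF8_BOM),
        "crlf_lines": crlf,
        "lf_only_lines": lf_only,
    }


# Made-up sample bytes, not a real unicharset file.
sample = UTF8_BOM + b"3\r\nNULL 0\r\na 3\n"
report = describe_encoding(sample)
print(report)
```

For a real file, read it in binary mode (`open(path, "rb").read()`) and pass the bytes to `describe_encoding`, so that Python does not translate the line endings first.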

Now get log file message:
Error: 32 classes in inttemp while unicharset contains 35 unichars.

Please convert 0001.jpg to tif for full image

Original issue reported on code.google.com by beaumon...@gmail.com on 2 Aug 2007 at 4:58

GoogleCodeExporter commented 9 years ago
I'm not sure how UTF-8 works, so the above comment may be wrong.
I have changed the unicharset file so much that I'm not sure which version is the 
one above. So here is the file output by unicharset_extractor.

Original comment by beaumon...@gmail.com on 2 Aug 2007 at 5:34

GoogleCodeExporter commented 9 years ago
I have now tried to run tesseract with the original output file from 
unicharset_extractor, without editing. This file contains 0d 0a as line endings.
Log: "Error: 32 classes in inttemp while unicharset contains 37 unichars."
After using the hex editor to make some changes, the log reads "Unable to load 
unicharset file".

Original comment by beaumon...@gmail.com on 2 Aug 2007 at 5:48

GoogleCodeExporter commented 9 years ago
I went through the whole procedure again this morning.
1 Run tesseract with the original output file from unicharset_extractor, without editing.
  Log file: "Error: 32 classes in inttemp while unicharset contains 37 unichars."
2 Edit unicharset with Notepad. Change the first character line from "NULL b" to "NULL 0".
  Log file: Unable to load unicharset file C:/tesseract-2.00/tessdata/enm.unicharset.

Is it the software or is it my box files? I am now thinking of simplifying the box 
files to reduce the "APPLY_BOXES: FATALITY" messages. What do these mean? Presumably 
that Tesseract is unable to interpret a character in a box. But are these messages 
fatal for the end result? Does the box file have to be changed?

There is also the problem of how to edit unicharset in MS Windows.
Should this file be saved in UTF-8 format?
Interesting! When I save the file as ANSI, Tess goes back to the error message 
from step 1.
So it's not expecting a UTF-8 file?
Where are you Ray? Please reply soon!! I'm DESPERATE.
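
One possible explanation, offered as an assumption rather than something confirmed in this thread, is that Notepad's "UTF-8" save option prepends a byte order mark, which the ANSI save does not. A minimal Python sketch for removing that marker (and, optionally, converting 0d 0a endings to 0a so the file resembles eng.unicharset) might look like this; `to_plain_lf` is a hypothetical helper name, and whether Tesseract 2.00 needs either change is not established here.

```python
import codecs


def to_plain_lf(data: bytes) -> bytes:
    """Strip a leading UTF-8 BOM and convert CRLF line endings to LF."""
    if data.startswith(codecs.BOM_UTF8):
        data = data[len(codecs.BOM_UTF8):]
    return data.replace(b"\r\n", b"\n")


# Made-up sample bytes standing in for a file saved by Notepad as UTF-8.
saved_by_notepad = codecs.BOM_UTF8 + b"2\r\nNULL 0\r\n"
cleaned = to_plain_lf(saved_by_notepad)
print(cleaned)
```

Working on bytes rather than decoded text leaves every other byte of the file untouched, so any multi-byte characters in the entries pass through unchanged.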

Original comment by beaumon...@gmail.com on 3 Aug 2007 at 9:45

GoogleCodeExporter commented 9 years ago
OK,
All FATALs removed. Now have:
"Error: 30 classes in inttemp while unicharset contains 32 unichars."
How do I find these 2 "missing" characters?

Original comment by beaumon...@gmail.com on 3 Aug 2007 at 10:41

GoogleCodeExporter commented 9 years ago
Hurray Hurray !!
By extracting the characters from normproto and sorting them, then sorting 
unicharset, I discovered the 2 problem characters. They appear in unicharset as 
"xc h", where x is a non-printing character, c is the character and h is a hex value. 
Delete these 2, change the count, save as ANSI, and thank God it finally works. 
It only took me 24 hours to get to this!!
Now there is only rubbish coming out with 3 training files, so I can return to 
training. Hope springs eternal!! Though the worry is that my image might be 
non-trainable.
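
The comparison described above can be sketched in Python. This is an illustration only: it assumes the unicharset layout implied by this thread (a count on the first line, then one "character hex" entry per line), and it takes the set of trained characters as an input because the normproto format is not described here. The names `parse_unicharset` and `find_suspect_entries` are hypothetical.

```python
def parse_unicharset(text: str):
    """Return (declared_count, entries); each entry is (unichar, properties)."""
    # Split on "\n" only: str.splitlines() would also break on some control
    # characters, which are exactly what we are hunting for.
    lines = [ln.rstrip("\r") for ln in text.split("\n")]
    lines = [ln for ln in lines if ln]
    declared = int(lines[0])
    entries = []
    for ln in lines[1:]:
        unichar, _, props = ln.rpartition(" ")
        entries.append((unichar, props))
    return declared, entries


def find_suspect_entries(entries, trained_chars):
    """Flag entries with non-printing characters or with no trained counterpart."""
    flagged = []
    for unichar, props in entries:
        has_control = any(not ch.isprintable() for ch in unichar)
        if has_control or unichar not in trained_chars:
            flagged.append((unichar, props))
    return flagged


# Made-up sample data; "NULL" is placed in the trained set only for this demo.
sample = "4\nNULL 0\na 3\n\x01b 3\nb 3\n"
declared, entries = parse_unicharset(sample)
suspects = find_suspect_entries(entries, {"NULL", "a", "b"})
print(declared, len(entries), suspects)
```

When reading a real file, a byte-preserving decoding such as `open(path, encoding="latin-1", newline="")` keeps stray control bytes visible instead of raising a decode error. After deleting the flagged entries, the count on the first line has to be lowered to match, as described above.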

Original comment by beaumon...@gmail.com on 3 Aug 2007 at 11:45

GoogleCodeExporter commented 9 years ago
Fixed, or at least greatly improved, in 2.01.
As long as you look for errors from applybox (running with box.train), which is 
now documented in the wiki, everything should go more smoothly...

Original comment by theraysm...@gmail.com on 30 Aug 2007 at 7:55