Closed GoogleCodeExporter closed 9 years ago
Not sure how UTF-8 works, so the above comment may be wrong.
I have changed the unicharset file so much that I'm not sure which version is above. So here is the file output by unicharset_extractor.
Original comment by beaumon...@gmail.com
on 2 Aug 2007 at 5:34
Now tried to run tesseract with the original output file from unicharset_extractor, without editing. This contains 0d 0a (CR LF) as line endings.
Log: "Error: 32 classes in inttemp while unicharset contains 37 unichars."
After using a hex editor to make some changes, the log says: "Unable to load unicharset file"
Original comment by beaumon...@gmail.com
on 2 Aug 2007 at 5:48
Went through the whole procedure again this morning.
1. Ran tesseract with the original output file from unicharset_extractor, without editing.
Log file: "Error: 32 classes in inttemp while unicharset contains 37 unichars."
2. Edited unicharset with Notepad, changing the first character line from "NULL b" to "NULL 0".
Log file: "Unable to load unicharset file C:/tesseract-2.00/tessdata/enm.unicharset."
Is it the software or is it my box files? Now thinking of simplifying the box files to reduce the "APPLY_BOXES: FATALITY" messages. What do these mean? Presumably that the character in the box could not be interpreted. But are these messages fatal for the end result? Does the box file have to be changed?
There is also the problem of how to edit unicharset in MS Windows. Should this file be saved in UTF-8 format?
Interesting! When I save the file as ANSI, Tess goes back to the error message at step 1. So it's not expecting a UTF-8 file?
Where are you Ray? Please reply soon!! I'm DESPERATE.
Original comment by beaumon...@gmail.com
on 3 Aug 2007 at 9:45
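For the UTF-8 versus ANSI and 0d 0a questions above, a minimal Python sketch for inspecting the raw bytes of the file may help. It only reports what is in the file. Whether Tesseract 2.00 accepts a byte-order mark or CR LF line endings is not established in this thread, and the function name `describe_bytes` and the sample bytes are hypothetical.

```python
def describe_bytes(data: bytes) -> dict:
    """Summarise encoding-related features of raw file content.

    Hypothetical helper, not part of Tesseract.
    """
    crlf = data.count(b"\r\n")
    return {
        # UTF-8 byte-order mark (EF BB BF) at the very start of the file
        "utf8_bom": data.startswith(b"\xef\xbb\xbf"),
        # Windows-style line endings (0d 0a)
        "crlf": crlf,
        # Unix-style line endings (0a not preceded by 0d)
        "lf_only": data.count(b"\n") - crlf,
        # Bytes outside the 7-bit ASCII range
        "non_ascii": sum(1 for b in data if b > 0x7F),
    }


# Made-up sample: a BOM, two CR LF endings and one bare LF.
sample = b"\xef\xbb\xbf3\r\nNULL 0\r\na 3\n"
report = describe_bytes(sample)
for key, value in report.items():
    print(key, value)
```

To check a real file, read it in binary mode, for example `describe_bytes(open(path, "rb").read())`, with `path` pointing at the unicharset file. A `utf8_bom` of `True` after saving from Notepad as UTF-8 would be one candidate explanation for the load failure, but that is an assumption to verify, not a confirmed cause.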
OK,
All FATALs removed. Now have:
"Error: 30 classes in inttemp while unicharset contains 32 unichars."
How do I find these 2 "missing" characters?
Original comment by beaumon...@gmail.com
on 3 Aug 2007 at 10:41
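One way to narrow down which entries differ is a plain set comparison between the labels in unicharset and the labels found in the trained files. The sketch below assumes both lists of labels have already been extracted by hand or by some other script; it does not parse any Tesseract file format, and `label_differences` is a hypothetical name.

```python
def label_differences(unicharset_labels, trained_labels):
    """Return (only_in_unicharset, only_in_trained), each sorted.

    Both arguments are plain lists of strings; how they were extracted
    from the Tesseract files is outside the scope of this sketch.
    """
    in_unicharset = set(unicharset_labels)
    in_trained = set(trained_labels)
    return sorted(in_unicharset - in_trained), sorted(in_trained - in_unicharset)


# Made-up labels for illustration only.
extra, missing = label_differences(["a", "b", "c", "d"], ["a", "b"])
print("only in unicharset:", extra)
print("only in trained data:", missing)
```

Labels that appear only on the unicharset side are the candidates for the "extra" unichars reported in the error message.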
Hurray Hurray !!
By extracting the characters from normproto and sorting them, then sorting unicharset, I discovered the 2 problem characters. They are in unicharset as "xc h", where x is a non-printing char, c is the char and h is hex. Delete these 2, change the count, save as ANSI, and thank God it finally works. Only took me 24 hours to get to this!!
Now only rubbish is coming out with 3 training files. Can now return to training. Hope springs eternal!! Though the worry is that my image might be non-trainable.
Original comment by beaumon...@gmail.com
on 3 Aug 2007 at 11:45
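The manual search described above (entries that start with a non-printing character, plus a count that has to be corrected) can be sketched in Python. This assumes only what the thread states about the file: the first line holds the entry count and each following line holds one entry. The function name `audit_unicharset` and the sample property values are hypothetical.

```python
import unicodedata


def audit_unicharset(text: str):
    """Return (declared_count, actual_count, suspect_lines).

    A line is flagged as suspect when it contains any character in a
    Unicode "C" category (control, format, unassigned, private use).
    """
    # Split on "\n" only: str.splitlines() would also break on some
    # control characters, which are exactly what we are looking for.
    lines = [ln.rstrip("\r") for ln in text.split("\n")]
    declared = int(lines[0].strip())
    entries = [ln for ln in lines[1:] if ln != ""]
    suspects = [
        ln for ln in entries
        if any(unicodedata.category(ch).startswith("C") for ch in ln)
    ]
    return declared, len(entries), suspects


# Made-up content: the third entry starts with the control byte 0x01.
sample = "4\nNULL 0\na 3\n\x01b 62\nc 3\n"
declared, actual, suspects = audit_unicharset(sample)
print("declared:", declared, "actual:", actual)
for line in suspects:
    print("suspect entry:", repr(line))
```

Printing with `repr` makes the non-printing character visible as an escape such as `\x01`, which is hard to spot in Notepad. After deleting flagged entries, the count on the first line needs to be lowered to match, as described in the comment above.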
Fixed, or at least greatly improved, in 2.01.
As long as you look for errors from applybox (running with box.train), now documented in the wiki, everything should be smoother...
Original comment by theraysm...@gmail.com
on 30 Aug 2007 at 7:55
Original issue reported on code.google.com by
beaumon...@gmail.com
on 2 Aug 2007 at 4:58