mithilesh1125 / tesseract-ocr

Automatically exported from code.google.com/p/tesseract-ocr

Error: Size of unicharset is greater than MAX_NUM_CLASSES when running trained font #670

Closed: GoogleCodeExporter closed this issue 9 years ago

GoogleCodeExporter commented 9 years ago
Hello,
I am trying to train the MS Mincho font (~14k characters) so that Tesseract 
recognizes only that Japanese font.
Training completes through all the steps, but when I try to run the result I 
get the following error:
$ tesseract myjap.mincho.exp0.tif output -l myjap
Error: Size of unicharset is greater than MAX_NUM_CLASSES
Failed loading language 'myjap'
Tesseract couldn't load any languages!
Could not initialize tesseract.

I am using the latest svn revision: 715

I don't know if it's related, but the "shapetable" contains only about one 
third of the shapes.

I'm attaching the tif file and the box file (compressed in train.zip).
font_properties content: mincho 0 0 0 0 0
The list of commands executed to train is:
tesseract myjap.mincho.exp0.tif myjap.mincho.exp0 nobatch box.train
unicharset_extractor myjap.mincho.exp0.box
shapeclustering -F font_properties -U unicharset myjap.mincho.exp0.tr
mftraining -F font_properties -U unicharset -O myjap.unicharset myjap.mincho.exp0.tr
cntraining myjap.mincho.exp0.tr
combine_tessdata myjap.
tesseract myjap.mincho.exp0.tif output -l myjap

If you need any other information, please let me know.
Thanks

Original issue reported on code.google.com by andy.bia...@gmail.com on 30 Mar 2012 at 11:03

GoogleCodeExporter commented 9 years ago
Try the solution in Issue 743 ("mftraining segmentation fault with large 
13,000+ character set").

Out of curiosity, did you hit any other problems with the large number of 
characters?

Original comment by whoister...@gmail.com on 15 Aug 2012 at 2:19

GoogleCodeExporter commented 9 years ago
Just increase MAX_NUM_CLASSES in the source and rebuild Tesseract for your own use.

I don't want to do this for everyone, as it increases memory requirements.
Hopefully, at some point in the future it will become dynamic and "unlimited", 
but not in 3.02 or 3.03.

Original comment by theraysm...@gmail.com on 21 Sep 2012 at 12:24
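
Before rebuilding, it can help to confirm how far over the limit the trained character set actually is. The first line of a unicharset file holds the number of entries, so the count is easy to read. The following is a minimal sketch, not part of Tesseract: check_unicharset_size is a hypothetical helper name, and the limit passed to it is an assumption that you need to read from your own checkout (for example with grep -rn "define MAX_NUM_CLASSES" in the source tree).

```shell
#!/bin/sh
# Hypothetical helper (not a Tesseract tool): compare the number of entries
# in a unicharset file against a class limit supplied by the caller.
# Usage: check_unicharset_size <unicharset-file> <limit>
check_unicharset_size() {
    unicharset_file=$1
    limit=$2   # assumed to be the MAX_NUM_CLASSES value of your build

    # The first line of a unicharset file is the entry count.
    count=$(head -n 1 "$unicharset_file" | tr -d '[:space:]')

    if [ "$count" -gt "$limit" ]; then
        echo "too large: $count entries > limit $limit"
        return 1
    fi
    echo "ok: $count entries <= limit $limit"
}
```

Called as check_unicharset_size myjap.unicharset LIMIT, with LIMIT replaced by the value found in your source tree, this reports whether the build needs a larger MAX_NUM_CLASSES before the traineddata can load.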