Closed GoogleCodeExporter closed 9 years ago
I have the same issue with Tesseract 3.02
Original comment by andy.bia...@gmail.com
on 6 Apr 2012 at 12:25
The problem is that you are not following the instructions:
http://code.google.com/p/tesseract-ocr/wiki/TrainingTesseract3#Generate_Training_Images :
* Make sure there are a minimum number of samples of each character. 10 is
good, but 5 is OK for rare characters.
* There should be more samples of the more frequent characters - at least 20.
* Make the text more realistic.
My experience: if I see "no protos/configs for xyz", it means there are not
enough examples of xyz in the input image/box file.
Also, using images with a DPI below 200 is not recommended (avaya.avaya.exp0.tif
is 96 DPI), so I suggest fixing the input image.
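To check the sample counts before training, something like the following could be run against a box file. This is only a minimal sketch: it assumes the Tesseract 3 box layout of `<char> <left> <bottom> <right> <top> <page>` per line, and the function names `count_box_samples` and `undersampled` are made up for illustration.

```python
from collections import Counter


def count_box_samples(box_text):
    """Count how many boxes each character has in box-file text."""
    counts = Counter()
    for line in box_text.splitlines():
        fields = line.split()
        # Assumed layout: <char> <left> <bottom> <right> <top> [<page>]
        if len(fields) >= 5:
            counts[fields[0]] += 1
    return counts


def undersampled(counts, minimum=5):
    """Return the characters that appear fewer than `minimum` times, sorted."""
    return sorted(ch for ch, n in counts.items() if n < minimum)


# Tiny made-up box data: two samples of "A", one of "B".
sample = "A 10 10 20 30 0\nA 25 10 35 30 0\nB 40 10 50 30 0\n"
print(undersampled(count_box_samples(sample), minimum=2))
```

Any character it lists is a likely candidate for a "no protos/configs" warning, going by the experience described above.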
Original comment by zde...@gmail.com
on 6 Apr 2012 at 9:37
I'm trying to train Tesseract on the licence plate numbers of our country. I've
got the official font for all the characters. I also have a problem like the one
described in Issue 557. But is it truly necessary and reasonable to "Make sure
there are a minimum number of samples of each character"? In my case I just have
3 tif files at hand and that's all I need. I'm frustrated at training such a
tiny language.
See the attachments for my 3 tif files and the box files I made.
Thanks.
Original comment by xyxzfj@gmail.com
on 16 May 2012 at 2:01
Oh my! I've finally figured it out!
I didn't know why smr's problem (comment 1), and also my own previous problem,
"Warning: no protos/configs for ' in CreateIntTemplates()
Warning: no protos/configs for : in CreateIntTemplates()
Error: no configs for class ' in mftraining
Error: no configs for class : in mftraining"
occurred, and in comment 3 I doubted the reason zde gave in comment 2.
In my previous trial, I used 3 separate tif files for digits, letters and
Chinese characters. And I didn't follow the [lang].[fontname].exp[num] rule
since I thought the bracketed parts were optional; I was using cnlp.exp0,
cnlp.exp1, cnlp.exp2. And a problem like the one in comment 1 occurred.
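A quick way to catch this kind of naming mistake is to test each file name against the [lang].[fontname].exp[num] pattern. The sketch below is only an illustration: the regular expression is my own reading of that rule (no dots inside lang or fontname, an optional extension at the end), and `parse_training_name` is a hypothetical helper, not part of Tesseract.

```python
import re

# Assumed reading of the [lang].[fontname].exp[num] rule, with an optional
# extension such as .tif, .box or .tr at the end.
TRAINING_NAME = re.compile(
    r"^(?P<lang>[^.]+)\.(?P<font>[^.]+)\.exp(?P<num>\d+)(\.\w+)?$"
)


def parse_training_name(filename):
    """Return (lang, fontname, num) if the name follows the rule, else None."""
    match = TRAINING_NAME.match(filename)
    if match is None:
        return None
    return match.group("lang"), match.group("font"), int(match.group("num"))


for name in ["cnlp.lpft.exp10.tif", "cnlp.exp0.tif"]:
    print(name, "->", parse_training_name(name))
```

Names without the fontname part, like cnlp.exp0, come back as None.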
Now I have merged the 3 tif files into one, cnlp.lpft.exp10.tif, and did the
following:
Make Box Files:
tesseract cnlp.lpft.exp09.tif cnlp.lpft.exp09 batch.nochop makebox
(Here I used exp09.tif instead of exp10.tif in order to avoid getting a bad
box file that treats parts of some of my characters as independent characters,
since lots of my characters are made up of isolated radicals like
艹、亠、一.)
Run Tesseract for Training:
tesseract cnlp.lpft.exp10.tif cnlp.lpft.exp10 nobatch box.train
Compute the Character Set:
unicharset_extractor cnlp.lpft.exp10.box
font_properties:(content of the file "font_properties": lpft 0 0 1 0 0)
mftraining -F font_properties -U unicharset cnlp.lpft.exp10.tr
Clustering:
mftraining -F font_properties -U unicharset -O cnlp.unicharset cnlp.lpft.exp10.tr
cntraining cnlp.lpft.exp10.tr
Clustering:(empty)
The last file (unicharambigs):(none)
Putting it all together:(I've added prefix "cnl." to normproto, Microfeat,
inttemp, pffmtable and unicharset)
combine_tessdata cnl.
TEST:
I used the cnl.traineddata to test my cnlp.lpft.exp10.tif:
tesseract cnlp.lpft.exp10.tif cnlp.txt -l cnl
RESULT:
Tesseract Open Source OCR Engine v3.01 with Leptonica
TIFFReadDirectory: Warning, TIFFstream: invalid TIFF directory; tags are not sorted in ascending order.
TIFFReadDirectory: Warning, TIFFstream: unknown field with tag 20624 (0x5090) encountered.
TIFFReadDirectory: Warning, TIFFstream: unknown field with tag 20625 (0x5091) encountered.
TIFFReadDirectory: Warning, TIFFstream: unknown field with tag 40092 (0x9c9c) encountered.
TIFFReadDirectory: Warning, TIFFstream: invalid TIFF directory; tags are not sorted in ascending order.
TIFFReadDirectory: Warning, TIFFstream: unknown field with tag 20624 (0x5090) encountered.
TIFFReadDirectory: Warning, TIFFstream: unknown field with tag 20625 (0x5091) encountered.
TIFFReadDirectory: Warning, TIFFstream: unknown field with tag 40092 (0x9c9c) encountered.
TIFFReadDirectory: Warning, TIFFstream: invalid TIFF directory; tags are not sorted in ascending order.
TIFFReadDirectory: Warning, TIFFstream: unknown field with tag 20624 (0x5090) encountered.
TIFFReadDirectory: Warning, TIFFstream: unknown field with tag 20625 (0x5091) encountered.
TIFFReadDirectory: Warning, TIFFstream: unknown field with tag 40092 (0x9c9c) encountered.
Page 0
CNLP.TXT:
12345
67890
ABCD
HIJK
OPQR
SLE
TMF
UNG
VWXYZ
京津冀晋蒙辽吉黑沪
苏浙皖闽赣鲁豫鄂湘
粤桂琼渝川贵云藏陕
甘青宁新港澳使领学
The result is good enough for me!
Thank you all!
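For anyone repeating these steps with other names, the command sequence can be generated rather than retyped. This is a rough sketch only: it just builds the command lines listed in this comment and does not run anything. `training_commands` and `files_to_prefix` are hypothetical helpers, and writing the unicharset straight to the final prefix with -O (instead of renaming it afterwards, as done above) is my own simplification.

```python
def training_commands(lang, font, exp, prefix=None):
    """Build the training command lines for [lang].[font].exp[exp].

    `prefix` is the name given to the final traineddata; it defaults to `lang`.
    """
    prefix = prefix or lang
    base = f"{lang}.{font}.exp{exp}"
    return [
        ["tesseract", f"{base}.tif", base, "nobatch", "box.train"],
        ["unicharset_extractor", f"{base}.box"],
        ["mftraining", "-F", "font_properties", "-U", "unicharset",
         "-O", f"{prefix}.unicharset", f"{base}.tr"],
        ["cntraining", f"{base}.tr"],
        # The files from files_to_prefix() must be renamed before this step.
        ["combine_tessdata", f"{prefix}."],
    ]


def files_to_prefix(prefix):
    """Map each output file named in this comment to its prefixed name."""
    names = ["normproto", "Microfeat", "inttemp", "pffmtable"]
    return {name: f"{prefix}.{name}" for name in names}


for cmd in training_commands("cnlp", "lpft", 10, prefix="cnl"):
    print(" ".join(cmd))
```

The box file still has to be made and corrected by hand before the first command.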
Original comment by xyxzfj@gmail.com
on 17 May 2012 at 1:44
I wrote something wrong.
In comment 4, "
Clustering:(empty)
The last file (unicharambigs):(none)
" should be changed to: "
Dictionary Data (Optional):(none)
The last file (unicharambigs):(none)
".
Original comment by xyxzfj@gmail.com
on 17 May 2012 at 2:22
I've also attached my trained data in case someone needs it!
Original comment by xyxzfj@gmail.com
on 20 May 2012 at 10:44
Hi... I'm from Mexico and I have the same problem...
I followed everything step by step but I keep having problems...
If someone can help me, please...
Original comment by ing.raid...@gmail.com
on 24 May 2012 at 5:13
And when I try to use cntraining, the app crashes and I don't know whether the
process has finished...
Original comment by ing.raid...@gmail.com
on 24 May 2012 at 5:18
@ing.raidel.herreraycairo:
1. you are not following the instructions (see comment #2), so your problems
are just your problems
2. you are not providing details (tesseract version, commands used)
3. it looks like you did not read the instructions carefully:
a) the proper command is "mftraining -F font_properties -U unicharset -O mat.unicharset mat.placas.exp0.tr" and I see something else in the screenshot
b) your font_properties file has a BOM, and that is a problem...
4. cntraining will not work if mftraining did not work...
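On point 3b: a UTF-8 BOM is the three bytes EF BB BF at the very start of a file, which some editors add when saving as "UTF-8". A small sketch for detecting and removing it is below; `strip_utf8_bom` is a hypothetical helper and the font_properties line in the example is made-up data.

```python
import codecs


def strip_utf8_bom(data: bytes) -> bytes:
    """Return `data` without a leading UTF-8 byte order mark, if it has one."""
    if data.startswith(codecs.BOM_UTF8):
        return data[len(codecs.BOM_UTF8):]
    return data


# Made-up font_properties content saved with a BOM in front.
raw = codecs.BOM_UTF8 + b"placas 0 0 0 0 0\n"
print(raw.startswith(codecs.BOM_UTF8))
print(strip_utf8_bom(raw))
```

To fix a real file, read it in binary mode, pass the bytes through the function and write the result back, or simply re-save it from the editor as UTF-8 without BOM.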
Original comment by zde...@gmail.com
on 24 May 2012 at 8:23
OK, thanks, I did it and it works well... Excuse my bad English...
I only renamed the unicharset file to mat.unicharset, where mat is my language.
Anyway, thank you very much...
Original comment by ing.raid...@gmail.com
on 27 May 2012 at 2:07
Original comment by zde...@gmail.com
on 21 Jul 2012 at 3:31
Original issue reported on code.google.com by
smr.meor...@gmail.com
on 7 Oct 2011 at 10:47