Error: trying to read a DAWG kan(240 lines).freq-dawg that contains 1714 edges while the maximum is 1500.

GoogleCodeExporter commented 9 years ago

What steps will reproduce the problem?
1. After successfully creating the 8 data files for Kannada (240 lines), I tried to
run tesseract.
2. Instead of generating output.txt, tesseract logged the following error:
"Error: trying to read a DAWG /tessdata/kan.freq-dawg'
that contains 1714 edges while the maximum is 1500."
3.

What is the expected output? What do you see instead?
No output text was generated. The error message went away only after I
deleted all entries in words_list (i.e. left words_list.txt blank).

What version of the product are you using? On what operating system?
Tesseract 2.01 on Windows XP

Please provide any additional information below.
This error defeats the purpose of creating "frequent_words_list" or
"words_list". A solution is requested.

Original issue reported on code.google.com by withbles...@gmail.com on 18 Sep 2007 at 6:35

GoogleCodeExporter commented 9 years ago

Confirming, had the same problem with the Polish words list I'm working on now.

Attaching the source text file (frequent_polish_words_list.txt). The command I
used to build the DAWG was:

wordlist2dawg frequent_polish_words_list.txt freq-dawg

Original comment by aleksand...@gmail.com on 12 Jul 2008 at 11:26

Attachments:

frequent_polish_words_list.txt

GoogleCodeExporter commented 9 years ago

Dawg generation and use is loaded with fixed limit problems. A lot of these
will go away in 3.00. In the meantime, I have updated the FAQ
(http://code.google.com/p/tesseract-ocr/wiki/FAQ) ("wordlist2dawg doesn't
work!") with some more tips on solving the problem.

Original comment by theraysm...@gmail.com on 28 Dec 2008 at 7:20

Changed state: Accepted

GoogleCodeExporter commented 9 years ago

Thanks Ray.

Is there any documentation on the new .traineddata files you're talking about
for v3?

Pierre.

Original comment by hicksc...@gmail.com on 4 Apr 2010 at 11:56

GoogleCodeExporter commented 9 years ago

Hello again,

For people interested in the new undocumented training data format, I've just
tried to understand how it works. I used eng.traineddata and found the
following, which I verified against other training data files.

Header:
Always begins with 0A00 0000 FFFF FFFF FFFF FFFF. Maybe it's a version marker?
Then, the header is composed of offsets (i count 9 of them).
Header addr   Points to...   Remark
@0x000c       Unicharset     In all training data, was 0x0054 since the Unicharset was always the first element after header.
@0x0014       Dang Ambigs
@0x001c       Int Temp
@0x0024       PFFM Table
@0x002c       Norm. Proto
@0x0034       Unknown.
@0x003c       Unknown.
@0x0044       Unknown.
@0x004c       Unknown.
Note that the last 4 blocks had a lot of similarities. I guess those are the
same "kind" of data.
I'll try to write a little packer once I've figured out whether each block
contains relative or absolute offsets. I'll also have to look at the source code
to make sure there is no mistake here.
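
A minimal sketch of how the offsets listed above could be read in Python. It assumes each header entry is an 8-byte little-endian integer, which is only a guess based on the 8-byte spacing between the header addresses and has not been checked against the source code. The function name and the labels for the four unidentified entries are made up for illustration.

```python
import struct

# Labels follow the table above; the last four entries were not identified,
# so their names here are placeholders.
FIELD_NAMES = [
    "Unicharset", "Dang Ambigs", "Int Temp", "PFFM Table", "Norm. Proto",
    "Unknown 1", "Unknown 2", "Unknown 3", "Unknown 4",
]

FIRST_ENTRY_ADDR = 0x0C  # address of the first offset in the table above
ENTRY_SIZE = 8           # assumed: entries sit 8 bytes apart (0x0c, 0x14, ...)

# Sanity check on the observed layout: 9 entries of 8 bytes starting at 0x0c
# end at 0x54, which is where the Unicharset block was seen to begin.
assert FIRST_ENTRY_ADDR + len(FIELD_NAMES) * ENTRY_SIZE == 0x54


def read_header_offsets(data: bytes) -> dict:
    """Return {label: offset} for the 9 header entries described above."""
    offsets = {}
    for index, name in enumerate(FIELD_NAMES):
        addr = FIRST_ENTRY_ADDR + index * ENTRY_SIZE
        # "<q" = little-endian signed 64-bit integer (an assumption, see above)
        (value,) = struct.unpack_from("<q", data, addr)
        offsets[name] = value
    return offsets
```

To try it on a real file, read the file in binary mode and pass the bytes in, e.g. `read_header_offsets(open("eng.traineddata", "rb").read())`. If the guess about the entry size is right, the "Unicharset" value should come out as 0x54.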

Hope that helps,
Pierre.

Original comment by hicksc...@gmail.com on 4 Apr 2010 at 1:00

GoogleCodeExporter commented 9 years ago

Original issue is fixed in 3.00.
Training wiki updated for 3.00 and tessdatamanager/combine_tessdata updated for
better documentation on new traineddata format.

Original comment by theraysm...@gmail.com on 20 May 2010 at 10:59

Changed state: Fixed

patcharats / tesseract-ocr, issue #68