Closed GoogleCodeExporter closed 9 years ago
Confirming, had the same problem with the Polish words list I'm working on now.
Attaching the source text file (frequent_polish_words_list.txt). The command I
used to build the DAWG was:
wordlist2dawg frequent_polish_words_list.txt freq-dawg
Original comment by aleksand...@gmail.com
on 12 Jul 2008 at 11:26
DAWG generation and use are riddled with fixed-limit problems. A lot of these
will go away in 3.00. In the meantime, I have updated the FAQ
(http://code.google.com/p/tesseract-ocr/wiki/FAQ), under "wordlist2dawg doesn't
work!", with some more tips on solving the problem.
Original comment by theraysm...@gmail.com
on 28 Dec 2008 at 7:20
Thanks Ray.
Is there any documentation on the new .traineddata files you're talking about for v3?
Pierre.
Original comment by hicksc...@gmail.com
on 4 Apr 2010 at 11:56
Hello again,
For people interested in the new, undocumented training data format: I've just
tried to understand how it works. I used eng.traineddata and found the
following, which also held for the other traineddata files I checked.
Header:
It always begins with 0A00 0000 FFFF FFFF FFFF FFFF. Maybe it's a version marker?
After that, the header consists of offsets (I count 9 of them).
Header addr   Points to...   Remark
@0x000c       Unicharset     In all training data this was 0x0054, since the
                             Unicharset was always the first element after
                             the header.
@0x0014       Dang Ambigs
@0x001c       Int Temp
@0x0024       PFFM Table
@0x002c       Norm. Proto
@0x0034       Unknown
@0x003c       Unknown
@0x0044       Unknown
@0x004c       Unknown
Note that the last 4 blocks have a lot of similarities. I guess they hold the
same "kind" of data.
I'll try to write a little packer once I have figured out whether each block
contains relative or absolute offsets. I'll also have to look at the source
code to make sure there is no mistake here.
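In case it helps anyone poke at their own files, here is a rough Python sketch of reading the header as laid out above. It only encodes my observations, not anything taken from the Tesseract source: the 12-byte prefix is treated as an opaque marker, and the 9 entries at 0x0c, 0x14, ..., 0x4c are assumed to be 8-byte little-endian integers (the 8-byte spacing suggests the width, the byte order is a guess). The names read_offsets and HEADER_PREFIX are mine.

```python
import struct

# The 12 bytes every file started with in my tests (meaning unknown).
HEADER_PREFIX = bytes.fromhex("0a000000" + "ff" * 8)
FIRST_OFFSET_ADDR = 0x0C   # address of the first offset entry
NUM_OFFSETS = 9            # entries at 0x0c, 0x14, ..., 0x4c
HEADER_SIZE = FIRST_OFFSET_ADDR + 8 * NUM_OFFSETS  # 0x54


def read_offsets(data):
    """Return the 9 offsets from the start of a traineddata blob.

    Assumes 8-byte little-endian entries; this is a guess based on
    the spacing of the header addresses, not on the source code.
    """
    if len(data) < HEADER_SIZE:
        raise ValueError("data is shorter than the 0x54-byte header")
    if data[:FIRST_OFFSET_ADDR] != HEADER_PREFIX:
        raise ValueError("unexpected header prefix")
    return list(struct.unpack_from("<9q", data, FIRST_OFFSET_ADDR))


def read_offsets_from_file(path):
    """Convenience wrapper: read only the header bytes from a file."""
    with open(path, "rb") as f:
        return read_offsets(f.read(HEADER_SIZE))
```

If the guess is right, the first value returned for eng.traineddata should be 0x54 (the Unicharset offset). Whether the remaining values are relative or absolute is still the open question.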
Hope that helps,
Pierre.
Original comment by hicksc...@gmail.com
on 4 Apr 2010 at 1:00
Original issue is fixed in 3.00.
The training wiki has been updated for 3.00, and tessdatamanager/combine_tessdata
have been updated to better document the new traineddata format.
Original comment by theraysm...@gmail.com
on 20 May 2010 at 10:59
Original issue reported on code.google.com by
withbles...@gmail.com
on 18 Sep 2007 at 6:35