Closed GoogleCodeExporter closed 9 years ago
Found the problem: the DangAmbigs file is causing the crash. Without it,
tesseract continues (but creates an almost empty output.txt file when I issue
"tesseract 056-10.tif output -l slo"; it seems to contain only a few spaces).
But I don't see anything wrong with DangAmbigs. I copied the English version
and deleted a few lines (those containing characters that should not be used
in Slovene).
Original comment by rum...@gmail.com
on 22 Jul 2007 at 3:46
Attachments:
Have you solved the problem? If so, please explain in detail, for the benefit
of others, the step-by-step procedure you followed to create slo.freq-dawg
using the command line "wordlist2dawg frequent_words_list freq-dawg". It would
be nice if you uploaded copies of the word lists you created for (1) freq-dawg
and (2) word-dawg.
In my case, I could not create freq-dawg for the Kannada language.
Original comment by withbles...@gmail.com
on 4 Aug 2007 at 5:38
I have also tried to teach tesseract the Slovene language and had the same
problem. I solved it by building *.box files with at least one box for every
letter known in the Slovene language (this at-least-one-sample-of-every-letter
is probably also needed to teach tesseract properly), so that the resulting
unicharset list had all the characters in the Slovene language (and numbers,
other symbols ...). (In my first version of it, and in the previously attached
slo.unicharset file, some of them are missing.)
I think this is still a bug, as it should print some meaningful error message.
For example, at least: "Found a letter not in the unicharset list."
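The "at least one box for every letter" check described above could be
automated before training. The following is only a minimal sketch, not part of
tesseract: the function name LettersWithoutBox is hypothetical, and it assumes
that each box-file line starts with the character (as UTF-8) followed by
whitespace and its coordinates.

```cpp
#include <istream>
#include <set>
#include <sstream>
#include <string>

// Returns the letters of `alphabet` that never appear as the first token
// of a line in `box_file`.
// Assumption: each box-file line starts with the character (UTF-8),
// followed by whitespace and its coordinates.
std::set<std::string> LettersWithoutBox(std::istream& box_file,
                                        const std::set<std::string>& alphabet) {
  std::set<std::string> missing = alphabet;  // start by assuming all are missing
  std::string line;
  while (std::getline(box_file, line)) {
    std::istringstream fields(line);
    std::string character;
    // Blank lines yield no token and are skipped.
    if (fields >> character) missing.erase(character);
  }
  return missing;
}
```

Passing a std::ifstream opened on a box file, together with the set of letters
the language needs, would return the letters that still lack a sample.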
The results are just horrible. I will have to iterate the learning process (use
the current version of the learned Slovene language to read some more pages and
repeat).
I am attaching the 1163700-word word_list and the 50-word frequent_words_list
I got from aspell and Wikipedia:
http://sl.wikipedia.org/wiki/Najpogostej%C5%A1e_slovenske_besede
It took around three hours to compile the word_list dawg file. :-)
Original comment by mmi...@gmail.com
on 8 Aug 2007 at 9:24
Attachments:
withblessings: there is no step-by-step procedure; put one word per line and
issue the command you specified :)
mmitar: wow, thank you for sharing that :) ... and yes, that you have to teach
it all the letters I unfortunately already know.
It seems kind of a bad move to have to teach a language from scratch. There are
many languages that share the same letters (all of the latin1 charset except
"x", "y" and "z" is present in surely more than ... 30 languages?), so I see it
as a _great_ disadvantage that every single letter is language-specific. There
should've been a global stash of letters (like the latin1 charset), and then
each additional language could define its own _additional_ letters.
Original comment by rum...@gmail.com
on 8 Aug 2007 at 5:57
1. Note added to the TrainingTesseract wiki to confirm that you have to check
the output for errors and fix the box files to make sure there is at least one
sample of each character before continuing.
2. Agreed, it is unfortunate that you have to supply samples of every
character. While it would be possible to take data from existing .tr files and
just add a few new characters, this would lead to a complexity nightmare
compared with the current training process, which you surely agree is complex
enough. For one thing, the risk of unicharset not matching the set of
characters in the .tr files would be massively increased. For another, the
complex sort and merge operation required would be hard for most Windows users
to do, as it would require heavy use of a Unix shell like Cygwin.
3. It seems that most (if not all) of the people currently training tesseract
are using Windows, except at Google, where we are using Linux. That makes it
harder for us to support the training effort, as many useful things that we
could do for one platform would be useless for the other. However, your
suggestion is a good one, and I can see that it would be possible to build a
small app that could do this sorting and merging on Windows. (Something that
looks a bit like Character Map.) Any volunteers to build it?
Original comment by theraysm...@gmail.com
on 17 Aug 2007 at 4:01
With reference to "(Something that looks a bit like character map)": it is
available by default in MS Windows versions such as XP for all world languages;
see the uploaded character Map.png. As such, a small app has to be created to
enable tesseract to call Character Map from an OS like XP and select the
language required.
To view all world languages, this has to be enabled in Control Panel ->
"Regional & Lan..."; see the uploaded Regional & language options.png (which
is self-explanatory).
Original comment by withbles...@gmail.com
on 17 Aug 2007 at 5:17
Attachments:
theraysmith: 3. I don't use Windows, I prefer BSD, so "if not all" is not
likely. :P
Original comment by rum...@gmail.com
on 20 Aug 2007 at 10:45
I replaced

  assert(length > 0 && length <= UNICHAR_LEN);
  assert(ids.contains(unichar_repr, length));
  return ids.unichar_to_id(unichar_repr, length);

with

  if (ids.contains(unichar_repr, length)) {
    return ids.unichar_to_id(unichar_repr, length);
  } else {
    // what a pity.
    return 2;
  }

where 2 is just an arbitrary value. I did not take the time to look at which
value might make more sense; I just assumed that the index "2" exists, and I
did not bother to dig into the details of the inner structures.
I do not care much if a single character is not recognized. There are lots of
others that will not be recognized either when reading Fraktur. But asserting
and dumping core just because the config file has some problems is definitely
a bad idea.
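An alternative to returning an arbitrary index would be a dedicated sentinel
plus an error message, so that callers can tell "unknown character" apart from
a real id. The sketch below is purely illustrative: ToyCharset and kInvalidId
are hypothetical names and are not tesseract's own classes or constants.

```cpp
#include <cstdio>
#include <map>
#include <string>

// Sentinel chosen for this sketch only; it can never be a valid id below.
constexpr int kInvalidId = -1;

// Hypothetical stand-in for a character table; NOT tesseract's UNICHARSET.
class ToyCharset {
 public:
  // Registers a character and gives it the next free id (0, 1, 2, ...).
  // Adding a character that is already present changes nothing.
  void Add(const std::string& unichar) {
    ids_.emplace(unichar, static_cast<int>(ids_.size()));
  }

  // Returns the id of `unichar`, or kInvalidId (with a message on stderr)
  // when the character is unknown, instead of asserting.
  int UnicharToId(const std::string& unichar) const {
    const auto it = ids_.find(unichar);
    if (it == ids_.end()) {
      std::fprintf(stderr, "Found a letter not in the character set: \"%s\"\n",
                   unichar.c_str());
      return kInvalidId;
    }
    return it->second;
  }

 private:
  std::map<std::string, int> ids_;
};
```

Callers would then have to check for kInvalidId explicitly, which is more work
than an arbitrary fallback but does not silently map unknown characters onto
an existing one.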
Original comment by heiko.ev...@gmx.de
on 8 Jan 2008 at 10:43
I'm attempting to train tesseract to work on a dictionary digitization project
for the Salishan language Lillooet. I went through the training, reran the OCR
on the training page to make sure there were no mistakes, and found one. I
corrected it, reran all the necessary commands (tesseract ... box.train,
mftraining, cntraining, unicharset_extractor) and tried again. When I did so, I
started getting the above assertion. I added a print statement to figure out
where it dies, and the following is what shows up:
x̌wəmʼ-c-minʼ!'to
wəmʼ-c-minʼ!'to
x̌wəmʼ-c-minʼ!'to
əmʼ-c-minʼ!'to
əmʼ-c-minʼ!'to
x̌ʷəmʼ-c-minʼ!!
̌ʷəmʼ-c-minʼ!!
For some reason, tesseract steps through this string and removes the x without
bringing the caron with it. (There does not appear to be an X WITH CARON
character in Unicode, so the combining character is necessary.) However, it
doesn't do this earlier. The caron alone is nowhere in the repertoire and
shouldn't be, as it never appears in isolation. Any idea what the cause of this
is? (Let me know if I should attach files.)
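For illustration only, here is one way a UTF-8 string could be stepped through
so that a combining mark stays attached to its base character. This is a sketch
under stated assumptions, not a diagnosis of what tesseract does: the function
names are hypothetical, the input is assumed to be well-formed UTF-8, and only
the Combining Diacritical Marks block (U+0300 to U+036F, which includes the
combining caron U+030C) is treated as combining. Full grapheme segmentation
covers many more cases.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Length in bytes of the UTF-8 sequence that starts with `lead`.
// Assumes well-formed input; a stray continuation byte counts as 1 byte.
std::size_t Utf8Length(unsigned char lead) {
  if (lead >= 0xF0) return 4;
  if (lead >= 0xE0) return 3;
  if (lead >= 0xC0) return 2;
  return 1;
}

// True if the code point starting at `pos` lies in U+0300..U+036F,
// i.e. the byte pairs CC 80..CC BF and CD 80..CD AF.
bool IsCombiningDiacritic(const std::string& s, std::size_t pos) {
  if (pos + 1 >= s.size()) return false;
  const unsigned char b0 = static_cast<unsigned char>(s[pos]);
  const unsigned char b1 = static_cast<unsigned char>(s[pos + 1]);
  return (b0 == 0xCC && b1 >= 0x80 && b1 <= 0xBF) ||
         (b0 == 0xCD && b1 >= 0x80 && b1 <= 0xAF);
}

// Splits `utf8` into units of one base code point plus any combining
// diacritics that directly follow it.
std::vector<std::string> SplitIntoClusters(const std::string& utf8) {
  std::vector<std::string> clusters;
  std::size_t pos = 0;
  while (pos < utf8.size()) {
    std::size_t len = Utf8Length(static_cast<unsigned char>(utf8[pos]));
    if (pos + len > utf8.size()) len = utf8.size() - pos;  // truncated tail
    if (IsCombiningDiacritic(utf8, pos) && !clusters.empty()) {
      clusters.back().append(utf8, pos, len);  // keep the mark with its base
    } else {
      clusters.emplace_back(utf8, pos, len);   // start a new unit
    }
    pos += len;
  }
  return clusters;
}
```

With this grouping, dropping the first unit of a string that begins with x
followed by a combining caron would remove both code points together rather
than leaving the caron on its own.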
Original comment by leftmost...@gmail.com
on 6 Mar 2008 at 7:11
I am receiving this error. My box file did not have any "fatalities". It
recognized and identified all characters. The training process seemed to
complete okay, and I copied the resultant 8 files to a brand new language,
named after the font, FiveLineThinFont. When I feed a .txt file in, I get the
assert and core dump.
What am I doing wrong? Is this thread saying that every language must contain a
character for every other language? Doesn't the -l option take care of this?
Original comment by rebecca....@polycom.com
on 21 Jul 2008 at 4:02
Comment 11 - follow on to 10
I am using version 2.01.
Original comment by rebecca....@polycom.com
on 21 Jul 2008 at 4:04
These issues were resolved in 2.03.
Original comment by theraysm...@gmail.com
on 30 Dec 2008 at 9:36
Original issue reported on code.google.com by rum...@gmail.com
on 22 Jul 2007 at 1:26
Attachments: