patcharats / tesseract-ocr

Automatically exported from code.google.com/p/tesseract-ocr

After training tesseract it dies when trying to create text from an image #47

Closed GoogleCodeExporter closed 9 years ago

GoogleCodeExporter commented 9 years ago
What steps will reproduce the problem?
1. Follow the procedure at
http://code.google.com/p/tesseract-ocr/wiki/TrainingTesseract to train
tesseract to recognize Slovene characters.
2. Move the 8 resulting files into the tessdata folder.
3. Run: tesseract <tif_image> output -l slo

What is the expected output? What do you see instead?
Expected: no console output, just an output.txt file. Instead I get
'assertion "ids.contains(unichar_repr, length)" failed: file "unicharset.cpp",
line 67' followed by 'Abort (core dumped)'.

What version of the product are you using? On what operating system?
2.00, on DragonFly BSD 1.9.0-DEVELOPMENT.

Please provide any additional information below.
I took a book from http://www.omnibus.se/beseda/, which has several free
eBooks, and converted a PDF into TIFF images with ImageMagick (convert
056-1-1.pdf -colorspace gray -depth 8 056-%d.tif). I picked a few of these
TIFFs (the ones I used are attached) and followed the training procedure (the
resulting 8 files are also attached). I then moved the files into my tessdata
folder and tested tesseract with them on an image, and it just dies with the
error message quoted under "What is the expected output? What do you see
instead?".

Original issue reported on code.google.com by rum...@gmail.com on 22 Jul 2007 at 1:26

Attachments:

GoogleCodeExporter commented 9 years ago
Found the problem: the DangAmbigs file is causing the crash. Without it,
tesseract continues, but it creates an almost empty output.txt file when I run
"tesseract 056-10.tif output -l slo" (the file seems to contain only a few
spaces). But I don't see anything wrong with DangAmbigs. I copied the English
version and deleted a few lines (those containing characters that should not
be used in Slovene).

Original comment by rum...@gmail.com on 22 Jul 2007 at 3:46

Attachments:

GoogleCodeExporter commented 9 years ago
Have you solved the problem? If so, please explain in detail, for the benefit
of others, the step-by-step procedure you followed to create slo.freq-dawg
with the command line "wordlist2dawg frequent_words_list freq-dawg". It would
also be nice if you uploaded copies of the word lists you created for
(1) freq-dawg and (2) word-dawg.

In my case, I could not create freq-dawg for the Kannada language.

Original comment by withbles...@gmail.com on 4 Aug 2007 at 5:38

GoogleCodeExporter commented 9 years ago
I have also tried to teach tesseract the Slovene language and had the same
problem. I solved it by building *.box files with at least one box for every
letter used in Slovene (this at-least-one-sample-of-every-letter is probably
also needed to teach tesseract properly), so that the resulting unicharset
list had all the characters of the Slovene language (plus numbers, other
symbols ...). (In my first version, and in the previously attached
slo.unicharset file, some of them are missing.)

I think this is still a bug, as it should print some meaningful error message.
For example, at least: "Found a letter not in the unicharset list."
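
A pre-flight check along these lines could produce that kind of message before
recognition starts. The sketch below is not tesseract code:
find_missing_unichars and utf8_sequence_length are hypothetical helpers, and
the set of known characters is assumed to have been loaded from the unicharset
file by the caller. It lists every character of a UTF-8 text (for example the
box file labels or a word list) that is missing from that set.

```cpp
#include <cassert>
#include <cstddef>
#include <set>
#include <string>
#include <vector>

// Byte length of a UTF-8 sequence, judged from its lead byte.
// Returns 1 for malformed lead bytes so the scan always advances.
inline std::size_t utf8_sequence_length(unsigned char lead) {
  if (lead < 0x80) return 1;
  if ((lead & 0xE0) == 0xC0) return 2;
  if ((lead & 0xF0) == 0xE0) return 3;
  if ((lead & 0xF8) == 0xF0) return 4;
  return 1;
}

// Hypothetical helper (not part of tesseract): returns each distinct
// character of `text` that is absent from `known`, in order of first
// appearance. Whitespace is skipped.
inline std::vector<std::string> find_missing_unichars(
    const std::set<std::string>& known, const std::string& text) {
  std::vector<std::string> missing;
  std::set<std::string> reported;
  for (std::size_t i = 0; i < text.size();) {
    std::size_t len =
        utf8_sequence_length(static_cast<unsigned char>(text[i]));
    assert(len >= 1 && len <= 4);
    if (i + len > text.size()) len = text.size() - i;  // truncated tail
    const std::string ch = text.substr(i, len);
    i += len;
    if (ch == " " || ch == "\n" || ch == "\t" || ch == "\r") continue;
    if (known.count(ch) == 0 && reported.insert(ch).second) {
      missing.push_back(ch);
    }
  }
  return missing;
}
```

With a known set of a, b and č, the text "abža čž" yields the single entry ž,
which a caller could then report as "Found a letter not in the unicharset
list: ž" instead of aborting.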

The results are just horrible. I will have to iterate the learning process
(use the current version of the learned Slovene language to read some more
pages and repeat).

I am attaching the 1163700-word word_list and the 50-word frequent_words_list,
which I got from aspell and Wikipedia:

http://sl.wikipedia.org/wiki/Najpogostej%C5%A1e_slovenske_besede

It took around three hours to compile the word_list dawg file. :-)

Original comment by mmi...@gmail.com on 8 Aug 2007 at 9:24

Attachments:

GoogleCodeExporter commented 9 years ago
withblessings: there is no step-by-step procedure; put one word per line in
the file and run the command you specified :)

mmitar: wow, thank you for sharing that :) ... and yes, unfortunately I
already know that you have to teach it all the letters.

It seems kind of a bad move to have to teach a language from scratch. There
are many languages that share the same letters (all of the latin1 charset
except "x", "y" and "z" is surely present in more than ... 30 languages?), so
I see it as a _great_ disadvantage that every single letter is language
specific. There should've been a global stash of letters (like the latin1
charset), and then each additional language could define its own _additional_
letters.

Original comment by rum...@gmail.com on 8 Aug 2007 at 5:57

GoogleCodeExporter commented 9 years ago
1. Note added to the TrainingTesseract wiki to confirm that you have to check
the output for errors and fix the box files to make sure there is at least one
sample of each character before continuing.

2. Agreed, it is unfortunate that you have to supply samples of every
character. While it would be possible to take data from existing .tr files and
just add a few new characters, this would lead to a complexity nightmare
compared with the current training process, which you surely agree is complex
enough. For one thing, the risk of unicharset not matching the set of
characters in the .tr files would be massively increased. For another, the
complex sort and merge operation required would be hard for most Windows users
to do, as it would require heavy use of a Unix shell like Cygwin.

3. It seems that most (if not all) of the people currently training tesseract
are using Windows, except at Google, where we are using Linux. That makes it
harder for us to support the training effort, as many useful things that we
could do for one platform would be useless for the other. However, your
suggestion is a good one, and I can see that it would be possible to build a
small app that could do this sorting and merging on Windows. (Something that
looks a bit like Character Map.) Any volunteers to build it?

Original comment by theraysm...@gmail.com on 17 Aug 2007 at 4:01

GoogleCodeExporter commented 9 years ago
With reference to "(Something that looks a bit like character map)": this is
available by default in MS Windows (XP, for example) for all world languages;
see the uploaded Character Map.png. So a small app has to be created that lets
tesseract call Character Map from an OS like XP and select the required
language. To view all world languages, they have to be enabled in Control
Panel -> "Regional & Lan..."; see the uploaded Regional & language options.png
(which is self-explanatory).

Original comment by withbles...@gmail.com on 17 Aug 2007 at 5:17

Attachments:

GoogleCodeExporter commented 9 years ago
theraysmith: 3. I don't use Windows, I prefer BSD, so "if not all" is not
likely. :P

Original comment by rum...@gmail.com on 20 Aug 2007 at 10:45

GoogleCodeExporter commented 9 years ago
I replaced
  assert(length > 0 && length <= UNICHAR_LEN);
  assert(ids.contains(unichar_repr, length));
  return ids.unichar_to_id(unichar_repr, length);

with
  if ( ids.contains(unichar_repr, length) ) {
    return ids.unichar_to_id(unichar_repr, length);
  }
  else { 
    // what a pity.
    return 2;
  }

where 2 is just an arbitrary value. I did not take the time to work out which
value might make more sense; I just assumed that index "2" exists and did not
bother to dig into the details of the internal structures.

I do not care much if a single character is not recognized. There are lots of
others that will not be recognized either when reading Fraktur. But asserting
and dumping core just because the config file has some problems is definitely
a bad idea.
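
A variant of this patch that avoids the arbitrary index could return a
sentinel that the caller can test for. The following is only a minimal sketch
and not the fix that went into tesseract: UnicharTable, lookup_unichar_id and
kInvalidUnicharId are hypothetical names, and a plain std::unordered_map
stands in for the real ids structure.

```cpp
#include <cassert>
#include <iostream>
#include <string>
#include <unordered_map>

// Hypothetical sentinel meaning "not in the unicharset".
// This is an assumption for illustration, not a tesseract constant.
constexpr int kInvalidUnicharId = -1;

// Stand-in for the real id table: maps a UTF-8 character to its id.
using UnicharTable = std::unordered_map<std::string, int>;

// Looks up `unichar_repr` and returns its id. An unknown character is
// reported on stderr and mapped to kInvalidUnicharId instead of aborting
// the process or silently aliasing it to some other character.
inline int lookup_unichar_id(const UnicharTable& ids,
                             const std::string& unichar_repr) {
  assert(!unichar_repr.empty());  // an empty key is a caller bug
  const auto it = ids.find(unichar_repr);
  if (it == ids.end()) {
    std::cerr << "Warning: character \"" << unichar_repr
              << "\" is not in the unicharset\n";
    return kInvalidUnicharId;
  }
  return it->second;
}
```

The difference from returning 2 is that callers can tell an unknown character
apart from a real one, at the cost of having to check the result before using
it as an index.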

Original comment by heiko.ev...@gmx.de on 8 Jan 2008 at 10:43

GoogleCodeExporter commented 9 years ago
I'm attempting to train tesseract to work on a dictionary digitization project
for the Salishan language Lillooet. I went through the training, reran the OCR
on the training page to make sure there were no mistakes, and found one. I
corrected it, reran all the necessary commands (tesseract ... box.train,
mftraining, cntraining, unicharset_extractor) and tried again. When I did so,
I started getting the above assertion. I added a print statement to figure out
where it dies, and the following is what shows up:
x̌wəmʼ-c-minʼ!'to
wəmʼ-c-minʼ!'to
x̌wəmʼ-c-minʼ!'to
əmʼ-c-minʼ!'to
əmʼ-c-minʼ!'to
x̌ʷəmʼ-c-minʼ!!
̌ʷəmʼ-c-minʼ!!
For some reason, tesseract is stepping through this string and removes the x
without bringing the caron with it. (There does not appear to be an X WITH
CARON character in Unicode, so the combining character is necessary.) However,
it doesn't do this earlier. The caron alone is nowhere in the repertoire and
shouldn't be, as it never appears in isolation. Any idea what the cause of
this is? (Let me know if I should attach files.)
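
The trace above would be consistent with the string being consumed one code
point at a time, so that x and U+030C COMBINING CARON end up as separate
units; whether tesseract actually does this is an assumption. The sketch below
shows the alternative of keeping a combining mark attached to its base
character. decode_utf8, is_combining_mark and split_into_units are
hypothetical helpers, and only the Combining Diacritical Marks block
(U+0300 to U+036F) is recognized, which is a simplification of the full
Unicode rules.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Decodes the UTF-8 sequence starting at text[pos] and stores the number of
// bytes consumed in *len. Minimal decoder: a malformed or truncated sequence
// is passed through as a single byte.
inline std::uint32_t decode_utf8(const std::string& text, std::size_t pos,
                                 std::size_t* len) {
  const unsigned char lead = static_cast<unsigned char>(text[pos]);
  std::size_t n = 1;
  std::uint32_t cp = lead;
  if ((lead & 0xE0) == 0xC0) { n = 2; cp = lead & 0x1F; }
  else if ((lead & 0xF0) == 0xE0) { n = 3; cp = lead & 0x0F; }
  else if ((lead & 0xF8) == 0xF0) { n = 4; cp = lead & 0x07; }
  if (pos + n > text.size()) { *len = 1; return lead; }
  for (std::size_t k = 1; k < n; ++k) {
    cp = (cp << 6) | (static_cast<unsigned char>(text[pos + k]) & 0x3F);
  }
  *len = n;
  return cp;
}

// Simplification: treats only U+0300..U+036F as combining marks.
inline bool is_combining_mark(std::uint32_t cp) {
  return cp >= 0x0300 && cp <= 0x036F;
}

// Splits UTF-8 text into units of one base character plus any combining
// marks that follow it, so "x" + U+030C stays together as one unit.
inline std::vector<std::string> split_into_units(const std::string& text) {
  std::vector<std::string> units;
  for (std::size_t pos = 0; pos < text.size();) {
    std::size_t len = 0;
    const std::uint32_t cp = decode_utf8(text, pos, &len);
    assert(len >= 1);
    if (is_combining_mark(cp) && !units.empty()) {
      units.back().append(text, pos, len);  // attach mark to previous base
    } else {
      units.push_back(text.substr(pos, len));
    }
    pos += len;
  }
  return units;
}
```

Applied to the first four characters of the failing string (x, combining
caron, w, ə), this produces three units, the first of which is x together with
its caron. Each unit would then need its own entry in the unicharset.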

Original comment by leftmost...@gmail.com on 6 Mar 2008 at 7:11

GoogleCodeExporter commented 9 years ago
I am receiving this error. My box file did not have any "fatalities". It
recognized and identified all characters. The training process seemed to
complete okay, and I copied the resultant 8 files to a brand new language,
named by the font name FiveLineThinFont. When I feed a .txt file in, I get the
assert and core dump. What am I doing wrong? Is this thread saying that every
language must contain a character for every other language? Doesn't the -l
option take care of this?

Original comment by rebecca....@polycom.com on 21 Jul 2008 at 4:02

GoogleCodeExporter commented 9 years ago
Comment 11 - follow on to 10
I am using version 2.01.

Original comment by rebecca....@polycom.com on 21 Jul 2008 at 4:04

GoogleCodeExporter commented 9 years ago
These issues were resolved in 2.03.

Original comment by theraysm...@gmail.com on 30 Dec 2008 at 9:36