Closed GoogleCodeExporter closed 9 years ago
Is there any workaround for this issue? For instance by using some training facility?
Original comment by plcarva...@gmail.com
on 12 Apr 2007 at 9:22
Well... you could probably add your own heuristics for these special cases without training (which does not yet work). For example, see how tess determines whether a "dot" is just noise or actually part of the letter "i":
http://tesseract-ocr.repairfaq.org/makerow_8cpp.html#90ccf46408d4dc726cb6ad4b7abb731d
Entry point might be here:
http://tesseract-ocr.repairfaq.org/makerow_8cpp.html#fdc4c4b87028fd7aafcb679f92364d21
The "Meanwhile" part of your message is due to the permuter: the dictionary included with tess simply does not include words that contain certain letters in the "wrong" places, so it will substitute something that makes "more" sense to it. See:
http://tesseract-ocr.repairfaq.org/allaboutdawg.html
Joke: You can probably include some of the most common words that use é by using a '6' instead of the "é", just so that later you *know* which letter it found ;-)
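A rough sketch of what that later clean-up step could look like, purely as an illustration: it assumes the '6'-for-"é" convention from the joke above, and the name `restore_placeholder` is made up for this example and is not part of tess. It only rewrites a '6' that touches a letter, so standalone numbers survive:

```python
import re

# Assumption: '6' was deliberately put into the word list in place of 'é'.
# [^\W\d_] matches any single Unicode letter (word char that is not a digit or '_').
_LETTER = r"[^\W\d_]"
_PLACEHOLDER = re.compile(rf"(?<={_LETTER})6|6(?={_LETTER})")


def restore_placeholder(text: str) -> str:
    """Turn a '6' back into 'é' when it sits directly next to a letter."""
    return _PLACEHOLDER.sub("é", text)


print(restore_placeholder("caf6 ouvert 6 jours, 6t6 1996"))
# café ouvert 6 jours, été 1996
```

A real '6' glued to a letter (say "6e") would be rewritten too, so this is only as reliable as the convention itself.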
Cheers,
Fil
Original comment by fil...@repairfaq.org
on 16 Apr 2007 at 1:26
To clarify, tesseract will need *formal* language-specific support to work as well for other languages as it does now for English. This is not only because it was trained for English fonts but also because the DAWG only has English words in it. So, right now you have expected problems *recognizing* the non-English letters *and* expected problems *verifying* that the recognized letters lead to a *valid* English word. There are ways to shut off the permuter (with the config file, I forget the option, sorry) but *trust me* you do not want to do that :-)
http://tesseract-ocr.repairfaq.org/allaboutdawg.html
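To illustrate the *verifying* half of the problem, here is a toy sketch in Python. It uses a plain trie rather than a real DAWG (a DAWG also shares common suffixes), and `WordTrie` and `pick_valid` are hypothetical names, not tess code; the actual permuter is considerably more involved. The point is only that a word list without "café" makes the accented reading lose to an unaccented one:

```python
class WordTrie:
    """Toy prefix tree; a real DAWG would also merge shared suffixes."""

    _END = "$end"  # multi-character key, so it cannot clash with a single letter

    def __init__(self, words=()):
        self.root = {}
        for word in words:
            self.add(word)

    def add(self, word):
        node = self.root
        for ch in word:
            node = node.setdefault(ch, {})
        node[self._END] = True

    def contains(self, word):
        node = self.root
        for ch in word:
            node = node.get(ch)
            if node is None:
                return False
        return self._END in node


def pick_valid(candidates, trie):
    """Return the first candidate found in the word list, else the top candidate."""
    for word in candidates:
        if trie.contains(word):
            return word
    return candidates[0]


english = WordTrie(["cafe", "care", "cave"])
# Candidate readings for one word, best classifier guess first (made-up data).
print(pick_valid(["café", "cafe", "cave"], english))  # cafe
```

Even though "café" is ranked first here, the English-only word list rejects it and "cafe" wins.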
Cheers,
Fil
Original comment by fil...@repairfaq.org
on 16 Apr 2007 at 1:31
Will be fixed in a future release.
Original comment by theraysm...@gmail.com
on 17 May 2007 at 7:26
In the hope that this can help you, I am hereby attaching samples written in French, scanned at 600 DPI and cleaned up in the GIMP. The columns could be parsed fine by OCROpus, but since the big problem here is accents, I figured I needed to submit these to tesseract instead of OCROpus.
I don't think the 600dpi sample can be of much use: I have 1 GiB of RAM and trying to OCR it would make me swap to death instantly, whereas OCRing the 300dpi version went fine (using only a few MiB of memory). At the same time, I have to ask: is that huge memory consumption with the 600dpi sample normal at all?
If you want me to submit those samples to the ocropus project, just ask ;)
Original comment by nekoh...@gmail.com
on 30 May 2007 at 3:31
Attachments:
V2.00 will support English, French, German, Italian, Spanish, Dutch.
Original comment by theraysm...@gmail.com
on 7 Jul 2007 at 1:29
Is there an estimated time of arrival for 2.0? The roadmap on the homepage is very vague...
Original comment by nekoh...@gmail.com
on 7 Jul 2007 at 2:33
I updated the roadmap. It is almost ready. There are still a few issues to check and some inconsistencies to resolve. Look for it next week!
Original comment by theraysm...@gmail.com
on 13 Jul 2007 at 2:05
Original comment by theraysm...@gmail.com
on 18 Jul 2007 at 10:26
Sorry, but it is not perfect just yet :) Could this be reopened?
I have tested with my favorite samples, and certain characters still get screwed up. Namely, in "french.png":
- e is converted to c
- o is converted to 0
- some nn are converted to m, others are converted to 11
- è is converted to é
- « and » are converted to < < and > >
- l is converted to 1
In the previous 300dpi.png sample from comment #5, the accents screw up a bit (a lot) more. Interestingly enough, 150dpi.png is parsed slightly better than 300dpi.png.
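In case it helps anyone compare results across samples, here is a minimal sketch for tallying substitutions like the ones listed above. It assumes you have a hand-typed ground-truth string for the page; `tally_substitutions` is a name invented for this example, and it relies only on Python's standard `difflib`, nothing from tesseract:

```python
from collections import Counter
from difflib import SequenceMatcher


def tally_substitutions(truth: str, ocr: str) -> Counter:
    """Count (expected, recognized) pairs where the OCR text differs from the truth."""
    counts = Counter()
    matcher = SequenceMatcher(None, truth, ocr, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "replace":
            continue  # insertions and deletions are ignored in this sketch
        expected, got = truth[i1:i2], ocr[j1:j2]
        if len(expected) == len(got):
            # Same length: treat it as character-for-character substitutions.
            counts.update(zip(expected, got))
        else:
            # Different length (e.g. "nn" read as "m"): keep the spans whole.
            counts[(expected, got)] += 1
    return counts


print(tally_substitutions("le lot", "1e 10t"))
# Counter({('l', '1'): 2, ('o', '0'): 1})
print(tally_substitutions("donne", "dome"))
# Counter({('nn', 'm'): 1})
```

The alignment is only as good as `SequenceMatcher`'s diff, so heavily garbled lines may need to be compared word by word instead.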
Original comment by nekoh...@gmail.com
on 19 Jul 2007 at 12:14
Attachments:
Original issue reported on code.google.com by
nekoh...@gmail.com
on 11 Apr 2007 at 2:06