patcharats / tesseract-ocr

Automatically exported from code.google.com/p/tesseract-ocr

accent support #25

Closed GoogleCodeExporter closed 9 years ago

GoogleCodeExporter commented 9 years ago
Well, I did not see a bug report about this already, so here I go.

Tesseract only supports English characters. It would be really nice to be
able to OCR texts in other languages, such as French, which has accented
characters like é à û ù, or Spanish. Of course there are other, more
complex languages, but supporting accents would cover a bunch of Latin
languages, I presume.

Meanwhile, it's funny to look at the character "é" being recognized as "e"
or "6" :)

Original issue reported on code.google.com by nekoh...@gmail.com on 11 Apr 2007 at 2:06

GoogleCodeExporter commented 9 years ago
Is there any workaround for this issue? For instance by using some training 
facility?

Original comment by plcarva...@gmail.com on 12 Apr 2007 at 9:22

GoogleCodeExporter commented 9 years ago
Well... you could probably add your own heuristics for these special cases
without training (which does not yet work). For example, see how tess
determines whether a "dot" is just noise or actually part of the letter "i":
http://tesseract-ocr.repairfaq.org/makerow_8cpp.html#90ccf46408d4dc726cb6ad4b7abb731d
Entry point might be here:
http://tesseract-ocr.repairfaq.org/makerow_8cpp.html#fdc4c4b87028fd7aafcb679f92364d21

The "Meanwhile" part of your message is due to the permuter: the dictionary
included with tess simply does not include words that contain certain letters
in the "wrong" places, so it will substitute something that makes "more" sense
to it. See:
http://tesseract-ocr.repairfaq.org/allaboutdawg.html
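
A toy illustration of the dictionary effect described above. This is not
Tesseract's actual permuter or DAWG code: the function name `pick_word`, the
candidate scores, and the tiny word sets are all made up for the example. It
assumes the classifier offers several candidate characters per position and
that a dictionary hit is preferred over the raw top choices.

```python
from itertools import product


def pick_word(candidates, dictionary):
    """Choose a word from per-position character candidates.

    candidates: one list per character position, each holding
                (char, score) pairs ordered best first (hypothetical scores).
    dictionary: set of words considered valid.
    """
    best_word, best_score = None, float("-inf")
    # Try every combination of candidate characters.
    for combo in product(*candidates):
        word = "".join(ch for ch, _ in combo)
        score = sum(s for _, s in combo)
        # Only dictionary words compete; the highest total score wins.
        if word in dictionary and score > best_score:
            best_word, best_score = word, score
    if best_word is not None:
        return best_word
    # No dictionary word found: fall back to the top choice per position.
    return "".join(position[0][0] for position in candidates)


# The classifier "sees" the French word "thé" and ranks é above e and 6.
candidates = [
    [("t", 0.9)],
    [("h", 0.9)],
    [("é", 0.6), ("e", 0.3), ("6", 0.1)],
]

english_only = {"the", "then"}
with_french = {"the", "then", "thé"}

print(pick_word(candidates, english_only))  # accent is lost
print(pick_word(candidates, with_french))   # accent survives
```

With only English words available, the accented reading never matches the
dictionary, so a lower-scored but "valid" spelling replaces it.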

Joke: You can probably include some of the most common words that use é by
using a '6' instead of the "é", just so that later you *know* which letter it
found ;-)

Cheers,
Fil

Original comment by fil...@repairfaq.org on 16 Apr 2007 at 1:26

GoogleCodeExporter commented 9 years ago
To clarify, tesseract will need *formal* language-specific support to work as
well for other languages as it does now for English. This is not only because
it was trained for English fonts but also because the DAWG only has English
words in it. So, right now you have expected problems *recognizing* the
non-English letters *and* expected problems *verifying* that the recognized
letters lead to a *valid* English word. There are ways to shut off the permuter
(with the config file, I forget the option, sorry) but *trust me* you do not
want to do that :-)

http://tesseract-ocr.repairfaq.org/allaboutdawg.html

Cheers,
Fil

Original comment by fil...@repairfaq.org on 16 Apr 2007 at 1:31

GoogleCodeExporter commented 9 years ago
Will be fixed in a future release.

Original comment by theraysm...@gmail.com on 17 May 2007 at 7:26

GoogleCodeExporter commented 9 years ago
In the hope that this can help you, I am hereby attaching samples written in
French, scanned at 600 DPI and cleaned up in the GIMP. The columns could be
parsed fine by OCROpus, but since the big problem here is accents, I figured I
needed to submit these to tesseract instead of OCROpus.

I don't think the 600dpi sample can be of much use: I have 1 GiB of RAM and
trying to OCR it would make me swap to death instantly, whereas OCRing the
300dpi version went fine (using only a few MiB of memory). At the same time, I
have to ask: is that huge memory consumption with the 600dpi sample normal at
all?

If you want me to submit those samples to the ocropus project, just ask ;)

Original comment by nekoh...@gmail.com on 30 May 2007 at 3:31

GoogleCodeExporter commented 9 years ago
V2.00 will support English, French, German, Italian, Spanish, Dutch.

Original comment by theraysm...@gmail.com on 7 Jul 2007 at 1:29

GoogleCodeExporter commented 9 years ago
Is there an estimated time of arrival for 2.0? The roadmap on the homepage is
very vague...

Original comment by nekoh...@gmail.com on 7 Jul 2007 at 2:33

GoogleCodeExporter commented 9 years ago
I updated the roadmap. It is almost ready. There are still a few issues to
check and some inconsistencies to resolve. Look for it next week!

Original comment by theraysm...@gmail.com on 13 Jul 2007 at 2:05

GoogleCodeExporter commented 9 years ago

Original comment by theraysm...@gmail.com on 18 Jul 2007 at 10:26

GoogleCodeExporter commented 9 years ago
Sorry, but it is not perfect just yet :) Could this be reopened?

I have tested with my favorite samples, and certain characters still screw up.
Namely, in "french.png":
- e is converted to c
- o is converted to 0
- some nn are converted to m, others are converted to 11
- è is converted to é
- « and » are converted to < < and > >
- l is converted to 1
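
One way to collect a list like the one above automatically, assuming you have
a hand-typed transcription of the page to compare against. This is a minimal
sketch using Python's difflib; the function name `tally_confusions` and the
sample strings are hypothetical, and nothing here comes from Tesseract itself.

```python
from collections import Counter
from difflib import SequenceMatcher


def tally_confusions(expected, actual):
    """Count (expected_text, ocr_text) pairs where the two strings differ."""
    counts = Counter()
    # autojunk=False keeps difflib from ignoring frequent characters
    # in long inputs.
    matcher = SequenceMatcher(None, expected, actual, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            # An insertion or deletion shows up with an empty string
            # on one side of the pair.
            counts[(expected[i1:i2], actual[j1:j2])] += 1
    return counts


# Hypothetical ground truth and OCR output for a single word.
for (wanted, got), n in tally_confusions("donner", "domer").items():
    print(f"{wanted!r} is converted to {got!r} ({n}x)")
```

Running this over whole pages and summing the counters would show which
confusions dominate, which is more useful in a bug report than spot checks.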

In the previous 300dpi.png sample from comment #5, the accents screw up a bit
(a lot) more. Interestingly enough, 150dpi.png is slightly better parsed than
300dpi.png.

Original comment by nekoh...@gmail.com on 19 Jul 2007 at 12:14
