manisandro / gImageReader

A Gtk/Qt front-end to tesseract-ocr.
GNU General Public License v3.0

gImageReader does not find non-English dicts #13

titus483 closed this issue 9 years ago

titus483 commented 9 years ago

This is for gImageReader 3.0.1 under Windows 7. I followed the dictionary installation instructions: I downloaded the German de_DE.zip and copied de_DE.aff and de_DE.dic into /share/myspell/dicts. They are there along with the en_US files. But when I try to select "German" with "Recognize selection", even after "Redetect Languages", I can't select "German" (or "Deutsch"). There is just "English" -> "English (United States)" or "Multilingual" -> "English".

manisandro commented 9 years ago

Hi, two types of language data are used by gImageReader:

 * The tesseract language definitions: these are necessary for performing OCR for a specific language (tesseract is the OCR engine used by gImageReader). You can download these here [1].
 * The spellchecking dictionaries: these are used to perform spell checking on the OCR result. The .aff and .dic files are spelling dictionary files.

So in short, while you installed the spellchecking dictionaries, you are missing the actual language support for tesseract. For German, you'll want to download this [2] and place the deu.traineddata file contained therein in the Tesseract language definitions folder (.../share/tessdata).

[1] https://code.google.com/p/tesseract-ocr/downloads/list
[2] https://code.google.com/p/tesseract-ocr/downloads/detail?name=tesseract-ocr-3.02.deu.tar.gz

titus483 commented 9 years ago

Hi Sandro, thanks for your quick reply. Yes, I did that as a first step (sorry, I forgot to mention it):

1) I copied the deu.traineddata into the tessdata folder
2) I copied the .aff and .dic files into the gImageReader folder

I did indeed follow an article in the German c't magazine 4/2015 where this is described step by step. But it still doesn't work for me...

manisandro commented 9 years ago

So you should have something like this in .../share:

|
|--> myspell
|   |--> dicts
|       |--> de_DE.aff
|       |--> de_DE.dic
|       |--> en_US.aff
|       |--> en_US.dic
|       |--> README.txt
|--> tessdata
    |--> deu.traineddata
    |--> eng.traineddata
    |--> README.txt

If this does not work (though that would really be a first), try only with the deu.traineddata file, without the de_DE.aff and de_DE.dic to see whether German appears as an entry in the menu.
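The layout above can be checked mechanically. The following is only a minimal sketch, assuming the folder names shown in the tree; the function name `missing_language_files` and its parameters are hypothetical and not part of gImageReader. You pass the path of the folder that contains `tessdata` and `myspell` yourself.

```python
from pathlib import Path


def missing_language_files(share_dir, tess_code, spell_locale):
    """Return the expected language files that are absent below share_dir.

    share_dir    -- folder that contains 'tessdata' and 'myspell' (supplied by you)
    tess_code    -- tesseract language code, e.g. "deu"
    spell_locale -- spellchecking locale, e.g. "de_DE"
    """
    share = Path(share_dir)
    expected = [
        share / "tessdata" / f"{tess_code}.traineddata",      # OCR language definition
        share / "myspell" / "dicts" / f"{spell_locale}.aff",  # spellchecking affix rules
        share / "myspell" / "dicts" / f"{spell_locale}.dic",  # spellchecking word list
    ]
    # Report paths relative to share_dir, with forward slashes on every platform.
    return [p.relative_to(share).as_posix() for p in expected if not p.is_file()]
```

An empty result means all three files for that language are where the tree above expects them. A result that lists only the myspell entries means OCR data is present and only spell checking is missing.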

titus483 commented 9 years ago

Great! Now I've got it. My mistake was that I copied the deu.traineddata into the Tesseract/tessdata folder, not into the gImageReader/.../tessdata folder! And now I've got the right menu. Thanks a lot for your help!

manisandro commented 9 years ago

Ok cool!

narayaan commented 8 years ago

Maybe quite a noobish question, but I'm trying to add the Dutch tesseract data to gImageReader. A Google search led me to this page.

Since the tesseract code has been transferred to GitHub, I started looking there. I'm wondering which files exactly I should copy. All of them, or just the wordlist?

https://github.com/tesseract-ocr/langdata/tree/master/nld

narayaan commented 8 years ago

Found it: languages can now be downloaded at https://github.com/tesseract-ocr/tessdata

wally53 commented 8 years ago

Hi Sandro, I am using gImageReader 3.1.91 under Windows 7 with Tesseract 3.05.00 and I am trying to install the German Fraktur OCR software. I followed the dictionary installation instructions and did the following:

1) I copied the deu-frak.traineddata into the tessdata folder
2) I copied the .aff and .dic files into the gImageReader folder

They are there along with the en_US files. But when I try to select "German" with "Recognize selection", even after "Redetect Languages", I can't select "German" (or "Deutsch"). There is just "English" -> "English (United States)". It seems that other people had this kind of problem solved in the past, so obviously I am missing something somewhere.

manisandro commented 8 years ago

To which tessdata folder did you download the traineddata files? gImageReader bundles tesseract, so you need to make sure you place the traineddata files in the ...\gImageReader\share\tessdata folder.
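Because a separate Tesseract installation and the bundled copy can each have their own tessdata folder, it may help to list every such folder and what it contains. This is a rough sketch only; `find_tessdata_dirs` is a hypothetical helper name, and the root folder to search is something you choose yourself.

```python
import os


def find_tessdata_dirs(root):
    """Map every directory named 'tessdata' below root to its language codes.

    A language code here is simply the file name without the
    '.traineddata' suffix, e.g. 'deu' or 'deu-frak'.
    """
    suffix = ".traineddata"
    found = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        if os.path.basename(dirpath).lower() == "tessdata":
            found[dirpath] = sorted(
                name[: -len(suffix)] for name in filenames if name.endswith(suffix)
            )
    return found
```

If more than one folder shows up, the one that matters for gImageReader on Windows is the one inside its own installation directory, as described above.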

wally53 commented 8 years ago

The following files are in the ...\gImageReader\share\tessdata folder:

deu.traineddata
deu-frak.traineddata
eng.traineddata
README

manisandro commented 8 years ago

Did you make sure you downloaded the actual binary blob and not the html page on github for the traineddata file? Can you try with the integrated tessdata manager (you'll need to start the program as administrator)?
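One quick way to tell a saved web page from the real binary file is to look at the first bytes. The sketch below is a heuristic only, not a validation of the traineddata format; the function name `looks_like_html` and the 512-byte probe size are assumptions chosen for illustration.

```python
def looks_like_html(path, probe_bytes=512):
    """Heuristic: True if the file starts like an HTML page rather than binary data."""
    with open(path, "rb") as handle:
        # Ignore leading whitespace and letter case before comparing.
        head = handle.read(probe_bytes).lstrip().lower()
    return head.startswith((b"<!doctype html", b"<html"))
```

If this returns True for a .traineddata file, the download most likely saved the GitHub page instead of the file itself, and the file should be downloaded again.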

wally53 commented 8 years ago

The following files and their sizes are in the ...\gImageReader\share\tessdata folder:

deu.traineddata        13 054 KB
deu-frak.traineddata    1 933 KB
eng.traineddata        21 364 KB
README                      1 KB

Regarding "Can you try with the integrated tessdata manager (you'll need to start the program as administrator)?":

I have run gImageReader as administrator, without success. However, I am not sure what it means to run the "integrated tessdata manager".

manisandro commented 8 years ago

The integrated tessdata manager can be launched from the language selection menu -> "manage languages...". If that also does not work, we need to do some proper debugging...

pmontrasio commented 6 years ago

On Ubuntu, the solution is

sudo apt-get install myspell-de

Other languages have their own myspell package, for example myspell-fr and myspell-it. By the way, on Ubuntu the files in tessdata are installed with

sudo apt-get install tesseract-ocr-deu tesseract-ocr-fra tesseract-ocr-ita