Closed titus483 closed 9 years ago
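Hi,

Two types of language data are used by gImageReader:

* The tesseract language definitions: these are necessary for performing OCR for a specific language (tesseract is the OCR engine used by gImageReader). You can download these here [1].
* The spellchecking dictionaries: these are used to perform spell checking on the OCR result. The *.aff and *.dic files are spellchecking dictionary files.

So in short, while you installed the spellchecking dictionaries, you are missing the actual language support for tesseract. For German, you'll want to download this [2] and place the deu.traineddata file therein in the Tesseract language definitions folder (.../share/tessdata).

[1] https://code.google.com/p/tesseract-ocr/downloads/list
[2] https://code.google.com/p/tesseract-ocr/downloads/detail?name=tesseract-ocr-3.02.deu.tar.gz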
Hi Sandro, thanks for your quick reply. Yes, I did that as a first step (sorry, I forgot to mention it):
1) I copied the deu.traineddata into the tessdata folder
2) I copied the .aff and .dic files into the gImageReader folder
I did indeed follow an article in the German c't magazine 4/2015, where this is described step by step. But it still doesn't work for me...
So you should have something like this in .../usr:

|
|--> myspell
|    |--> dicts
|         |--> de_DE.aff
|         |--> de_DE.dic
|         |--> en_US.aff
|         |--> en_US.dic
|         |--> README.txt
|--> tessdata
     |--> deu.traineddata
     |--> eng.traineddata
     |--> README.txt
If this does not work (though that would really be a first), try only with the deu.traineddata file, without the de_DE.aff and de_DE.dic, to see whether German appears as an entry in the menu.
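One way to compare a folder against this layout is to list what is actually in it. The following is a minimal Python sketch, not part of gImageReader. The function name `report_language_files` and its `root` argument are made up for illustration, and it assumes that `tessdata` and `myspell/dicts` sit directly under `root`.

```python
from pathlib import Path


def report_language_files(root):
    """Return (tesseract_langs, complete_dicts, incomplete_dicts) found under root.

    Assumed layout (hypothetical `root`):
      root/tessdata/<lang>.traineddata
      root/myspell/dicts/<locale>.aff and <locale>.dic
    """
    root = Path(root)
    tessdata = root / "tessdata"
    dicts = root / "myspell" / "dicts"

    # Tesseract language codes, e.g. "deu" from "deu.traineddata".
    langs = sorted(p.stem for p in tessdata.glob("*.traineddata"))

    # A spelling dictionary needs both halves: <locale>.aff and <locale>.dic.
    aff = {p.stem for p in dicts.glob("*.aff")}
    dic = {p.stem for p in dicts.glob("*.dic")}
    complete = sorted(aff & dic)      # both files present
    incomplete = sorted(aff ^ dic)    # only one of the two files present
    return langs, complete, incomplete


if __name__ == "__main__":
    import sys

    # Pass the folder to inspect as the first argument; defaults to ".".
    found = report_language_files(sys.argv[1] if len(sys.argv) > 1 else ".")
    print("traineddata languages:", found[0])
    print("complete dictionaries:", found[1])
    print("incomplete dictionaries:", found[2])
```

If `deu` is missing from the first list, the traineddata file is probably in a different tessdata folder than the one being inspected.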
Great! Now I've got it. My mistake was that I copied the deu.traineddata into the Tesseract/tessdata folder, not into the gImageReader/.../tessdata folder! And now I've got the right menu. Thanks a lot for your help!
Ok cool!
Maybe quite a noobish question, but I'm trying to add the Dutch tesseract data to gImageReader. A Google search led me to this page.
Since the tesseract code has been transferred to GitHub, I started looking there. I'm wondering which files exactly I should copy. All of them, or just the wordlist?
Found it, languages can now be downloaded at: https://github.com/tesseract-ocr/tessdata
Hi Sandro, I am using gImageReader 3.1.91 under Windows 7 with Tesseract 3.05.00 and I am trying to install the German Fraktur OCR software. I followed the dictionary installation instructions and did the following:
1) I copied the deu-frak.traineddata into the tessdata folder
2) I copied the .aff and .dic files into the gImageReader folder
They are there along with the en_US files. But when I try to select "German" with "Recognize selection", even after "Redetect Languages", I can't select "German" (or "Deutsch"). There is just "English" -> "English (United States)". It seems that other people had this kind of problem solved in the past, so obviously I am missing something somewhere.
To which tessdata folder did you download the traineddata files? gImageReader bundles tesseract, so you need to make sure you place the traineddata files in the ...\gImageReader\share\tessdata folder.
The following files are in the ...\gImageReader\share\tessdata folder:

deu.traineddata
deu-frak.traineddata
eng.traineddata
README
Did you make sure you downloaded the actual binary blob and not the html page on github for the traineddata file? Can you try with the integrated tessdata manager (you'll need to start the program as administrator)?
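A rough way to tell a real traineddata file from a saved GitHub web page is to look at the first bytes of the file. This is only a heuristic Python sketch: the helper name `looks_like_html` is made up for illustration, and it does not validate the traineddata format itself.

```python
def looks_like_html(path, probe_bytes=512):
    """Heuristic: True if the file starts like an HTML page, not binary data."""
    with open(path, "rb") as f:
        head = f.read(probe_bytes)
    # Saved web pages normally begin with a doctype or an <html> tag,
    # possibly preceded by whitespace; compare case-insensitively.
    head = head.lstrip().lower()
    return head.startswith((b"<!doctype html", b"<html"))
```

If this returns True for a .traineddata file, the download was the web page rather than the file itself, and it needs to be fetched again as the raw file.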
The following files and their sizes are in the ...\gImageReader\share\tessdata folder:

deu.traineddata         13 054 KB
deu-frak.traineddata     1 933 KB
eng.traineddata         21 364 KB
README                       1 KB
Can you try with the integrated tessdata manager (you'll need to start the program as administrator)?
I have run the program as administrator, using gImageReader, without success. However, I am not sure what it means to run the "integrated tessdata manager".
The integrated tessdata manager can be launched from the language selection menu -> "manage languages...". If that also does not work, we need to do some proper debugging...
On Ubuntu the solution is

sudo apt-get install myspell-de

Other languages have their own myspell package, for example myspell-fr and myspell-it.

By the way, on Ubuntu the files in tessdata are installed with

sudo apt-get install tesseract-ocr-deu tesseract-ocr-fra tesseract-ocr-ita
This is for gImageReader 3.0.1 under Windows 7. I followed the dictionary installation instructions, downloaded the German de_DE.zip and copied the de_DE.aff and de_DE.dic into /share/myspell/dicts. They are there along with the en_US files. But when I try to select "German" with "Recognize selection", even after "Redetect Languages", I can't select "German" (or "Deutsch"). There is just "English" -> "English (United States)" or "Multilingual" -> "English".