manisandro / gImageReader

A Gtk/Qt front-end to tesseract-ocr.
GNU General Public License v3.0

How do I add a new dictionary (Urdu)? #584

Closed ReaderGuy42 closed 2 years ago

ReaderGuy42 commented 2 years ago

I've searched online about this a lot, but I can't find much that applies. I'm not sure whether something changed in newer versions of gImageReader: in an older issue you mention that the traineddata needs to be put in the tessdata folder within gImageReader, but I can't find any such folder, only the tessdata folder in tesseract. I'm on Linux Mint, if that makes a difference.

Then I put the .aff and .dic files in MySpell (which was a chore, because I had a hard time finding the files for Urdu), but gImageReader doesn't seem to find them: the popup is still telling me that the spelling dictionary is missing.

The OCR seems to work moderately well (I chose the BEST version of the traineddata), but I can't read enough Urdu to be certain, which is why a spellchecker would come in handy.

What am I doing wrong? Thanks :)

manisandro commented 2 years ago

Where did you place the .dic and .aff files? If you look at the gImageReader configuration dialog, you'll see where it is looking for the files. Also, make sure the files are named correctly according to the language code of the language, and that the language is configured in the configuration dialog.

ReaderGuy42 commented 2 years ago

The files are in /usr/share/myspell/dicts.

I've named them both ur_PK. Is that correct?

manisandro commented 2 years ago

If you have an entry in the config dialog associating that code to the 3-char lang code of the traineddata file, then yes.

ReaderGuy42 commented 2 years ago

I'm not sure I understand that. What do you mean by config dialog? Do you mean the Preferences window? If so, then yes, there's an entry for Urdu, saying the Code is ur. [screenshot]

Why is gImageReader not finding the dictionary files then?

manisandro commented 2 years ago

Does it find other files in /usr/share/myspell? Do you have /usr/share/hunspell?

ReaderGuy42 commented 2 years ago

I do have the hunspell folder as well. I'm not sure if it finds the other files in myspell. I tried finding an Arabic dictionary file, but I can't find one, even though gImageReader has Arabic listed as a language, so I may have just installed that via gImageReader directly, if that's possible? I don't remember. So I can't really tell if it's finding other things in myspell (or hunspell for that matter).

After a quick check, it also tells me that it can't find an Arabic dictionary, which makes sense since I can't find the file. But the Urdu file is there, so I don't know what to do.

manisandro commented 2 years ago

You are using the system-wide folder configuration, right?

Also try the hunspell folder. Distros are moving away from /usr/share/myspell in favour of /usr/share/hunspell.
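A quick way to see which of these folders actually holds a complete dictionary is to look for a matching .dic/.aff pair in each one. The following is only a minimal sketch: `find_dictionary` and `CANDIDATE_DIRS` are hypothetical names made up for illustration, the directory list is taken from the paths mentioned in this thread, and none of this reflects how gImageReader itself searches for dictionaries.

```python
from pathlib import Path

# Assumption: these are just the folders mentioned in this thread;
# other distros may keep spelling dictionaries elsewhere.
CANDIDATE_DIRS = [
    "/usr/share/hunspell",
    "/usr/share/myspell",
    "/usr/share/myspell/dicts",
]


def find_dictionary(code, dirs=CANDIDATE_DIRS):
    """Return the directories that contain both <code>.dic and <code>.aff."""
    hits = []
    for d in dirs:
        base = Path(d)
        # is_file() follows symlinks, so symlinked dictionaries count too.
        if (base / f"{code}.dic").is_file() and (base / f"{code}.aff").is_file():
            hits.append(str(base))
    return hits


if __name__ == "__main__":
    # Check both the full locale name and the short code.
    for code in ("ur_PK", "ur"):
        print(code, "->", find_dictionary(code) or "not found")
```

If a pair shows up only under the myspell folder, that would be consistent with the suggestion to try the hunspell folder as well.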

ReaderGuy42 commented 2 years ago

Yes, it's set to "System-wide paths". I just symlinked the two Urdu dictionary files into the hunspell folder, and now the popup in gImageReader has gone away. But now it's no longer recognizing the Urdu alphabet at all, which is weird, since it did work in an earlier test.

I've only been messing with the dictionary files, not the traineddata. Any ideas what may have changed there?
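For anyone repeating the symlinking step described above, it can be sketched roughly as follows. `link_dictionary` is a hypothetical helper written for this illustration, not something shipped with gImageReader or hunspell, and the paths in the usage note are the ones mentioned in this thread.

```python
from pathlib import Path


def link_dictionary(code, src_dir, dst_dir):
    """Symlink <code>.dic and <code>.aff from src_dir into dst_dir.

    Returns the list of links that were newly created.
    """
    created = []
    for ext in (".dic", ".aff"):
        src = Path(src_dir) / f"{code}{ext}"
        dst = Path(dst_dir) / f"{code}{ext}"
        if not src.is_file():
            raise FileNotFoundError(src)
        # Leave any existing file or link (even a broken one) untouched.
        if dst.exists() or dst.is_symlink():
            continue
        dst.symlink_to(src.resolve())
        created.append(str(dst))
    return created
```

With the paths from this thread the call would look like `link_dictionary("ur_PK", "/usr/share/myspell/dicts", "/usr/share/hunspell")`. Writing to `/usr/share/hunspell` normally requires root privileges, so the script would have to be run with `sudo`.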

ReaderGuy42 commented 2 years ago

Wait, never mind, I guess it just didn't like the title pages. So it works now, though I'm not sure how well the OCR'ing of Urdu works. Thanks for the help!!