TesseractOCR

Authors: Rui Fontes, Ângelo Abrantes and Abel Passos do Nascimento Jr.
Download stable version
Compatibility: NVDA version 2019.3 and beyond

Information

This add-on uses the free and open source Tesseract OCR engine to perform optical character recognition on an image file (PDF, JPG, TIF or other) without the need to open it. The text file will be placed in the same folder, with the same name as the original file but with the .TXT extension. It also gives access to WIA-enabled scanners to perform OCR on a paper document. The results are shown in a file named OCR.txt, placed in the user's Documents folder. Finally, it can also get the text from an accessible PDF, using the XPDF tools.

In the NVDA menu, under Preferences, a TesseractOCR section is added, where you can configure the following:

The languages to be used in recognition;
The type of documents to be recognized;
Whether or not a PDF password should be asked for. If this option is checked and the PDF does not have a password, just press Enter in the dialog asking for the password;
The scanner resolution, between 150 and 400 dpi;
Whether to detect the paper orientation;
Whether to use tones to signal the work progress.

With the exception of English and Portuguese, which are already included in the add-on, languages are downloaded and installed when you select one that does not yet exist in the add-on. Note that as the number of selected recognition languages increases, the OCR process will take longer. We therefore recommend that you use only the languages you need. Note also that the quality of recognition may vary according to the order of the languages. Therefore, if the recognition result is not satisfactory, you may want to try another language ordering.
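To illustrate why language order can matter, here is a minimal sketch of how an ordered language list maps onto the Tesseract command-line program, which accepts several languages joined with "+" in its -l option. How the add-on itself calls Tesseract is not described here, so the function name build_tesseract_command and the file names are hypothetical.

```python
from typing import List


def build_tesseract_command(image_path: str, output_base: str,
                            languages: List[str]) -> List[str]:
    """Build an argument list for the Tesseract command-line program.

    Hypothetical helper for illustration only; it is not the add-on's code.
    """
    if not languages:
        raise ValueError("at least one language code is required")
    # Tesseract takes multiple languages as a single "+"-joined value,
    # and the order given by the caller is preserved.
    lang_arg = "+".join(languages)
    return ["tesseract", image_path, output_base, "-l", lang_arg]


# Two orderings of the same languages produce two different commands.
print(build_tesseract_command("scan.png", "scan", ["por", "eng"]))
print(build_tesseract_command("scan.png", "scan", ["eng", "por"]))
```

The sketch only builds the argument list and does not run Tesseract, so it can be tried without the engine installed.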

Shortcut

The default commands are:

Windows+Control+w - to scan and recognize a document through the scanner;
Windows+Control+r - to recognize the selected document;
Windows+Control+t - to get the text from an accessible PDF;
Windows+Control+c - to cancel the scanning process. Please note: it must be issued before the dialog asking if you want to scan more pages appears!

Then just wait for the text file with the recognized text to appear.

These commands can be modified in the "Input gestures" dialog, in the "TesseractOCR" section.

Known problems

When the "Various" option is selected in the "Documents type" combo box, the recognized text will probably appear with many blank lines. This is a known problem with Tesseract and, without consuming lots of processing time, I haven't yet found a solution. But I still haven't given up!

Languages supported

The supported languages in this version are:

Afrikaans
Albanian
Amharic
Arabic
Armenian
Assamese
Azerbaijani (Latin)
Basque
Belarusian
Bengali
Bosnian
Breton
Bulgarian
Burmese
Catalan/Valencian
Cebuano
Cherokee
Chinese simplified
Chinese traditional
Corsican
Croatian
Czech
Danish
Deutsch (German)
Dhivehi
Dutch (Flemish)
Dzongkha
English
Esperanto
Estonian
Faroese
Filipino
Finnish
French
Galician
Georgian
Greek
Gujarati
Haitian
Hebrew
Hindi
Hungarian
Icelandic
Indonesian
Inuktitut
Irish
Italian
Javanese
Japanese
Kannada
Kazakh
Khmer (Central)
Kirghiz
Korean
Kurdish Kurmanji
Lao
Latin
Latvian
Lithuanian
Luxembourgish
Macedonian
Malay
Malayalam
Maltese
Maori
Marathi
Math / equation detection module
Mongolian
Nepali
Norwegian
Occitan
Oriya
Panjabi
Pashto
Persian
Polish
Portuguese
Quechua
Romanian/Moldavian
Russian
Sanskrit
Scottish Gaelic
Serbian (Latin)
Slovak
Slovenian
Sindhi
Sinhalese
Spanish
Sundanese
Swahili
Swedish
Syriac
Tajik
Tamil
Tatar
Telugu
Thai
Tibetan
Tigrinya
Tonga
Turkish
Uighur
Ukrainian
Urdu
Uzbek (Latin)
Vietnamese
Welsh
West Frisian
Yiddish
Yoruba

Image types supported

This add-on supports the following types of files:

PDF
jpg
tif
png
bmp
pnm
pbm
pgm
jp2
gif
jfif
jpeg
tiff
spix
webp
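As a rough illustration of the list above together with the output naming rule described under Information (same folder, same name, text extension), the following sketch checks whether a file has a supported extension and derives where the recognized text would be written. The names SUPPORTED_EXTENSIONS, is_supported and output_text_path are hypothetical and are not taken from the add-on's code; the sketch writes the extension in lowercase, which is an assumption.

```python
from pathlib import Path

# Extensions copied from the list above, in lowercase.
SUPPORTED_EXTENSIONS = {
    "pdf", "jpg", "tif", "png", "bmp", "pnm", "pbm", "pgm",
    "jp2", "gif", "jfif", "jpeg", "tiff", "spix", "webp",
}


def is_supported(path: str) -> bool:
    """Return True if the file extension is in the supported list."""
    # Compare case-insensitively and without the leading dot.
    return Path(path).suffix.lower().lstrip(".") in SUPPORTED_EXTENSIONS


def output_text_path(path: str) -> Path:
    """Return the same folder and base name with a .txt extension."""
    return Path(path).with_suffix(".txt")


for name in ["report.pdf", "photo.JPG", "notes.docx"]:
    if is_supported(name):
        print(name, "->", output_text_path(name))
    else:
        print(name, "is not a supported file type")
```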
