ocrmypdf / OCRmyPDF

OCRmyPDF adds an OCR text layer to scanned PDF files, allowing them to be searched
http://ocrmypdf.readthedocs.io/
Mozilla Public License 2.0

[Bug]: Unable to proceed with a custom language lacking a dictionary #1411

Closed vchgan closed 2 days ago

vchgan commented 2 days ago

Describe the bug

I used to be able to OCR images using a custom IAST traineddata set despite error messages that it was lacking a dictionary, but now ocrmypdf does not proceed and gives the following message:

OCR engine does not have language data for the following requested languages:                             __main__.py:69
IAST
Please install the appropriate language data for your OCR engine.

See the online documentation for instructions:
    https://ocrmypdf.readthedocs.io/en/latest/languages.html

Note: most languages are identified by a 3-letter ISO 639-2 Code.
For example, English is 'eng', German is 'deu', and Spanish is 'spa'.
Simplified Chinese is 'chi_sim' and Traditional Chinese is 'chi_tra'.

Tesseract itself prints an error message but still recognizes the text correctly, so the problem is not with the traineddata set or with Tesseract. In fact, recent discussions indicate that the linked language dictionary is no longer important for recognition.

Failed to load any lstm-specific dictionaries for lang IAST!!
Tesseract Open Source OCR Engine v4.1.1 with Leptonica

I am not sure why the change happened or whether it can be corrected. I need this custom language on an ongoing basis to recognize Sanskrit transliteration with diacritics within texts (https://github.com/Shreeshrii/tesstrain-Sanskrit-IAST).

Steps to reproduce

1. ocrmypdf pg2.pdf pg2-3.pdf -l IAST

OCR engine does not have language data for the following requested languages:                             __main__.py:69
IAST
Please install the appropriate language data for your OCR engine.

See the online documentation for instructions:
    https://ocrmypdf.readthedocs.io/en/latest/languages.html

Note: most languages are identified by a 3-letter ISO 639-2 Code.
For example, English is 'eng', German is 'deu', and Spanish is 'spa'.
Simplified Chinese is 'chi_sim' and Traditional Chinese is 'chi_tra'.

2. Using tesseract, IAST runs and recognizes diacritics (pg2.txt)
3. Using ocrmypdf without IAST (with pg2.pdf), diacritics not recognized (pg2-2.pdf)
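
The thread does not establish why ocrmypdf stopped finding IAST. One possibility (an assumption, not confirmed here) is that ocrmypdf was invoking a different Tesseract installation than the interactive shell, one whose tessdata directory has no `IAST.traineddata`. A minimal Python sketch for checking that is shown below. The helper names `describe_tesseract` and `find_traineddata` are hypothetical, and the directories to search must be supplied by the caller because the actual tessdata location depends on how Tesseract was installed.

```python
import os
import shutil
from pathlib import Path


def describe_tesseract(env=None, which=shutil.which):
    """Report which tesseract executable and tessdata hint are in effect.

    `env` and `which` are injectable (hypothetical convenience) so the
    function can be exercised without a real Tesseract installation.
    """
    env = os.environ if env is None else env
    return {
        # First `tesseract` found on PATH, or None if there is none.
        "executable": which("tesseract"),
        # Tesseract consults this variable when locating traineddata files.
        "TESSDATA_PREFIX": env.get("TESSDATA_PREFIX"),
    }


def find_traineddata(lang, directories):
    """Return the directories that contain `<lang>.traineddata`."""
    found = []
    for directory in directories:
        candidate = Path(directory) / f"{lang}.traineddata"
        if candidate.is_file():
            found.append(str(directory))
    return found
```

Running `describe_tesseract()` once from the shell and once from the environment in which ocrmypdf runs, then passing the candidate tessdata directories to `find_traineddata("IAST", ...)`, would show whether both environments can see the custom model.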

Files

pg2-2.pdf pg.txt pg2.pdf

How did you download and install the software?

Ubuntu snap

OCRmyPDF version

v16.4.2+git1.39010dd2

Relevant log output

I noticed that when I run:
$ tesseract --list-langs
List of available languages (6):
IAST
chi_sim
chi_tra
ell
eng
osd

But in the verbose output for ocrmypdf: 

Running: ['tesseract', '--list-langs']                                                                   __init__.py:133
stdout/stderr = List of available languages (161):                                                        __init__.py:73
Arabic
Armenian
Bengali
Canadian_Aboriginal
Cherokee
Cyrillic
Devanagari
Ethiopic
Fraktur
Georgian
Greek
Gujarati
Gurmukhi
HanS
HanS_vert
HanT
HanT_vert
Hangul
Hangul_vert
Hebrew
Japanese
Japanese_vert
Kannada
Khmer
Lao
Latin
Malayalam
Myanmar
Oriya
Sinhala
Syriac
Tamil
Telugu
Thaana
Thai
Tibetan
Vietnamese
afr
amh
ara
asm
aze
aze_cyrl
bel
ben
bod
bos
bre
bul
cat
ceb
ces
chi_sim
chi_sim_vert
chi_tra
chi_tra_vert
chr
cos
cym
dan
deu
div
dzo
ell
eng
enm
epo
est
eus
fao
fas
fil
fin
fra
frk
frm
fry
gla
gle
glg
grc
guj
hat
heb
hin
hrv
hun
hye
iku
ind
isl
ita
ita_old
jav
jpn
jpn_vert
kan
kat
kat_old
kaz
khm
kir
kmr
kor
kor_vert
lao
lat
lav
lit
ltz
mal
mar
mkd
mlt
mon
mri
msa
mya
nep
nld
nor
oci
ori
osd
pan
pol
por
pus
que
ron
rus
san
sin
slk
slv
snd
spa
spa_old
sqi
srp
srp_latn
sun
swa
swe
syr
tam
tat
tel
tgk
tha
tir
ton
tur
uig
ukr
urd
uzb
uzb_cyrl
vie
yid
yor

OCR engine does not have language data for the following requested languages:                             __main__.py:69
IAST
Please install the appropriate language data for your OCR engine.

See the online documentation for instructions:
    https://ocrmypdf.readthedocs.io/en/latest/languages.html

Note: most languages are identified by a 3-letter ISO 639-2 Code.
For example, English is 'eng', German is 'deu', and Spanish is 'spa'.
Simplified Chinese is 'chi_sim' and Traditional Chinese is 'chi_tra'.
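
The two `--list-langs` outputs above differ (6 languages in the shell, 161 in the ocrmypdf log), and only the shell list contains IAST. The comparison can be sketched in Python as follows. This is an illustrative sketch, not OCRmyPDF's actual check: the function names are hypothetical, and splitting the requested languages on `+` assumes the usual `eng+deu` form of the `-l` argument.

```python
import subprocess


def parse_list_langs(output: str) -> set[str]:
    """Extract language names from `tesseract --list-langs` output.

    The first line is a header such as "List of available languages (6):";
    every other non-empty line is treated as one language name.
    """
    names = set()
    for line in output.splitlines():
        line = line.strip()
        if line and not line.startswith("List of available languages"):
            names.add(line)
    return names


def missing_languages(requested: str, available: set[str]) -> list[str]:
    """Return the requested languages (e.g. "eng+IAST") not in `available`."""
    return [lang for lang in requested.split("+") if lang and lang not in available]


def installed_languages(tesseract: str = "tesseract") -> set[str]:
    """Ask the given tesseract executable which languages it can see."""
    result = subprocess.run(
        [tesseract, "--list-langs"],
        capture_output=True,
        text=True,
        check=True,
    )
    # The log above labels this stream "stdout/stderr", so read both.
    # Any warnings printed alongside the list would be parsed as names.
    return parse_list_langs(result.stdout + "\n" + result.stderr)
```

Applied to the 6-language shell output quoted above, `missing_languages("IAST", ...)` returns an empty list. Applied to the 161-language list from the verbose log, it returns `["IAST"]`, which matches the error ocrmypdf reported.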

vchgan commented 2 days ago

Uninstalling and reinstalling tesseract and ocrmypdf seems to have solved the issue. Sorry!