Open insinfo opened 3 weeks ago
Latest Tesseract with the model script/Latin gives a better result for the first image:
ESTADO DO RIO DE JANEIRO
Prefeitura Municipal de Rio das Ostras
PROTOCOLO GERAL
Cent EO
Processo: 18457 / 2003 Data: 03/09/2003 Hora: 10:53:56
Requerente: COSCARELLI E CIA LTDA ME 2, '
` Sec Destino: Secretaria Municipal de rarako OS
Dept.Destino: Dept? de Tributos è Fiscalização
Assunto: ALVARA A i L J: j4 0
ES
@stweil
What is the config to get this result in Portuguese? Is it "-l lat+script/Latin" or "-l por+script/Latin"?
config_tesseract = fr'--tessdata-dir "{TESSDATA_PREFIX}" -l lat+script/Latin --oem 3 --psm 6'
It's simply -l script/Latin (or -l Latin, depending on your Linux distribution or local installation). The script Latin includes all Western European languages which are using the same script (instead of Greek or Cyrillic).
Note also that a correct installation of Tesseract does not need --tessdata-dir or TESSDATA_PREFIX, so avoid both (unless you have very special needs).
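For reference, here is a minimal sketch of the advice above in Python, calling the Tesseract command line directly. It assumes that the tesseract executable is on the PATH and that the script/Latin model is installed where Tesseract looks by default. The helper names build_tesseract_command and run_ocr are made up for illustration. The --oem 3 --psm 6 values are carried over from the config in the question.

```python
import subprocess
import sys


def build_tesseract_command(image_path, lang="script/Latin", oem=3, psm=6):
    """Return the argument list for a Tesseract CLI call that prints text to stdout.

    No --tessdata-dir is passed: a correct installation finds its models itself.
    Use lang="Latin" if your installation names the script model that way.
    """
    return [
        "tesseract", image_path, "stdout",
        "-l", lang,
        "--oem", str(oem),
        "--psm", str(psm),
    ]


def run_ocr(image_path, lang="script/Latin"):
    """Run Tesseract on one image and return the recognized text."""
    cmd = build_tesseract_command(image_path, lang)
    # Decode explicitly as UTF-8 so accented characters survive on Windows too.
    result = subprocess.run(cmd, capture_output=True, encoding="utf-8", check=True)
    return result.stdout


if __name__ == "__main__" and len(sys.argv) > 1:
    print(run_ocr(sys.argv[1]))
```

With pytesseract, the equivalent would presumably be to pass only '-l script/Latin --oem 3 --psm 6' as the config string and drop the --tessdata-dir part.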
I tested OCR on scanned documents in Brazilian Portuguese and saw that Tesseract makes a lot of mistakes on them.
Current Behavior
result from https://huggingface.co/spaces/kneelesh48/Tesseract-OCR
Expected Behavior
the correct thing would be
Windows 11
https://huggingface.co/spaces/kneelesh48/Tesseract-OCR