tesseract-ocr / tesseract

Tesseract Open Source OCR Engine (main repository)
https://tesseract-ocr.github.io/
Apache License 2.0

Add better support for Brazilian Portuguese #4302

Open insinfo opened 3 weeks ago

insinfo commented 3 weeks ago

I tested OCR on scanned documents in Brazilian Portuguese and found that Tesseract makes many mistakes on them.

Current Behavior

Result from https://huggingface.co/spaces/kneelesh48/Tesseract-OCR

1-1

ESTADO DO RIO DE JANEIRO
Prefeitura Municipal de Rio das Ostras
PROTOCOLO GERAL

“Chast vO

Precesse: 18457 J 2003 Data: 03/09/2003 Hora: 10:53:56
Requerente: COSCARELLI E CIALTDA ME 2 ;
* Sec.Destino: Secretaria Municipal de Fazend we
Dept.Destine: Dept? de Tributes @ Fiscalizagao

4
Assunto: ALVARA o Lh 3. )40

Expected Behavior

ESTADO DO RIO DE JANEIRO
Prefeitura Municipal de Rio das Ostras
PROTOCOLO GERAL

Processo: 18457 / 2003
Data: 03/09/2003
Hora: 10:53:56
Requerente: COSCARELLI E CIA LTDA ME
Sec. Destino: Secretaria Municipal de Fazenda
Dept. Destino: Depto. de Tributos e Fiscalização
Assunto: ALVARÁ

Current Behavior

Result from https://huggingface.co/spaces/kneelesh48/Tesseract-OCR

110-1

ESTADO DO RIO DE JANEIRO
Prefeitura Municipal de Rio das Ostras
PROTOCOLO GERAL

Frocesso 153

14 ¢ 2003 data 2540712003 Hora: 16:48:28

COLOMIA DE PESCADOPES 2.00

a oe

pcos

Expected Behavior

The correct output would be:

ESTADO DO RIO DE JANEIRO
Prefeitura Municipal de Rio das Ostras
PROTOCOLO GERAL

Processo: 15314 / 2003
Data: 25/07/2003
Hora: 16:18:28

Requerente: COLÔNIA DE PESCADORES Z-22
Sec. Destino: Sec. Mun. Urbanismo Obras e S. Pub.
Dept. Destino: 0
Assunto: AGRADECIMENTO / FAZ

Windows 11

https://huggingface.co/spaces/kneelesh48/Tesseract-OCR

stweil commented 3 weeks ago

Latest Tesseract with the model script/Latin gives a better result for the first image:

ESTADO DO RIO DE JANEIRO

Prefeitura Municipal de Rio das Ostras
PROTOCOLO GERAL

Cent EO
Processo: 18457 / 2003 Data: 03/09/2003 Hora: 10:53:56
Requerente: COSCARELLI E CIA LTDA ME 2, '
` Sec Destino: Secretaria Municipal de rarako OS
Dept.Destino: Dept? de Tributos è Fiscalização

Assunto: ALVARA A i L J: j4 0

ES

filipe-smartins commented 1 week ago

@stweil

What is the config to get this result in Portuguese? Is it "-l lat+script/Latin" or "-l por+script/Latin"?

config_tesseract = fr'--tessdata-dir "{TESSDATA_PREFIX}" -l lat+script/Latin --oem 3 --psm 6'

stweil commented 1 week ago

It's simply -l script/Latin (or -l Latin, depending on your Linux distribution or local installation). The Latin script model covers all Western European languages that are written in the Latin script (as opposed to Greek or Cyrillic).

stweil commented 1 week ago

Note also that a correct installation of Tesseract does not need --tessdata-dir or TESSDATA_PREFIX, so avoid both (unless you have very special needs).
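
For illustration, the two suggestions above (use only the script/Latin model, and drop --tessdata-dir) could be combined as in the minimal sketch below. It assumes the tesseract executable is on PATH and that the script/Latin model is installed under that name (it may be just Latin on some installations). The helper names build_tesseract_cmd and ocr are hypothetical, and the --oem 3 --psm 6 values are simply carried over from the config quoted earlier in the thread.

```python
import shutil
import subprocess


def build_tesseract_cmd(image_path, lang="script/Latin", oem=3, psm=6):
    """Return the argument list for a Tesseract CLI call that prints text to stdout."""
    # No --tessdata-dir here: a correct installation locates its models itself.
    return [
        "tesseract", image_path, "stdout",
        "-l", lang,          # use "Latin" if the model is installed under that name
        "--oem", str(oem),
        "--psm", str(psm),
    ]


def ocr(image_path, lang="script/Latin"):
    """Run Tesseract on image_path and return the recognized text."""
    if shutil.which("tesseract") is None:
        raise RuntimeError("tesseract executable not found on PATH")
    result = subprocess.run(
        build_tesseract_cmd(image_path, lang),
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout
```

Calling ocr("protocolo.png") would then return the recognized text, where protocolo.png is a placeholder file name and not one of the images attached to this issue.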