tesseract-ocr / tesseract

Tesseract Open Source OCR Engine (main repository)
https://tesseract-ocr.github.io/
Apache License 2.0

Tesseract doesn't always recognise diacritics #4276

Open arsinclair opened 5 days ago

arsinclair commented 5 days ago

Current Behavior

I'm using Tesseract indirectly as part of OCRmyPDF and I'm coming here from this issue.

When OCR'ing English (Latin-script) text that contains diacritics, Tesseract doesn't always recognise them. The diacritics in my document are part of surnames originating from Hungary and Belgium.

I've tried with just English, with the English + Hungarian dictionaries, and with the Latin script model (which has an extended character map), all to no avail.

The words: poéme, pathétique, animé are recognised.

The words: Ysaÿe, Jenő, Petőfi, etc. are not recognised.

The words: csárdás, Telmányi, Dvořák are recognised only with the Latin script model.

Expected Behavior

The diacritics should be recognised.

Source files

![000001_ocr](https://github.com/tesseract-ocr/tesseract/assets/2878904/6b16a7a3-c520-4787-b6a0-370901979bf7)
![000004_ocr](https://github.com/tesseract-ocr/tesseract/assets/2878904/27f9fb44-7f32-4de7-9668-a3cd780b246b)
![000005_ocr](https://github.com/tesseract-ocr/tesseract/assets/2878904/c0eb3b7e-940b-4d6c-bd83-5033d5805c1c)
![000002_ocr](https://github.com/tesseract-ocr/tesseract/assets/2878904/8bec3994-8530-4cba-8590-3d8d4414a123)
![000003_ocr](https://github.com/tesseract-ocr/tesseract/assets/2878904/aa9c950d-d572-4861-b870-f5e60acdd102)

tesseract -v

tesseract 5.3.4
 leptonica-1.82.0
  libgif 5.2.2 : libjpeg 6b (libjpeg-turbo 2.1.5) : libpng 1.6.43 : libtiff 4.5.1 : zlib 1.3 : libwebp 1.4.0 : libopenjp2 2.5.0
 Found AVX2
 Found AVX
 Found FMA
 Found SSE4.1
 Found OpenMP 201511
 Found libarchive 3.7.2 zlib/1.3.1 liblzma/5.4.5 bz2lib/1.0.8 liblz4/1.9.4 libzstd/1.5.5
 Found libcurl/8.8.0 OpenSSL/3.2.2 zlib/1.3.1 brotli/1.1.0 zstd/1.5.5 libidn2/2.3.7 libpsl/0.21.2 libssh2/1.11.0 nghttp2/1.62.1 librtmp/2.3 OpenLDAP/2.5.18

Operating System

Debian Testing (Bookworm)

uname -a

Linux jrm-ws 6.8.12-amd64 #1 SMP PREEMPT_DYNAMIC Debian 6.8.12-1 (2024-05-31) x86_64 GNU/Linux

stweil commented 5 days ago

eng.traineddata was not trained with diacritics (see https://github.com/tesseract-ocr/langdata_lstm/blob/main/eng/eng.unicharset) and therefore cannot recognize them.

Latin.traineddata was trained with some diacritics (see https://github.com/tesseract-ocr/langdata_lstm/blob/main/script/Latin/Latin.unicharset) and therefore works better with your text. As far as I see "ő" is missing in its supported characters.
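As a rough way to check this yourself, here is a minimal Python sketch that reads the text of a unicharset file and reports which characters of a word are not listed in it. It assumes the usual layout where the first line is the entry count and every following line starts with the character, then a space. The helper names `load_unicharset` and `missing_chars` and the tiny `SAMPLE` data are made up for illustration, and the sample is heavily simplified compared to a real unicharset. A character being listed does not guarantee good recognition; this only catches characters the model cannot output at all.

```python
import unicodedata


def load_unicharset(text: str) -> set[str]:
    """Return the set of unichars listed in the text of a unicharset file."""
    chars = set()
    # Assumption: line 1 is the entry count; each later line begins with
    # the unichar, followed by a space and its properties.
    for line in text.splitlines()[1:]:
        if line.strip():
            chars.add(line.split(" ", 1)[0])
    return chars


def missing_chars(word: str, charset: set[str]) -> list[str]:
    """List the characters of `word` that do not appear in `charset`."""
    # Normalise to composed form so "o" + combining accent becomes one character.
    word = unicodedata.normalize("NFC", word)
    return sorted({ch for ch in word if ch not in charset})


# Hypothetical, heavily simplified sample (not the real Latin.unicharset).
SAMPLE = """6
NULL 0 Common 0
J 5 Latin 1
e 3 Latin 2
n 3 Latin 3
é 3 Latin 4
P 5 Latin 5
"""

charset = load_unicharset(SAMPLE)
print(missing_chars("Jenő", charset))  # ['ő']
```

To run it against a real model, pass the contents of the downloaded unicharset file (read as UTF-8) to `load_unicharset` instead of `SAMPLE`.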

So your results are expected with the given models, and it's not a Tesseract issue.

"ő" is included in hun.traineddata, so you could try Latin+hun, but training a new model would be better.
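For reference, a small sketch of how the combined-language call could be built from Python. Tesseract's `-l` option accepts several models joined with `+`. The file names `page.png` and `page` and the helper `tesseract_command` are placeholders, and the exact name of the Latin script model (for example `Latin` versus `script/Latin`) depends on how the traineddata files are installed on your system.

```python
import shlex


def tesseract_command(image_path: str, output_base: str, languages: list[str]) -> list[str]:
    """Build a tesseract command line that loads several models at once."""
    # Models are combined by joining their names with "+", e.g. "Latin+hun".
    return ["tesseract", image_path, output_base, "-l", "+".join(languages)]


cmd = tesseract_command("page.png", "page", ["Latin", "hun"])
print(shlex.join(cmd))
# To actually run it (requires tesseract and both models to be installed):
# import subprocess; subprocess.run(cmd, check=True)
```

The order of the models can influence the result, so it may be worth trying `hun+Latin` as well.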

arsinclair commented 4 days ago

> "ő" is included in hun.traineddata, so you could try Latin+hun, but training a new model would be better.

I tried with Hungarian and Latin too, but it didn't always work. If training a new model is the only way forward, will I have to do it myself, or can the missing characters be added to the existing Tesseract models?