OCR-Generated Text Layers Not Readable by PDF Readers for RTL Languages Like Persian

ocrmypdf / OCRmyPDF

OCRmyPDF adds an OCR text layer to scanned PDF files, allowing them to be searched

http://ocrmypdf.readthedocs.io/

Mozilla Public License 2.0


OCR-Generated Text Layers Not Readable by PDF Readers for RTL Languages Like Persian #1157

Open PSEUDO-SAPPHO opened 1 year ago

PSEUDO-SAPPHO commented 1 year ago

What were you trying to do?

I have used ocrmypdf to perform OCR on a PDF document, but I'm encountering a specific issue with RTL (right-to-left) languages like Persian. Despite successful OCR processing, the text in the resulting PDF is not selectable or searchable within PDF readers like Foxit Reader or other popular PDF viewers.

In Foxit Reader, the OCR-generated text was not in RTL order. In Zotero's PDF reader, the words were split apart. In Chrome and Edge, however, I didn't encounter these issues: OCR works and the text output from ocrmypdf is available.

Where are you installing from?

Windows package manager (chocolatey, etc.)

What operating system are you working on?

Windows

Relevant log output

No response

jbarlow83 commented 1 year ago

Please provide an example file, the command you're using, and the versions you're using.

medmedin2014 commented 1 year ago

@jbarlow83 @PSEUDO-SAPPHO

ocrmypdf: 15.1.0
Operating System: Manjaro Linux 
KDE Plasma Version: 5.27.8
KDE Frameworks Version: 5.110.0
Qt Version: 5.15.11
Kernel Version: 6.5.7-2-MANJARO (64-bit)
Graphics Platform: Wayland

I can confirm the bug with Arabic: the text layer in the output PDF is reversed.

Source file: تقديم.pdf

Command: ocrmypdf -l ara -f تقديم.pdf out-تقديم.pdf

Output: out-تقديم.pdf

If you try to copy some text from the output pdf you will get Arabic letters copied in reverse order:

If you copy: (screenshot: Screenshot_20231014_124640)

You get: يساردلا لشفلا ةلأسم تتاب

Instead of: باتت مسألة الفشل الدراسي
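The two strings above appear to be related by a plain reversal of code points, which can be checked mechanically. The sketch below is illustrative only: `is_codepoint_reversed` is a hypothetical helper, not part of ocrmypdf or Tesseract, and the two literals are copied from this comment.

```python
def is_codepoint_reversed(extracted: str, expected: str) -> bool:
    """Return True if `extracted` is `expected` with its code points in reverse order."""
    return extracted == expected[::-1]


# Strings copied from the report above (assumed to be in logical order).
expected = "باتت مسألة الفشل الدراسي"
extracted = "يساردلا لشفلا ةلأسم تتاب"

# Prints whether the copied text is an exact code-point reversal of the expected text.
print(is_codepoint_reversed(extracted, expected))
```

A check like this only covers exact reversal; it would not catch text where combining marks or word order were handled differently.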

jbarlow83 commented 12 months ago

Unfortunately, this is an open issue in Tesseract PDF generation. https://github.com/tesseract-ocr/tesseract/issues/238 Other RTL languages might be affected too (Hebrew).
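To get a rough idea of whether a given piece of text would be affected, one can test for characters with a right-to-left bidirectional class. This is a minimal sketch using only Python's standard `unicodedata` module; `contains_rtl` is a hypothetical helper name, and the check says nothing about how any particular PDF viewer behaves.

```python
import unicodedata

# Unicode bidirectional classes for strong right-to-left characters:
# "R" covers scripts such as Hebrew, "AL" covers Arabic-script letters
# (used for Arabic and Persian, among others).
RTL_BIDI_CLASSES = {"R", "AL"}


def contains_rtl(text: str) -> bool:
    """Return True if any character in `text` has a strong RTL bidi class."""
    return any(unicodedata.bidirectional(ch) in RTL_BIDI_CLASSES for ch in text)
```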

jbarlow83 commented 10 months ago

Fixed in v16

AhmadHakami commented 9 months ago

@jbarlow83: Fixed in v16

This problem has not been solved yet, even with the updated version:

tesseract v5.3.1
ocrmypdf 16.0.3

Reference: وبعد الاطلاع علی الترتیبات التنظیمیة للمؤسسة
Searchable pdf: دعبو عالطالا یلع تابیترتلا ةیمیظنتلا ةسسؤملل
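In this report the word order looks intact but the letters inside each word are reversed, which differs from the whole-line reversal reported earlier for 15.1.0. A minimal sketch of that per-word pattern follows; `reverse_each_word` is a hypothetical helper (not an ocrmypdf or Tesseract function), and the literals are copied from this comment.

```python
def reverse_each_word(text: str) -> str:
    """Reverse the code points of every space-separated word, keeping word order."""
    return " ".join(word[::-1] for word in text.split(" "))


# Strings copied from the report above (assumed to be in logical order).
reference = "وبعد الاطلاع علی الترتیبات التنظیمیة للمؤسسة"
extracted = "دعبو عالطالا یلع تابیترتلا ةیمیظنتلا ةسسؤملل"

# Compare position by position: does each extracted word equal the
# corresponding reference word with its letters reversed?
for ref_word, got_word in zip(reference.split(" "), extracted.split(" ")):
    print(ref_word, got_word, got_word == ref_word[::-1])
```

Applying `reverse_each_word` to text extracted this way would be a workaround on the extracted string only; it does not change the text layer stored in the PDF.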

jbarlow83 commented 9 months ago

To confirm I'm not insane, the English translation of the first line should be something like "The issue of academic failure has become a matter of concern to parents, teachers, and public opinion alike over the decades..."

I did some experiments. It's difficult, since many programs handle RTL poorly, so it's hard to tell what is working in the first place.

AhmadHakami commented 8 months ago

Hi @jbarlow83 any updates?

jbarlow83 commented 8 months ago

Both Tesseract and OCRmyPDF use the Glyphless font approach to RTL. Glyphless is a font where every glyph is mapped to a non-printing character. I've come to believe that this approach won't work for RTL languages across all PDF viewers: it barely works for Tesseract, and the techniques that improve rendering for LTR languages over the Tesseract baseline don't work for RTL.

There are at least three ways to create RTL text and some viewers don't support some methods well.

At the very least, I believe I need to add a new character to the Glyphless font, which would be the blank RTL character. That would allow RTL fonts to be inserted in an approach that is closer to how RTL fonts are typically rendered, as far as I know anyway.

It would probably also help to have a blank double-width character for CJK characters, and maybe something for vertical CJK.

Alternatively, it looks like Noto Sans has become a universal open source font, and I could look into embedding it everywhere.

UsernamePlankalkul commented 5 days ago

Hi. This is not a problem with Tesseract, because the results of extracting RTL text from images are fine in Tesseract. It's something in ocrmypdf, maybe the encoding or the rearranging of the characters. I'm still looking for a solution.

SumatraPDF also shows the correct arrangement of characters, but we don't want to use that software because of its poor performance and lack of features.

Has anyone found a solution?