Open PSEUDO-SAPPHO opened 1 year ago
Please provide an example file, the command you're using, and the versions you're using.
@jbarlow83 @PSEUDO-SAPPHO
ocrmypdf: 15.1.0
Operating System: Manjaro Linux
KDE Plasma Version: 5.27.8
KDE Frameworks Version: 5.110.0
Qt Version: 5.15.11
Kernel Version: 6.5.7-2-MANJARO (64-bit)
Graphics Platform: Wayland
I can confirm the bug with Arabic: the text layer in the output PDF is reversed.
Source file: تقديم.pdf
Command:
ocrmypdf -l ara -f تقديم.pdf out-تقديم.pdf
Output: out-تقديم.pdf
If you copy some text from the output PDF, the Arabic letters come out in reverse order:
If you copy:
You get: يساردلا لشفلا ةلأسم تتاب
Instead of: باتت مسألة الفشل الدراسي
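One quick way to check that the pasted text is a plain character-by-character reversal of the expected text is to reverse it in Python. This is only a diagnostic sketch: the helper name `is_char_reversed` is made up for illustration, and the two strings are the ones quoted above.

```python
def is_char_reversed(extracted: str, expected: str) -> bool:
    """Return True if `extracted` equals `expected` with its code points reversed.

    Note: this compares raw code points. Text containing combining marks
    (e.g. Arabic diacritics) may need extra care, since reversing code points
    moves a combining mark in front of its base letter.
    """
    return extracted[::-1] == expected


# Strings as quoted in this comment (copied from the viewer vs. the source text).
copied = "يساردلا لشفلا ةلأسم تتاب"
expected = "باتت مسألة الفشل الدراسي"

print(is_char_reversed(copied, expected))
```

If this prints `True`, the text layer holds the characters in visual (reversed) order rather than logical order.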
Unfortunately, this is an open issue in Tesseract PDF generation. https://github.com/tesseract-ocr/tesseract/issues/238 Other RTL languages might be affected too (Hebrew).
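To see whether a given piece of text falls into the affected group, one option is to look at the Unicode bidirectional class of each character. The sketch below uses only the standard library; the helper name `contains_rtl` is hypothetical, and it treats the strong classes `R` (e.g. Hebrew) and `AL` (Arabic letters, which also covers Persian) as right-to-left.

```python
import unicodedata

# Strong right-to-left bidirectional classes defined by Unicode.
RTL_CLASSES = {"R", "AL"}


def contains_rtl(text: str) -> bool:
    """Return True if any character in `text` has a strong RTL bidi class."""
    return any(unicodedata.bidirectional(ch) in RTL_CLASSES for ch in text)


print(contains_rtl("שלום"))   # Hebrew sample
print(contains_rtl("hello"))  # Latin sample
```

A check like this only says that the text is RTL and therefore worth testing in a viewer; it says nothing about whether a particular PDF is actually affected.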
@jbarlow83: Fixed in v16
This problem has not been solved yet, even with the updated version:
tesseract v5.3.1
ocrmypdf 16.0.3
Reference: وبعد الاطلاع علی الترتیبات التنظیمیة للمؤسسة
Searchable pdf: دعبو عالطالا یلع تابیترتلا ةیمیظنتلا ةسسؤملل
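In this pair the pattern appears to differ from the earlier report: the word order is kept, but the letters inside each word are reversed. A rough sketch for telling the cases apart is below; the function names are hypothetical and it assumes words are separated by single spaces.

```python
def reverse_each_word(text: str) -> str:
    """Reverse the characters of every space-separated word, keeping word order."""
    return " ".join(word[::-1] for word in text.split(" "))


def classify(extracted: str, expected: str) -> str:
    """Describe how `extracted` relates to `expected` (the correct logical-order text)."""
    if extracted == expected:
        return "logical order"
    if extracted[::-1] == expected:
        return "whole line reversed"
    if reverse_each_word(extracted) == expected:
        return "each word reversed"
    return "other"


# Strings as quoted in this comment.
reference = "وبعد الاطلاع علی الترتیبات التنظیمیة للمؤسسة"
searchable = "دعبو عالطالا یلع تابیترتلا ةیمیظنتلا ةسسؤملل"

print(classify(searchable, reference))
```

Knowing which of the two patterns a file shows may help narrow down where the reordering happens.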
To confirm I'm not insane, the English translation of the first line should be something like "The issue of academic failure has become a matter of concern to parents, teachers, and public opinion alike over the decades..."
I did some experiments. It's difficult because many programs handle RTL poorly, so it's hard to tell what is working in the first place.
Hi @jbarlow83 any updates?
Both Tesseract and OCRmyPDF use the Glyphless font approach to RTL. Glyphless is a font where every glyph is mapped to a non-printing character. I've come to believe that this approach won't work for RTL languages across all PDF viewers, that it barely works for Tesseract, and that techniques that improve rendering for LTR languages over the Tesseract baseline don't work for RTL.
There are at least three ways to create RTL text and some viewers don't support some methods well.
At the very least I believe I need to add a new character to the Glyphless font, which would be the blank RTL character. That would allow RTL text to be inserted in a way that is closer to how RTL fonts are typically rendered, as far as I know anyway.
It would probably also help to have a blank double-width character for CJK characters, and maybe something for vertical CJK.
Alternatively, it looks like Noto Sans has become a universal open source font and I could look into embedding it everywhere.
Hi, this is not a problem with Tesseract, because the results of extracting RTL text from images are fine in Tesseract. It is something in ocrmypdf, maybe the encoding or the rearranging of the characters. I'm still looking for a solution.
SumatraPDF also shows the correct arrangement of characters, but we don't want to use that software because of its poor performance and lack of features.
Did anyone find a solution?
What were you trying to do?
I have used ocrmypdf to perform OCR on a PDF document, but I'm encountering a specific issue with RTL (right-to-left) languages like Persian. Despite successful OCR processing, the text in the resulting PDF is not selectable or searchable in Foxit Reader or other popular PDF viewers.
I tested Foxit Reader and the OCR-generated text was not RTL. However, when using Zotero's PDF reader, I observed that words are separated. It's worth noting that I tested this PDF in Chrome and Edge and didn't encounter these issues: OCR works and the text output is available with "ocrmypdf".
Where are you installing from?
Windows package manager (chocolatey, etc.)
What operating system are you working on?
Windows
Relevant log output
No response