Open Shreeshrii opened 6 years ago
Related issue - https://github.com/tesseract-ocr/tesseract/issues/238
Arabic text (written right to left) is stored left to right after creating a searchable PDF.
Wow, this looks really painful to handle...
I am trying to reverse every text child in the code. In HOCRPdfExporter.cc, line 729, I am changing painter.drawText(wordRect.x() * px2pu, y * px2pu, wordItem->text()); to painter.drawText(wordRect.x() * px2pu, y * px2pu, reverseSTR(wordItem->text()));
This should be enough for the Arabic RTL problem; I am not sure about other complex scripts.
But I am having a hard time compiling with Docker!
I'll keep trying.
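The reverseSTR helper mentioned above is not shown in the thread. As a rough sketch of what such a helper might do, the following reverses a UTF-8 string code point by code point. The name reverseUtf8 and the use of std::string are assumptions made for illustration; in the exporter itself the text is whatever wordItem->text() returns, so the real helper would work on that type. Note that plain reversal is only an approximation: it puts combining marks before their base letters and also flips digits and embedded Latin words, which the Unicode Bidirectional Algorithm (UAX #9) handles properly.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical stand-in for the reverseSTR helper mentioned in the comment.
// Reverses a UTF-8 string by code point so multi-byte characters stay intact.
std::string reverseUtf8(const std::string& text) {
    std::vector<std::string> codePoints;
    std::size_t i = 0;
    while (i < text.size()) {
        std::size_t len = 1;
        // UTF-8 continuation bytes have the bit pattern 10xxxxxx;
        // keep them attached to the lead byte that precedes them.
        while (i + len < text.size() &&
               (static_cast<unsigned char>(text[i + len]) & 0xC0) == 0x80) {
            ++len;
        }
        codePoints.push_back(text.substr(i, len));
        i += len;
    }

    // Concatenate the code points in reverse order.
    std::string reversed;
    reversed.reserve(text.size());
    for (auto it = codePoints.rbegin(); it != codePoints.rend(); ++it) {
        reversed += *it;
    }
    return reversed;
}
```

Reversing whole code points rather than bytes matters here because every Arabic letter takes two bytes in UTF-8; a byte-wise reversal would produce invalid text.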
What issues are you encountering with docker? Happy to help there.
Actually, I am using Docker Toolbox on Windows via a VirtualBox VM instance. When I run the first command to build the image, Fedora reports that a GPG key is missing, or something like that. Can I compile for Windows from Ubuntu using Docker? That was going to be my next attempt.
@Shreeshrii No, the HOCR and plain text outputs are correct.
@bmwmy Looks like some transient issues with the Fedora repos. You can work around it by adding --nogpgcheck to [1], i.e. dnf install -y --nogpgcheck.
Sure, you can use docker on any OS it runs on.
[1] https://github.com/manisandro/gImageReader/blob/master/packaging/win32/Dockerfile#L9
FYI
Please see attached. It is the output of the PDF export from Scribus 1.5.4svn. It seems to have correct Arabic support and loads podofo as one of its components. Not sure if it helps with the hocr3pdf issue.
Related blog post: http://host-oman.blogspot.in/2017/02/first-5-arabic-books-typesetting-in.html
FYI
It seems that Scribus deals with RTL languages differently; as far as I can tell, it stores Arabic text in reverse order. Some of the if statements there check whether the text is Arabic!
https://github.com/scribusproject/scribus/search?q=arabic&unscoped_q=arabic
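For illustration only, a check of that kind could look like the sketch below. This is not Scribus code; the function names isArabicCodePoint and containsArabic are hypothetical, and the test simply looks for code points in the Unicode blocks assigned to Arabic script.

```cpp
#include <cassert>
#include <string>

// Hypothetical check, not taken from Scribus: true if the code point lies
// in one of the Unicode blocks used for Arabic script.
bool isArabicCodePoint(char32_t cp) {
    return (cp >= 0x0600 && cp <= 0x06FF) ||  // Arabic
           (cp >= 0x0750 && cp <= 0x077F) ||  // Arabic Supplement
           (cp >= 0x08A0 && cp <= 0x08FF) ||  // Arabic Extended-A
           (cp >= 0xFB50 && cp <= 0xFDFF) ||  // Arabic Presentation Forms-A
           (cp >= 0xFE70 && cp <= 0xFEFF);    // Arabic Presentation Forms-B
}

// True if any code point of the (UTF-32) text is Arabic.
bool containsArabic(const std::u32string& text) {
    for (char32_t cp : text) {
        if (isArabicCodePoint(cp)) {
            return true;
        }
    }
    return false;
}
```

A per-word check like this would let an exporter apply RTL handling only to words that need it and leave Latin or Devanagari words untouched.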
The PDF output is not correct for Devanagari script when using the experimental 3.2.3 version for Tesseract 4.0.0alpha.
Please see the attached zip file with the input image, text, HOCR and PDF output.
If I copy the text from the PDF and paste it into Notepad++, the rendering is correct. However, the rendering in the PDF file itself is incorrect.
skanda700test.zip