manisandro / gImageReader

A Gtk/Qt front-end to tesseract-ocr.
GNU General Public License v3.0

Investigate complex scripts in PoDoFo PDF export #291

Open Shreeshrii opened 6 years ago

Shreeshrii commented 6 years ago

The PDF output is not correct for Devanagari script when using the 3.2.3 experimental version for tesseract 4.0.0alpha.

Please see the attached zip file with the input image, text, hOCR and PDF output.

If I copy the text from the PDF and paste it into Notepad++, it renders correctly. However, the rendering in the PDF file itself is incorrect.

skanda700test.zip

Shreeshrii commented 6 years ago

Related issue - https://github.com/tesseract-ocr/tesseract/issues/238

Arabic language (right to left in writing) stored (left to right) after create PDF Searchable

manisandro commented 6 years ago

Wow, this looks really painful to handle...

bmwmy commented 6 years ago

I am trying to reverse every text child in the code. In HOCRPdfExporter.cc, line 729, I would change

painter.drawText(wordRect.x() * px2pu, y * px2pu, wordItem->text());

to

painter.drawText(wordRect.x() * px2pu, y * px2pu, reverseSTR(wordItem->text()));

This should be enough for the Arabic RTL problem; I am not sure about other complex scripts.

However, I am having a hard time compiling with Docker!

I'll keep trying.
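
For illustration, the reverseSTR helper proposed above is not defined anywhere in this thread. A minimal sketch of what it might look like follows, assuming the text is available as UTF-32 (std::u32string) rather than the QString that wordItem->text() presumably returns. The name reverseCodePoints is hypothetical and this is not code from gImageReader.

```cpp
#include <string>

// Hypothetical stand-in for the reverseSTR() helper mentioned above.
// Assumption: the word is given as UTF-32, so each element is one whole
// code point and reversing can never split a multi-unit character.
std::u32string reverseCodePoints(const std::u32string& text) {
    // Build a new string from the reversed iterator range.
    return std::u32string(text.rbegin(), text.rend());
}
```

A plain reversal like this also moves combining marks in front of their base letters and reverses embedded digits, which Arabic writes left to right. It does no glyph shaping either, so it would not help with Devanagari. A complete solution would likely need the Unicode Bidirectional Algorithm plus a shaping step.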

manisandro commented 6 years ago

What issues are you encountering with docker? Happy to help there.

bmwmy commented 6 years ago

Actually, I am using Docker Toolbox on Windows via a VirtualBox VM instance. When I try the first command to build the image, Fedora reports that a GPG key is missing, or something like that. Can I compile for Windows from Ubuntu using Docker? That was going to be my next attempt.

bmwmy commented 6 years ago

@Shreeshrii No, the hOCR and plain text outputs are correct.

manisandro commented 6 years ago

@bmwmy Looks like some transient issues with the Fedora repos, you can work around it by adding --nogpgcheck to [1], i.e. dnf install -y --nogpgcheck.

Sure, you can use docker on any OS it runs on.

[1] https://github.com/manisandro/gImageReader/blob/master/packaging/win32/Dockerfile#L9

Shreeshrii commented 6 years ago

FYI

Please see the attachment. It is the output of exporting to PDF from Scribus 1.5.4svn. It seems to have correct Arabic support and loaded PoDoFo as one of its components. Not sure if it helps with the hocr3pdf issue.

Related blog post: http://host-oman.blogspot.in/2017/02/first-5-arabic-books-typesetting-in.html



bmwmy commented 6 years ago

FYI

It seems that Scribus deals with RTL languages differently: as far as I can tell, they store Arabic text in reverse order. Some of the ifs in their code are there to detect whether the text is Arabic!

https://github.com/scribusproject/scribus/search?q=arabic&unscoped_q=arabic
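
As a rough illustration of that kind of check, a test for Arabic text can be written as a range comparison on Unicode code points. The sketch below is not Scribus's actual code; the function names isArabicCodePoint and containsArabic are hypothetical, and the input is assumed to be UTF-32.

```cpp
#include <string>

// Hypothetical helper: true if the code point falls in one of the main
// Unicode ranges used for Arabic script.
bool isArabicCodePoint(char32_t c) {
    return (c >= 0x0600 && c <= 0x06FF)    // Arabic
        || (c >= 0x0750 && c <= 0x077F)    // Arabic Supplement
        || (c >= 0xFB50 && c <= 0xFDFF)    // Arabic Presentation Forms-A
        || (c >= 0xFE70 && c <= 0xFEFC);   // Arabic Presentation Forms-B (letters end at U+FEFC)
}

// Hypothetical helper: true if any code point in the word is Arabic.
bool containsArabic(const std::u32string& text) {
    for (char32_t c : text) {
        if (isArabicCodePoint(c)) {
            return true;
        }
    }
    return false;
}
```

A per-word check like this could decide whether a word needs RTL handling before it is drawn, though other RTL scripts such as Hebrew would need their own ranges.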