manisandro / gImageReader

A Gtk/Qt front-end to tesseract-ocr.
GNU General Public License v3.0

Investigate complex scripts in PoDoFo PDF export #291

Open Shreeshrii opened 6 years ago

Shreeshrii commented 6 years ago

The pdf output is not correct for Devanagari script when using the 3.2.3 experimental version for tesseract 4.0.0alpha.

Please see attached zip file with input image, text, hocr and pdf output.

If I copy the text from pdf and paste in notepad++, the rendering is correct. However rendering in the pdf file itself is incorrect.

skanda700test.zip

manisandro commented 6 years ago

I fear this is a general issue with PoDoFo and complex scripts; that is, more work is needed to have PoDoFo handle these correctly.

manisandro commented 6 years ago

Actually, isn't it just a matter of picking the right font? I tried with a test image you sent me a while ago, installed the Lohit Devanagari font, selected that font for PDF export, and the output looks reasonable (from what I can judge), see attachment.

devanagari-text.pdf

Shreeshrii commented 6 years ago

It should work correctly with any Devanagari Unicode font.

The problem is not the font, rather it is the complex script rendering. In Devanagari, certain combining marks are reordered. Also, multiple consonants together give rise to different glyphs.

PoDoFo exported pdf has letters overlapping each other. The combining mark for i maatraa is not getting reordered to before the consonant - see lines 2 and 3.
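To illustrate the reordering described above, here is a minimal Python sketch. It lists the code points of कि in logical (stored) order and then applies a toy rule that moves the i maatraa (U+093F) in front of its consonant cluster, which is roughly what a shaping engine does for display. The function names are made up for this illustration, and the toy rule covers only this one vowel sign; it is not a substitute for a real shaper such as HarfBuzz.

```python
import unicodedata

VOWEL_SIGN_I = "\u093F"  # DEVANAGARI VOWEL SIGN I (i maatraa)
VIRAMA = "\u094D"        # DEVANAGARI SIGN VIRAMA, joins consonants into a conjunct


def describe(text):
    """Return (code point, Unicode name) pairs in logical (stored) order."""
    return [(f"U+{ord(ch):04X}", unicodedata.name(ch)) for ch in text]


def toy_visual_order(text):
    """Toy illustration only: move each i maatraa before its consonant cluster."""
    out = []
    for ch in text:
        if ch == VOWEL_SIGN_I and out:
            # Walk back over "consonant + virama" pairs to the cluster start.
            i = len(out) - 1
            while i >= 2 and out[i - 1] == VIRAMA:
                i -= 2
            out.insert(i, ch)
        else:
            out.append(ch)
    return "".join(out)


ki = "\u0915\u093F"  # stored as KA followed by VOWEL SIGN I
for code_point, name in describe(ki):
    print(code_point, name)
print([f"U+{ord(ch):04X}" for ch in toy_visual_order(ki)])
```

A renderer that draws the code points in stored order, without this step, places the vowel sign on the wrong side of the consonant, which matches the symptom reported for lines 2 and 3.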

I copied the text from the pdf you posted above into notepad++ and then printed it as pdf (in Lohit Devanagari font) so that it is easy to compare.

Please see attached.

devanagari-text-lohit-notepad.pdf

manisandro commented 6 years ago

Ah I see. Do you have any idea how tesseract handles this?

Shreeshrii commented 6 years ago

I think Cairo, Pango, Harfbuzz etc provide the support.

I searched the podofo archives earlier today; the only reference I found related to this is the thread https://sourceforge.net/p/podofo/mailman/message/32425071/. As of 2014, it seemed that podofo did not support this.

manisandro commented 6 years ago

Yeah I read the same thread - as I read it, PoDoFo isn't capable of handling it for you, but it should be possible to handle it with custom code outside of PoDoFo.

manisandro commented 6 years ago

But looking at the tesseract source, in particular pdfrenderer.cpp, I see no traces of pango or harfbuzz. It would be sufficient to figure out the low-level blocks that tesseract adds to the PDF; I could then write the same low-level blocks via PoDoFo instead of using the DrawText method, I suppose.
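As a rough idea of what such low-level blocks can look like, the sketch below builds a PDF content-stream fragment that places text in render mode 3 (neither filled nor stroked, so invisible), using only standard PDF text operators. It is not taken from tesseract's pdfrenderer.cpp. The font resource name `F1` and the assumption that the font accepts two-byte UTF-16BE codes are hypothetical choices made for this illustration.

```python
def invisible_text_ops(text, x, y, font_name="F1", font_size=12):
    """Build a content-stream fragment that places invisible text at (x, y).

    Assumes a font resource (hypothetical name "F1") whose encoding accepts
    two-byte codes, so the text is written as a UTF-16BE hex string.
    """
    hex_string = text.encode("utf-16-be").hex().upper()
    return "\n".join([
        "BT",                             # begin text object
        "3 Tr",                           # text render mode 3: invisible
        f"/{font_name} {font_size} Tf",   # select font and size
        f"{x} {y} Td",                    # move to the text position
        f"<{hex_string}> Tj",             # show the text
        "ET",                             # end text object
    ])


print(invisible_text_ops("\u0915\u093F", 72, 700))
```

Because the text is never painted, no glyph shaping is needed for the visible page; the scanned image supplies the appearance and the text layer only serves search and copy.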

Shreeshrii commented 6 years ago

https://github.com/phuang/pango https://www.cairographics.org/

Take a look at stringrenderer

https://github.com/tesseract-ocr/tesseract/blob/c773eb5784a9b895008240f23054d2ff916786a5/training/stringrenderer.cpp

manisandro commented 6 years ago

Okay I'll take a look when I find a moment.

Shreeshrii commented 6 years ago

Maybe it will work better with the Qtprinter.

http://doc.qt.io/qt-5/internationalization.html


manisandro commented 6 years ago

@Shreeshrii I've added a QPrinter backend for PDF export, please give it a try.

Shreeshrii commented 6 years ago

@manisandro Thanks for addressing this issue. Do you have a windows binary that I can test? I am on windows 10.

Shreeshrii commented 6 years ago

I tried with a test image you sent me a while ago, installed the Lohit Devanagari font, selected that font for PDF export, and the output looks reasonable (from what I can judge), see attachment.

If it is not possible to provide the windows binary now, please create the test output as you had done before.

manisandro commented 6 years ago

Here you go:

Shreeshrii commented 6 years ago

Thanks!! It is working great. I tested with Devanagari, san (Sanskrit) and Gurmukhi traineddata files.

I am attaching input files and pdfs from the test.

siddhanta.pdf siddhanta

hin-eng hin-eng.pdf

Shreeshrii commented 6 years ago

Two unrelated items that I noticed:

  1. In HOCR mode, it is not possible to select a section of image for processing. The selection crosshair is displayed but it does not do any selection.

  2. If the podofo printer backend is selected, the pdf is not created, is zero size, or is locked in some manner. If qtprinter is selected after that, the pdf file cannot be opened.

Good to see the export to odt option (is this a new feature?).

manisandro commented 6 years ago
  1. Correct, hOCR is always page based (due to the nature of the hOCR format). While clearly a subset of a document can also be seen as a hOCR page, things get complicated when you have to start merging hOCR documents which represent different portions of the same image.
  2. Need to investigate, might be a regression with the code I introduced last night.

Yes, ODT is indeed new.

Overall testing is very much welcome since I'd like to push out a new release soon.

Shreeshrii commented 6 years ago
  1. Then in HOCR mode the selection crosshair should not be displayed.

  2. I have not tested podofo with an English document, just with these complex script ones. At one time I saw a help text saying that qtprinter should be used for complex scripts, but I am not sure where the cursor was hovering at that point. I couldn't get it to display again.

Would it help to make qtprinter the default choice in the export pdf dialog for complex scripts?

manisandro commented 6 years ago
  1. Valid point
  2. Just click on the hint-icon next to the combobox
  3. I don't know how to reliably detect whether complex scripts are involved.
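
One possible heuristic for point 3, sketched in Python: treat the text as needing complex-script handling if any character falls in a Unicode block known to require shaping or has a right-to-left bidirectional class. The function name and the block list are assumptions for this sketch; the list is deliberately incomplete and this is not how gImageReader decides anything.

```python
import unicodedata

# Assumed, incomplete list of code point ranges whose scripts need shaping.
COMPLEX_RANGES = (
    (0x0590, 0x08FF),  # Hebrew, Arabic, Syriac, Thaana and related blocks
    (0x0900, 0x0DFF),  # Indic blocks from Devanagari through Sinhala
    (0x0E00, 0x0FFF),  # Thai, Lao, Tibetan
    (0x1000, 0x109F),  # Myanmar
    (0x1780, 0x17FF),  # Khmer
)


def looks_complex(text):
    """Heuristic: True if any character suggests a complex or RTL script."""
    for ch in text:
        cp = ord(ch)
        if any(lo <= cp <= hi for lo, hi in COMPLEX_RANGES):
            return True
        # "R" and "AL" are the right-to-left bidirectional classes.
        if unicodedata.bidirectional(ch) in ("R", "AL"):
            return True
    return False


print(looks_complex("Hello World"))
print(looks_complex("\u0915\u093F"))
```

Such a check could run over the recognized text before the export dialog opens and preselect a backend, while still leaving the choice to the user.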

Shreeshrii commented 6 years ago

When using `pdf with invisible text overlay', the pdf file size becomes much larger.

E.g., using a 600 dpi image of 512 kB, the resulting pdf is 1175 kB with the default setting of 300 dpi in the export dialog.

hin-eng.pdf hin-eng

manisandro commented 6 years ago

Ah apropos, I see now that the windows build is missing some icons, which is why you can't see the hint icon, for example.

manisandro commented 6 years ago

Size: that's the price of QPrinter. Nothing I can do about that. QPrinter internally hard-codes the image compression method to JPEG@94% quality.

Shreeshrii commented 6 years ago

Please see http://doc.qt.io/qt-5/qimagewriter.html

Qt provides the QImageWriter class which supports setting format specific options, such as the gamma level, compression level and quality, prior to storing the image.

Shreeshrii commented 6 years ago

By changing the image options in export `pdf with invisible text overlay' with qtprinter, the pdf size can be reduced.

I changed the settings from color to monochrome and dpi from 300 to 100.

The resulting pdf size is now 355 kb instead of 1175kb.
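
The effect of the dpi setting can be estimated from the uncompressed pixel data alone, since the pixel count grows with the square of the resolution. The sketch below assumes an A4 page (8.27 x 11.69 inches), which is not stated in this thread, and ignores compression entirely, so it only shows the trend and not the actual file sizes reported here.

```python
import math


def raw_image_bytes(width_in, height_in, dpi, bits_per_pixel):
    """Uncompressed size of a page image, with rows padded to whole bytes."""
    width_px = round(width_in * dpi)
    height_px = round(height_in * dpi)
    row_bytes = math.ceil(width_px * bits_per_pixel / 8)
    return row_bytes * height_px


A4 = (8.27, 11.69)  # assumed page size in inches

color_300 = raw_image_bytes(*A4, dpi=300, bits_per_pixel=24)
color_100 = raw_image_bytes(*A4, dpi=100, bits_per_pixel=24)
mono_100 = raw_image_bytes(*A4, dpi=100, bits_per_pixel=1)

print(color_300 / color_100)  # 300 dpi holds 9 times the pixels of 100 dpi
print(color_100 / mono_100)
```

The final size in the pdf depends on the codec applied to these pixels, so the raw ratios are only an upper-level guide.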

hin-eng.pdf

manisandro commented 6 years ago

Please see http://doc.qt.io/qt-5/qimagewriter.html

Sure, but QPrinter does not use QImageWriter

Shreeshrii commented 6 years ago

You could offer option to change the printermode as part of export pdf dialog

enum PrinterMode { ScreenResolution, PrinterResolution, HighResolution }

manisandro commented 6 years ago

That enum has no effect since it is overridden by the resolution the user chooses.

Shreeshrii commented 6 years ago

I changed the settings from color to monochrome and dpi from 300 to 100. The resulting pdf size is now 355 kb instead of 1175kb.

Changed format to grayscale instead of monochrome and 100 dpi, resulting pdf is 276kb.

Of course, without original image, the pdf size is much smaller, so could be made at 300 dpi.

manisandro commented 6 years ago

For monochrome you really need CCITT/FAX encoding to have a reasonably small file size, but as mentioned, it is not doable with QPrinter.

Shreeshrii commented 6 years ago

Thanks!

Going back to the original issue report and current status:

originally with podofo, for Devanagari script

If I copy the text from pdf and paste in notepad++, the rendering is correct. However rendering in the pdf file itself is incorrect.

currently with podofo, for Devanagari script

pdf is not created/is zero size/locks the pdf file.

currently with qtprinter, for Devanagari script

The rendering in pdf preview and pdf file is correct. Overlapping character problem can be fixed by reducing the font size %. However, when I copy the text from pdf and paste in notepad++, the rendering is incorrect.

manisandro commented 6 years ago

The rendering in pdf preview and pdf file is correct. Overlapping character problem can be fixed by reducing the font size %. However, when I copy the text from pdf and paste in notepad++, the rendering is incorrect.

Well that sucks. I don't think there is anything I can do here... Again, it is QPainter internals.

Shreeshrii commented 6 years ago

Assuming that the regression regarding podofo and Devanagari can be fixed, I think the best option might be to use

  1. PoDoFo
  2. PDF with invisible text layer
  3. Fax-level compression for the image

That way, the visible part of pdf will be correct since it uses the original image.

And, the text layer will be correct (as per earlier test with podofo).

manisandro commented 6 years ago

PoDoFo can definitely be fixed. I'll test it on windows this evening and see what went wrong, and I'll post a fresh test build as soon as I have fixed things.

It is kinda odd though that the Devanagari script is correctly rendered using QPainter, but is wrong when copying.

Shreeshrii commented 6 years ago

This is a known problem with most pdf writers for complex scripts. The glyphs for combined consonants and reordered combining marks do not get copied correctly from pdfs.

XeTeX, with its support for ActualText, handles this correctly, and so does PoDoFo, based on my earlier test.
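
The "actual text" support mentioned above presumably refers to the PDF `/ActualText` property, which attaches the logical Unicode string to a run of glyph-drawing operators so that copy and search return that string regardless of which glyphs were drawn. A minimal sketch of wrapping a run in such a marked-content sequence follows; the function name and the placeholder glyph operators are hypothetical, and this is not what XeTeX or PoDoFo emit verbatim.

```python
def with_actual_text(glyph_ops, logical_text):
    """Wrap glyph-drawing operators in a /Span carrying /ActualText.

    The replacement text is written as a hex string in UTF-16BE with a
    leading byte order mark (FEFF), the PDF convention for Unicode strings.
    """
    payload = ("\ufeff" + logical_text).encode("utf-16-be").hex().upper()
    return f"/Span << /ActualText <{payload}> >> BDC\n{glyph_ops}\nEMC"


# "(x) Tj" stands in for whatever operators draw the shaped glyphs.
print(with_actual_text("(x) Tj", "\u0915\u093F"))
```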

Pdfs created by OpenOffice and LibreOffice have the same problem.

Shreeshrii commented 6 years ago

Complex script text can also be copied correctly from pdfs created by tesseract, which use the original image for the visual layer.

manisandro commented 6 years ago

About the PoDoFo locking issue: isn't it just that you have the output PDF open in a PDF viewer or such which is locking the file?

Shreeshrii commented 6 years ago

With PoDoFo

Export to PDF dialog closes but there is no indication whether the export is completed.

When I look in File Manager, it shows a pdf of 0kb.

On refreshing File Manager after a while, pdf file shows up with a size.

When I double click to open it, Adobe Reader gives an error saying file in use or open in another application.

So, it seems to be locked by gimagereader.

manisandro commented 6 years ago

Are you creating a new file when exporting or overwriting an existing one? If the latter, are you sure that file isn't open in another application?

manisandro commented 6 years ago

I've updated the test builds with a couple of fixes, one might be related to the issue you are seeing. Links as usual:

Shreeshrii commented 6 years ago

Thanks for the prompt test build. It is working fine now, i.e.

  1. HOCR mode, crosshair not being displayed.
  2. Hint-icon is being displayed next to combo-box.
  3. PoDoFo printer is NOT locking the pdf file.

For Devanagari, the export to pdf option that worked well for my test image:

The generated pdf is 164 kb. Original image was 512kb at 600 dpi. The Devanagari text can be correctly copied and pasted as text. (If you want to test this for other complex scripts, export to TXT and export to PDF and copy and paste text from that and compare the two files).
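
The comparison suggested in parentheses can be automated once the text copied from the pdf has been saved to a file. The sketch below, with hypothetical function names, normalizes both strings (Unicode NFC, collapsed whitespace) and reports the first position where they differ, so that line-wrapping differences from the pdf viewer do not count as errors.

```python
import unicodedata


def normalize(text):
    """Apply NFC and collapse all runs of whitespace to single spaces."""
    return " ".join(unicodedata.normalize("NFC", text).split())


def first_difference(expected, actual):
    """Return the index of the first mismatch, or None if the texts agree."""
    a, b = normalize(expected), normalize(actual)
    for i, (x, y) in enumerate(zip(a, b)):
        if x != y:
            return i
    return None if len(a) == len(b) else min(len(a), len(b))


txt_export = "\u0915\u093F  \u0930\u093E\u092E\n"
pdf_copy_ok = "\u0915\u093F \u0930\u093E\u092E"
pdf_copy_bad = "\u093F\u0915 \u0930\u093E\u092E"  # vowel sign copied out of order

print(first_difference(txt_export, pdf_copy_ok))
print(first_difference(txt_export, pdf_copy_bad))
```

In practice both inputs would be read from files opened with `encoding="utf-8"`.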

manisandro commented 6 years ago

So you are using PoDoFo "PDF with invisible text overlayer" and QPrinter for regular PDF, right?

Shreeshrii commented 6 years ago

Yes, I think both options should be available for users.


Shreeshrii commented 6 years ago

Does the hocr option have a way to display only those words which have low confidence? Might make it easier for correction.

Also, using different traineddata files, in hocr mode, different words get dropped from recognition.

manisandro commented 6 years ago

Does the hocr option have a way to display only those words which have low confidence? Might make it easier for correction.

How would you define "low"?

Also, using different traineddata files, in hocr mode, different words get dropped from recognition.

Not following what you mean.

Shreeshrii commented 6 years ago
  1. Low could be a user defined percentage. I will have to check, but I think for devanagari documents the confidence level was as low as 0 for some words.

  2. Again, this is based on tests with documents in devanagari script, which can be processed using multiple traineddata files such as devanagari, hin, san, mar and nep. The OCR process drops certain words in recognition. However, different language traineddata give different results. E.g., a word may be dropped by devanagari but recognised by hin.

This dropping of words might also be related to confidence levels.

A related question: is the hocr demarcation of text blocks etc. a common layout analysis routine, or does it depend on the traineddata?

I will provide an example with samples tomorrow. That will help clarify.

manisandro commented 6 years ago

I'm afraid I can't help much with traineddata issues or with what hocr text tesseract produces. You'll have to take those issues upstream.

Shreeshrii commented 6 years ago

You'll have to take those issues upstream.

That's what I thought. Thanks!

I looked at the word confidence values. They range from 0 to 90+. It would be helpful to have a filter on the conf values, so e.g. a user could choose to look at values below 10%, 20%, 50% - any threshold they choose.
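
A filter of this kind can be prototyped outside gImageReader. In hOCR output, each word is a `span` with class `ocrx_word` whose `title` attribute carries the confidence as `x_wconf`. The Python sketch below collects words under a user-chosen threshold; the class and function names and the sample markup are made up for this illustration, and this is not gImageReader code.

```python
import re
from html.parser import HTMLParser


class WordConfidenceParser(HTMLParser):
    """Collect (text, confidence) for every ocrx_word span in hOCR markup."""

    def __init__(self):
        super().__init__()
        self.words = []
        self._conf = None   # confidence of the word currently being read
        self._depth = 0     # nesting depth of tags inside that word
        self._text = []

    def handle_starttag(self, tag, attrs):
        if self._conf is not None:
            self._depth += 1  # e.g. <strong> or <em> inside a word
            return
        attrs = dict(attrs)
        if tag == "span" and "ocrx_word" in (attrs.get("class") or "").split():
            match = re.search(r"x_wconf\s+(\d+)", attrs.get("title") or "")
            if match:
                self._conf = int(match.group(1))
                self._depth = 0
                self._text = []

    def handle_endtag(self, tag):
        if self._conf is None:
            return
        if self._depth > 0:
            self._depth -= 1
            return
        self.words.append(("".join(self._text).strip(), self._conf))
        self._conf = None

    def handle_data(self, data):
        if self._conf is not None:
            self._text.append(data)


def low_confidence_words(hocr, threshold):
    """Return the words whose x_wconf is below the given threshold."""
    parser = WordConfidenceParser()
    parser.feed(hocr)
    parser.close()
    return [(text, conf) for text, conf in parser.words if conf < threshold]


SAMPLE = (
    "<span class='ocrx_word' id='word_1_1' "
    "title='bbox 36 92 96 116; x_wconf 93'>\u0930\u093E\u092E</span> "
    "<span class='ocrx_word' id='word_1_2' "
    "title='bbox 100 92 160 116; x_wconf 0'><em>\u0915\u093F</em></span>"
)
print(low_confidence_words(SAMPLE, 10))
```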

conf_level

bmwmy commented 6 years ago

I am trying to use PoDoFo for Arabic. Seems letters get reversed in each word.

Hello World becomes olleH dlroW

I suggest adding an option to output rtl text (which reverses the letters in each word).
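
A minimal sketch of the suggested per-word reversal, in Python. It keeps combining marks attached to their base letter while reversing, which matters for vocalized Arabic. The function names are hypothetical; this is a naive workaround and not an implementation of the Unicode bidirectional algorithm, so mixed left-to-right and right-to-left text, digits and punctuation are not handled.

```python
import unicodedata


def reverse_word(word):
    """Reverse a word, keeping combining marks after their base character."""
    clusters = []
    for ch in word:
        if clusters and unicodedata.combining(ch):
            clusters[-1] += ch   # attach the mark to the preceding base
        else:
            clusters.append(ch)
    return "".join(reversed(clusters))


def reverse_each_word(text):
    """Reverse the letters of every space-separated word; word order is kept."""
    return " ".join(reverse_word(word) for word in text.split(" "))


print(reverse_each_word("olleH dlroW"))
```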

QPrinter is not suitable for monochrome documents: an 18 MB original pdf file becomes a 1.4 GB pdf with invisible text using QPrinter.

hocr2pdf is a good alternative; also consider the iText library.

manisandro commented 6 years ago

hocr2pdf is just a node.js wrapper around tesseract AFAICS, and itext is a proprietary Java/.NET library.

For it to work in gImageReader I'm afraid I currently don't see any other way than actually implementing the missing support for complex scripts in PoDoFo. This though requires thorough knowledge of the PDF spec and time, both of which are currently lacking.

Shreeshrii commented 6 years ago

@bmwmy is the reversal problem also there in the txt and HOCR output of tesseract?