kcroker / dpsprep

Python DJVU to PDF converter which preserves OCR text and bookmark metadata (e.g. TOC)

no OCR after conversion (wrongly OCR'ed djvus?) #20

Open maras opened 1 year ago

maras commented 1 year ago

This issue continues that part of #16 about OCR, but with other files.

Two files. First, the Kornai file. I can copy text correctly from the DjVu file in DjView4, but not in Okular, where the blue text boxes don't show up at all. Evince shows the boxes and lets me copy text (even correctly), but the boxes look very strange, with the wrong orientation and placement (here I was copying the first paragraph): [screenshot: 2023-10-28-185001] There is no OCR after conversion. Something is wrong with the DjVu file, and I doubt this can be solved without re-OCR.

The file 2.djvu has a correct OCR layer (with many mistakes, but I think that shouldn't matter) that is visible in Okular and other viewers, and I can copy text from them correctly. Yet there is no OCR after conversion. This case is stranger, because the DjVu file seems normal.

v-- commented 1 year ago

The two files have a common issue - an unrecognized type of text annotation. I added this new "region" list expression type and both files now have text annotations.

I originally put the corresponding warnings as debug messages (only viewable via --verbose), but now that I think of it, they are better off as warnings. So text processing should now print more obvious error messages for unrecognized list expressions.
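To illustrate the idea (this is a minimal sketch, not dpsprep's actual code): the DjVu hidden text layer is a nested list of zones, and the zone types listed in the djvused documentation are `page`, `column`, `region`, `para`, `line`, `word` and `char`. The walker below assumes the layer has already been parsed into plain Python lists; the names `KNOWN_ZONES` and `collect_words` and the sample `page` data are hypothetical. It logs a warning, rather than a debug message, whenever it meets a zone type it does not know.

```python
import logging

logger = logging.getLogger("text_layer_sketch")

# Zone types of the DjVu hidden text layer, as documented for djvused.
KNOWN_ZONES = {"page", "column", "region", "para", "line", "word", "char"}


def collect_words(node, words=None):
    """Walk a nested zone list and collect (bbox, text) pairs from leaf zones.

    Assumption: each node looks like [zone_type, x0, y0, x1, y1, *children],
    where every child is either another node or a text string.
    """
    if words is None:
        words = []
    zone, x0, y0, x1, y1, *children = node
    if zone not in KNOWN_ZONES:
        # Warn loudly instead of hiding the problem behind a verbose flag.
        logger.warning("Unrecognized text zone type %r; skipping", zone)
        return words
    for child in children:
        if isinstance(child, str):
            words.append(((x0, y0, x1, y1), child))
        else:
            collect_words(child, words)
    return words


# Hypothetical sample data: one "region" zone plus one unknown zone type.
page = ["page", 0, 0, 2000, 3000,
        ["region", 100, 2500, 1900, 2900,
         ["line", 100, 2800, 1900, 2900,
          ["word", 100, 2800, 400, 2900, "Socialist"],
          ["word", 420, 2800, 700, 2900, "System"]]],
        ["mystery", 0, 0, 10, 10, "ignored"]]

words = collect_words(page)
```

With a zone set that lacks `region`, the whole subtree under it would be skipped, which matches the symptom of a converted file with no text at all.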

One thing I've noticed regarding the second file ("Психология наровод и наций") is that it is very large (between 400 and 600MiB depending on the optimizations). This may cause performance issues with PDF viewers.

The first file's ("The Socialist System") translated text layer is bonkers:

[screenshot: Screenshot_20231029_105404]

The DjVu looks a little better, but is still weird:

[screenshot: Screenshot_20231029_111141]

As discussed previously, I don't know how to assign text to specified boxes in PDF files, which is what needs to be done here.

maras commented 1 year ago

Yes, the first file (Kornai) has a rubbish text layer after the dpsprep update. It might be that it was OCR'ed in a vertical orientation, and the pages were later rotated to horizontal while the coordinates of the text layer stayed the same. That is how it seems to me when looking at the DjVu text layer: it is obviously vertical. I have seen several such files. Re-OCR'ed.
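The rotation guess can be made concrete with a small sketch. It is only an illustration of the hypothesis, not a diagnosis of the Kornai file and not something dpsprep does. It assumes DjVu-style coordinates (origin at the bottom-left corner, y pointing up) and a 90-degree clockwise page rotation; the function name `rotate_box_cw` and the numbers are hypothetical.

```python
def rotate_box_cw(box, page_width):
    """Map a text box to its position after rotating the page 90 degrees clockwise.

    Coordinates are assumed to have the origin at the bottom-left corner.
    A point (x, y) on the original page lands at (y, page_width - x).
    """
    x0, y0, x1, y1 = box
    return (y0, page_width - x1, y1, page_width - x0)


# A word box that is 300 units wide and 100 units tall on a 2000 x 3000 page.
original = (100, 2800, 400, 2900)
rotated = rotate_box_cw(original, page_width=2000)
```

If the page image is rotated but the boxes are left as they were, every box keeps the shape and position it had in the old orientation, so a viewer would draw wide word boxes as tall ones in the wrong place, much like what Evince shows.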

The situation is better with the second file (Психология): I can copy text from the converted PDF just as from the DjVu. So this update improved the conversion :). Thank you.

Some files do become very big after conversion, that's true, but quite rarely. DjVu's compression algorithms are still better than PDF's. My converted second file is 308 MB. I ordinarily use the options --quality=50 -O3. Maybe that's too harsh, but they work quite well even with pictures in books. I even tested --quality=10 -O3, and it worked nicely with text but was too harsh for pictures. With those options this file is 128 MB. Quite a good result, and I see no loss of text quality.