pymupdf / PyMuPDF

PyMuPDF is a high performance Python library for data extraction, analysis, conversion & manipulation of PDF (and other) documents.
https://pymupdf.readthedocs.io
GNU Affero General Public License v3.0

Change Visibility of OCR'd pdf text layer #3533

Closed mikejokic closed 4 months ago

mikejokic commented 4 months ago

Is your feature request related to a problem? Please describe.

I have OCR'd an image to generate a text layer over the image. This text layer is invisible in the PDF. I then use Ghostscript to remove the image and vector data and keep only the text layer, which further reduces file size while keeping the page's textual structure intact.

TestOCR.pdf - OCR'd image as pdf

TestOCR_textonly.pdf - image and vector data removed using Ghostscript with -dFILTERIMAGE -dFILTERVECTOR. Highlighting over this "blank" PDF shows that the text layer is still there.

TestOCR.pdf

TestOCR_textonly.pdf
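For reference, the Ghostscript step described above could be scripted roughly like this. It is only a sketch: the two filter flags come from the description above, while the `pdfwrite` device, the `-o` output option, the executable name `gs` (it differs on Windows) and both function names are assumptions for illustration.

```python
import subprocess


def ghostscript_text_only_cmd(src, dst, gs="gs"):
    """Build a Ghostscript command line that drops images and vector graphics.

    The executable name "gs" is an assumption (Linux/macOS default).
    """
    return [
        gs,
        "-o", dst,             # output file; -o also implies -dBATCH -dNOPAUSE
        "-sDEVICE=pdfwrite",   # write a PDF again
        "-dFILTERIMAGE",       # drop images
        "-dFILTERVECTOR",      # drop vector graphics
        src,
    ]


def strip_images_and_vectors(src, dst):
    """Run Ghostscript; raises CalledProcessError if it fails."""
    subprocess.run(ghostscript_text_only_cmd(src, dst), check=True)
```

Calling `strip_images_and_vectors("TestOCR.pdf", "TestOCR_textonly.pdf")` would then produce the text-only file, provided Ghostscript is installed and on the PATH.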

Describe the solution you'd like

Make this text layer visible in TestOCR_textonly.pdf. I want the OCR'd text to be visible following the same structural layout as the input. Can I change the render mode or color for all the text in this pdf to be visible? My pipeline will eventually deal with very large pdf files, so would like the solution to be performant as well.

@JorjMcKie I have tried your solutions for changing text font color found here but to no avail. Would really appreciate any support.

JorjMcKie commented 4 months ago

OCR-ed text may have been made invisible in a number of different ways. Choosing some color (like white-on-white or black cat in the night) is not among these alternatives. So changing the text color is the wrong idea.

Sometimes the text is written in "background", such that the image covers it. Most often though, OCRed text is stored with the PDF attribute "hidden" so removing the image will still not make it visible.

You can locate the respective PDF command 3 Tr and remove or change it in a hacky way. In your case however, OCR was done with Tesseract obviously. Its GlyphLessFont means exactly that: it is a font for which no visible representation exists - in other words, there exist no glyphs. So all the previous hacks will still not lead to visible text!
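For PDFs where the hidden text uses a normal font, the "3 Tr" hack mentioned above might look roughly like the sketch below. It rewrites text render mode 3 (invisible) to 0 (fill) in each page's content streams. The function names are made up for illustration, and the replacement is naive: it only looks at page-level content streams (not Form XObjects) and could in theory hit a matching byte sequence inside a string literal. As said above, it will not help with Tesseract's GlyphLessFont, because there are no glyphs to draw.

```python
import re

# "3 Tr" sets text render mode 3 (invisible). The lookbehind avoids
# touching operands such as "13 Tr" or "0.3 Tr".
_INVISIBLE_TR = re.compile(rb"(?<![\w.])3\s+Tr\b")


def make_text_visible(stream: bytes) -> bytes:
    """Replace every '3 Tr' operator in a content stream by '0 Tr'."""
    return _INVISIBLE_TR.sub(b"0 Tr", stream)


def patch_pdf(src: str, dst: str) -> int:
    """Patch all page content streams; return how many streams changed."""
    import pymupdf  # older PyMuPDF versions: import fitz as pymupdf

    doc = pymupdf.open(src)
    changed = 0
    for page in doc:
        for xref in page.get_contents():   # xrefs of the page's /Contents
            old = doc.xref_stream(xref)    # decompressed stream bytes
            new = make_text_visible(old)
            if new != old:
                doc.update_stream(xref, new)
                changed += 1
    doc.save(dst, garbage=3, deflate=True)
    return changed
```

Since this only touches raw stream bytes and never renders or extracts anything, it should stay reasonably fast on large files.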

The only way I see is this approach:

  1. Remove image(s) etc.
  2. Replace GlyphLessFont by Courier. GlyphLessFont is a mono-spaced font, so Courier is a possible / good choice. You can use the font replacement script here.

This is what comes out in your test case: [image]

Pretty ugly ... 🤷‍♂️
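A different and simpler route than the font replacement script (so not the method used for the result above) would be to extract the OCR text spans and re-insert them on fresh pages using Courier. The sketch below assumes that PyMuPDF's text extraction returns the hidden spans with usable origins and sizes; the function names are hypothetical. Page rotation and the horizontal scaling that Tesseract applies are ignored, so expect overlapping or oddly spaced text.

```python
def iter_spans(page_dict):
    """Yield (origin, size, text) for each non-blank span of a
    page.get_text("dict") result."""
    for block in page_dict.get("blocks", []):
        for line in block.get("lines", []):   # image blocks have no "lines"
            for span in line.get("spans", []):
                if span["text"].strip():
                    yield tuple(span["origin"]), span["size"], span["text"]


def rewrite_visible(src, dst, fontname="cour"):
    """Copy the text of src onto new, empty pages using a visible font."""
    import pymupdf  # older PyMuPDF versions: import fitz as pymupdf

    old = pymupdf.open(src)
    new = pymupdf.open()                      # new, empty PDF
    for page in old:
        out = new.new_page(width=page.rect.width, height=page.rect.height)
        for origin, size, text in iter_spans(page.get_text("dict")):
            # "cour" is PyMuPDF's name for the built-in Courier font
            out.insert_text(origin, text, fontname=fontname, fontsize=size)
    new.save(dst, garbage=3, deflate=True)
```

Because the output is built from scratch, the image removal step would not be needed in this variant, but it does extract and re-insert every span, so it is likely slower than patching streams.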