pymupdf / PyMuPDF

PyMuPDF is a high performance Python library for data extraction, analysis, conversion & manipulation of PDF (and other) documents.
https://pymupdf.readthedocs.io
GNU Affero General Public License v3.0

Partial OCR using "get_textpage_ocr" ignores image masks while extracting text #3842

Open rohitlal125555 opened 2 months ago

rohitlal125555 commented 2 months ago

Description of the bug

I have a pdf document from which I want to extract text. PDF - https://nvlpubs.nist.gov/nistpubs/SpecialPublications/NIST.SP.800-223.pdf

To extract the text on page 7 (the table of contents), I called get_textpage_ocr with full=False, since the page contains both digital text and text rendered as images. However, the result contains only the digital text, not the OCRed text from the image parts.

While looking into the code of the get_textpage_ocr function (in utils.py), I see it iterates over each block and identifies the image blocks using type=1 filter. Then it extracts the image from the block, builds a Pixmap, and passes it to the OCR component.

tpage = page.get_textpage(flags=flags)
for block in page.get_text("dict", flags=pymupdf.TEXT_PRESERVE_IMAGES)["blocks"]:
    if block["type"] != 1:  # only look at images
        continue
    bbox = pymupdf.Rect(block["bbox"])
    if bbox.width <= 3 or bbox.height <= 3:  # ignore tiny stuff
        continue
    exception_types = (RuntimeError, mupdf.FzErrorBase)
    if pymupdf.mupdf_version_tuple < (1, 24):
        exception_types = RuntimeError
    try:
        pix = pymupdf.Pixmap(block["image"])  # get image pixmap
        if pix.n - pix.alpha != 3:  # we need to convert this to RGB!
            pix = pymupdf.Pixmap(pymupdf.csRGB, pix)
        if pix.alpha:  # must remove alpha channel
            pix = pymupdf.Pixmap(pix, 0)
        imgdoc = pymupdf.Document(
                "pdf",
                pix.pdfocr_tobytes(language=language, tessdata=tessdata),
                )  # pdf with OCRed page
        imgpage = imgdoc.load_page(0)  # read image as a page
        pix = None
        # compute matrix to transform coordinates back to that of 'page'
        imgrect = imgpage.rect  # page size of image PDF
        shrink = pymupdf.Matrix(1 / imgrect.width, 1 / imgrect.height)
        mat = shrink * block["transform"]
        imgpage.extend_textpage(tpage, flags=0, matrix=mat)
        imgdoc.close()
    # (excerpt ends here; the "except" clause that uses exception_types is not shown)

However, this function does not consider the image mask. I think this is why the extracted image comes out as the unmasked base image (which looks completely black), and why Tesseract cannot extract any text from those image parts.

Further, I'm aware that there exists a Page.get_images() function which returns the xref and smask, which can be later used to unmask the images using the below code -

pix1 = pymupdf.Pixmap(doc.extract_image(xref)["image"])    # (1) pixmap of image w/o alpha
mask = pymupdf.Pixmap(doc.extract_image(smask)["image"])   # (2) mask pixmap
pix = pymupdf.Pixmap(pix1, mask)                           # (3) copy of pix1, image mask added

Using this method, I'm able to get the image with readable text (unlike the black image that is extracted internally by the get_textpage_ocr function).

Can we update the page.get_text function (which is called inside the get_textpage_ocr function) to keep both the image and smask values in the block dictionary, or at least the xref of the image, so that one can look up the smask using the xref?

I can't use the page.get_images in my application since I need the bounding boxes coordinates as well, which are only provided in the block dictionary retrieved via page.get_text.

Any ideas to resolve this issue? Let me know if you need any more information to replicate this issue.

How to reproduce the bug

  1. Download the pdf file - https://nvlpubs.nist.gov/nistpubs/SpecialPublications/NIST.SP.800-223.pdf
  2. Extract the text of page 7 of this pdf with partial OCR, using the code below -
import pymupdf
doc = pymupdf.open('resources/NIST.SP.800-223.pdf')
page = doc[6]    # page 7, since the index starts from zero

partial_tp = page.get_textpage_ocr(flags=0, full=False)
text_p_ocr = page.get_text(textpage=partial_tp)
print(text_p_ocr)

PyMuPDF version

1.24.10

Operating system

Windows

Python version

3.10

JorjMcKie commented 2 months ago

Thanks for the report. It does look like you have a valid point here ...

However, there is a major problem here: The method looks at actual images on the page only. It knows nothing about PDF xrefs. The method used does not currently return image masks and would have to be changed to also return the image mask binary. Alternatively, we may look at the image bounding boxes on the page itself and make a pixmap from the respective areas instead. Probably faster to implement ...

Anyway - this will take some time.

rohitlal125555 commented 2 months ago

Yeah, I was trying to modify the get_text function to update the block dictionary with mask along with the image binary. However, based on my (limited) understanding by going through the library code in the past 24 hours, it seems like the preparation of the block dictionary is happening outside this Python library (best guess - it happens in the MuPDF backend). I was able to trace the function - JM_make_textpage_dict(*args, **kwargs) which has no real definition in Python code.

I didn't fully understand your alternate proposal. Currently, the pixmap is indeed prepared from the binary image data coming from the block dictionary, i.e. pix = pymupdf.Pixmap(block["image"]). Do you mean that rather than taking the binary data from the block, we take the image data from someplace else (which in some way resolves our problem of unmasking the image)?

I have another rudimentary idea - If the number and sequence of images being returned from the 2 functions (page.get_images & page.get_text) are always the same, then perhaps I can do a 1-1 mapping between the 2 outputs to get the image and mask. However, this is only possible if the former assumption is true.
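As a rough sketch of that mapping idea: instead of relying on both functions returning images in the same order, the two outputs could be joined on the xref. This assumes a PyMuPDF version where page.get_image_info accepts xrefs=True and reports an "xref" key; the helper name image_boxes_with_masks is hypothetical and not part of the library.

```python
def image_boxes_with_masks(page):
    """Return [(bbox, xref, smask), ...] for the images shown on `page`.

    `page` is expected to be a pymupdf.Page (assumption of this sketch).
    smask is 0 when no soft mask is recorded for the image; xref is 0 for
    inline images, which cannot be looked up in the image list.
    """
    # xref -> smask, taken from the page's image list
    # (each entry starts with: xref, smask, width, height, ...)
    smask_of = {item[0]: item[1] for item in page.get_images(full=True)}

    result = []
    for info in page.get_image_info(xrefs=True):  # one dict per displayed image
        xref = info["xref"]
        result.append((info["bbox"], xref, smask_of.get(xref, 0)))
    return result


# Possible usage with a real document:
#   import pymupdf
#   doc = pymupdf.open("NIST.SP.800-223.pdf")
#   for bbox, xref, smask in image_boxes_with_masks(doc[6]):
#       print(bbox, xref, smask)
```

Wherever smask is non-zero, the image and its mask could then be combined with the three Pixmap calls shown in the original report.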

JorjMcKie commented 2 months ago

No, no. I explained my point poorly. If an image is detected on the page (and is eligible with respect to its bbox size - like at least 20 x 20 or so), then do pix=page.get_pixmap(dpi=large, clip=bbox). This ignores all technical details of the image (mask and whatever else). The pixmap will look exactly like the image on the page right from the beginning.
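A minimal sketch of this proposal, under stated assumptions: the function name ocr_image_regions and the defaults (dpi=300, min_size=20, language="eng") are illustrative only, and the image areas are taken from page.get_image_info() rather than from the text dictionary. It renders each sufficiently large image area as it appears on the page and passes the rendering to Tesseract via Pixmap.pdfocr_tobytes, which requires Tesseract language data to be installed.

```python
def ocr_image_regions(page, dpi=300, min_size=20, language="eng"):
    """Render each image area of `page` as displayed and OCR the rendering.

    `page` is expected to be a pymupdf.Page (assumption of this sketch).
    Returns [(bbox, pdf_bytes), ...], where pdf_bytes is the one-page OCRed
    PDF produced by Pixmap.pdfocr_tobytes for that area.
    """
    results = []
    for info in page.get_image_info():
        bbox = info["bbox"]
        x0, y0, x1, y1 = bbox
        # skip areas too small to contain readable text
        if x1 - x0 < min_size or y1 - y0 < min_size:
            continue
        # rendering the clip applies masks, transparency, etc. automatically
        pix = page.get_pixmap(dpi=dpi, clip=bbox)
        results.append((bbox, pix.pdfocr_tobytes(language=language)))
    return results
```

Merging the recognized text back into the page's textpage is left out here; it would presumably follow the extend_textpage step of the library excerpt quoted earlier, with a matrix that maps the OCR page onto bbox.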