pymupdf / PyMuPDF

PyMuPDF is a high performance Python library for data extraction, analysis, conversion & manipulation of PDF (and other) documents.
https://pymupdf.readthedocs.io
GNU Affero General Public License v3.0

Exploring Text and Bounding Box Extraction Anomalies in PDFs with PyMuPDF #3248

Closed satvik-27199 closed 8 months ago

satvik-27199 commented 8 months ago

Exploring Text and Bounding Box Extraction Anomalies in PDFs with PyMuPDF

Thank you for the excellent work. I'd like to mention an issue related to text and bounding box (bbox) extraction. I've been attempting to extract text values and their corresponding position vectors from tables, and the solution works wonderfully. However, I've encountered a problem with some PDFs (sample attached). When extracting information with PyMuPDF, despite there being significant whitespace between the words "GP" and "Unreserved", it groups them into one block. To find the root cause, I ran a word-level bbox extraction and discovered that the space between "GP" and "Unreserved" is only 2-3 points along the x-axis, which visually does not seem accurate. For comparison, the "Reserved" vector itself spans approximately 30 points (274.56 - 244.95). So why is the gap between "GP" and "Unreserved" only around 3 points (289.20 - 286.53)?

```python
# block-level bbox
(242.27, 422.79, 339.06, 429.30, ' Reserved GP Unreserved GP\n')

# word-level bboxes
(244.95, 422.79, 274.56, 429.30, 'Reserved', 9, 0, 0)
(277.23, 422.79, 286.53, 429.30, 'GP', 9, 0, 1)
(289.20, 422.79, 327.09, 429.30, 'Unreserved', 9, 0, 2)
(329.76, 422.79, 339.06, 429.30, 'GP', 9, 0, 3)
```
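For reference, the inter-word gaps can be computed directly from the word tuples quoted above (this snippet just re-uses those numbers; it is not part of the original report):

```python
# Word tuple layout: (x0, y0, x1, y1, word, block_no, line_no, word_no)
words = [
    (244.95, 422.79, 274.56, 429.30, "Reserved", 9, 0, 0),
    (277.23, 422.79, 286.53, 429.30, "GP", 9, 0, 1),
    (289.20, 422.79, 327.09, 429.30, "Unreserved", 9, 0, 2),
    (329.76, 422.79, 339.06, 429.30, "GP", 9, 0, 3),
]

# Horizontal gap between consecutive words: next word's x0 minus previous word's x1
gaps = [round(b[0] - a[2], 2) for a, b in zip(words, words[1:])]
print(gaps)  # → [2.67, 2.67, 2.67]
```

All gaps are identical (2.67 points), which is why the words end up in a single block despite the visually wide whitespace in the scanned image.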

7.pdf

(Screenshot attached: Screen Shot 2024-03-09 at 12.42.38 PM)

How to reproduce the bug

```python
import fitz  # PyMuPDF


def extract_text_from_pdf(pdf_path):
    # Open the provided PDF file
    pdf_document = fitz.open(pdf_path)
    text = ""
    blocks = []

    # Iterate through each page of the PDF
    for page_num in range(len(pdf_document)):
        # Get the page
        page = pdf_document.load_page(page_num)
        # Extract the plain text from the page
        text += page.get_text()
        # Collect and print block- and word-level bboxes
        page_blocks = page.get_text("blocks", sort=False)
        print(page_blocks)
        blocks.extend(page_blocks)
        for word in page.get_text("words", sort=False):
            print(word)

    # Close the document
    pdf_document.close()
    return blocks


# Specify the path to your PDF file
pdf_path = "/content/7.pdf"

# Extract text
extracted_text = extract_text_from_pdf(pdf_path)
print(extracted_text)
```

PyMuPDF version

1.23.26

Operating system

macOS

Python version

3.10

JorjMcKie commented 8 months ago

There is no anomaly. What you see is an image with underlying OCRed text. Looking at it in a viewer will show that the match between the text-as-image and the OCRed text is, to put it politely, sloppy: just try to select text line by line with the mouse and you will see what I mean. With OCR quality like this, you can rely on nothing.

JorjMcKie commented 8 months ago

Anyway - thank you for your appreciation of PyMuPDF. Here is how to find blocks of text with the "ignore" attribute - which is usually (not always) used for storing OCR results:

```python
bboxes = page.get_bboxlog()
```

Every item's first sub-item is the block type. For hidden text the type is "ignore-text".

satvik-27199 commented 8 months ago

Thank you for your quick response. Is there a method I can use to classify these documents? I have approximately 100,000 documents, and some of them may exhibit the issue I described. I need to identify and remove these problematic documents from the corpus. Is there an efficient way to achieve this?

JorjMcKie commented 8 months ago

No - unless you want to exclude all OCRed PDFs.

satvik-27199 commented 8 months ago

OCRed PDFs

Yes, we can exclude these documents. Is there any way to do it automatically using some function from PyMuPDF, or do we need to do it manually?

JorjMcKie commented 8 months ago

You can execute the above method and check whether "ignore-text" items are present and whether the page is fully covered by at least one image. If so, it is probably an OCRed page.

JorjMcKie commented 8 months ago
```python
>>> doc = fitz.open("7.pdf")
>>> page = doc[0]
>>> page.get_images()
[(8, 0, 3800, 5550, 1, 'DeviceGray', '', 'img5', 'JBIG2Decode')]
>>> page.get_image_rects(8)
[Rect(0.0, 0.0, 456.0, 666.0)]
>>> page.rect
Rect(0.0, 0.0, 456.0, 674.0)  # image almost covers the page
>>> bbl = page.get_bboxlog()
>>> set([b[0] for b in bbl if "text" in b[0]])
{'ignore-text', 'fill-text'}  # mixture of normal and OCR text on page
```

satvik-27199 commented 8 months ago

We just tested this on 40-50 documents, and it correctly identifies the OCRed documents. Thank you for the help and the timely response. We really appreciate it and are glad we made the switch to PyMuPDF.

satvik-27199 commented 8 months ago

It looks like for OCRed documents we will need a combination of detection transformers + PyMuPDF to make something work.

63 (2).pdf

I have one more question 🙏. In the attached PDF (63 (2).pdf), some column values are very close to each other, for example '2 Lump Sum'. Evidently the threshold for separating them into different blocks has not been met. I'm curious whether there are any strategies or methods to adjust this threshold based on an analysis of the whitespace along the PDF's x-axis.
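PyMuPDF exposes no such threshold directly, but since `get_text("words")` returns the coordinates, a custom gap threshold can be applied in plain Python. A sketch, assuming the words are already sorted left-to-right within one line; the `split_by_gap` name and the 6-point default are hypothetical values to tune per document:

```python
def split_by_gap(words_in_line, min_gap=6.0):
    # Group a (non-empty) line's word tuples into chunks wherever the
    # horizontal gap between consecutive words exceeds min_gap points.
    # Word tuple layout: (x0, y0, x1, y1, word, block_no, line_no, word_no)
    chunks, current = [], [words_in_line[0]]
    for prev, word in zip(words_in_line, words_in_line[1:]):
        if word[0] - prev[2] > min_gap:  # gap = next x0 minus previous x1
            chunks.append(current)
            current = []
        current.append(word)
    chunks.append(current)
    return chunks
```

Feeding it the word tuples of one text line then yields one chunk per visually separated column, with the column boundary controlled by `min_gap` instead of the extractor's built-in block logic.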

JorjMcKie commented 8 months ago

Let's continue under the "Discussions" tab ...