Closed satvik-27199 closed 8 months ago
There is no anomaly. What you see is an image with underlying OCR-ed text. Looking at it via a viewer will show that the match between text-as-image and OCRed text is sloppy, to put it politely: just try to mark text line-wise with the mouse and you will see what I mean. With OCR quality like this, you can rely on nothing.
Anyway - thank you for your appreciation of PyMuPDF. Here is how to find blocks of text with the "ignore" attribute - which is usually (not always) used for storing OCR results:
bboxes=page.get_bboxlog()
Every item's first sub-item is the block type. For hidden text the type is "ignore-text".
Thank you for your quick response. Is there a method I can use to classify these documents? I have approximately 100,000 documents, and some of them may exhibit the issue I described. I need to identify and remove these problematic documents from the corpus. Is there an efficient way to achieve this?
No, unless you want to exclude all OCRed PDFs.
Yes, we can exclude these documents. Is there any way to do it automatically using some function from PyMuPDF, or do we need to do it manually?
You can execute the above method and check whether "ignore-text" items are present and whether the page is fully covered by at least one image. If so, it is probably an OCRed page.
doc=fitz.open("7.pdf")
page=doc[0]
page.get_images()
[(8, 0, 3800, 5550, 1, 'DeviceGray', '', 'img5', 'JBIG2Decode')]
page.get_image_rects(8)
[Rect(0.0, 0.0, 456.0, 666.0)]
page.rect
Rect(0.0, 0.0, 456.0, 674.0) # image almost covers the page
bbl=page.get_bboxlog()
set([b[0] for b in bbl if "text" in b[0]])
{'ignore-text', 'fill-text'} # mixture of normal and OCR text on page
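The check shown in the session above could be wrapped into a helper for batch use. This is only a minimal sketch: the function names and the 0.95 coverage threshold are assumptions made for illustration, not part of PyMuPDF. The decision logic works on plain tuples so it can be tried without a PDF at hand; `page_looks_ocred` feeds it from a PyMuPDF page using only the calls shown above.

```python
def coverage(page_rect, image_rect):
    """Fraction of the page area covered by one image rectangle."""
    px0, py0, px1, py1 = page_rect
    ix0, iy0, ix1, iy1 = image_rect
    # Width and height of the intersection, clamped at zero if disjoint.
    w = max(0.0, min(px1, ix1) - max(px0, ix0))
    h = max(0.0, min(py1, iy1) - max(py0, iy0))
    page_area = (px1 - px0) * (py1 - py0)
    return (w * h) / page_area if page_area > 0 else 0.0


def looks_like_ocr_page(page_rect, image_rects, text_types, min_coverage=0.95):
    """Heuristic: hidden ("ignore-text") text plus an image covering the page.

    min_coverage is an assumed threshold; tune it on your own corpus.
    """
    has_hidden_text = "ignore-text" in text_types
    has_full_image = any(coverage(page_rect, r) >= min_coverage for r in image_rects)
    return bool(has_hidden_text and has_full_image)


def page_looks_ocred(page, min_coverage=0.95):
    """Apply the heuristic to a PyMuPDF page object."""
    image_rects = []
    for img in page.get_images():
        # The first item of each image entry is its xref.
        image_rects.extend(tuple(r) for r in page.get_image_rects(img[0]))
    text_types = {item[0] for item in page.get_bboxlog() if "text" in item[0]}
    return looks_like_ocr_page(tuple(page.rect), image_rects, text_types, min_coverage)
```

For a corpus, one would open each file with fitz.open(), call page_looks_ocred() on every page, and flag the document if any page (or most pages) returns True. As noted above, "ignore-text" is usually, but not always, used for OCR results, so the result is a probable classification only.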
We just tested this on 40-50 documents, and it correctly identifies the OCRed documents. Thank you for the help and the timely response. We really appreciate it and are glad we made the switch to PyMuPDF.
Looks like for OCRed documents, we need to use the combination of detection transformers + pymupdf to make something work.
I have one more question🙏. In the PDF attached (63 (2).pdf), some column values are very close to each other, for example, '2 Lump Sum'. Obviously, the threshold has not been met to separate them into different blocks. I'm curious if there are any strategies or methods available to adjust this threshold based on an analysis of the whitespace according to the PDF's x-coordinate.
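One possible way to experiment with such a threshold is to take word-level tuples (as returned by page.get_text("words")) for a single line and start a new cell whenever the horizontal gap exceeds a chosen value. The sketch below is an illustration only: the function name, the min_gap value, and the sample coordinates are hypothetical and not taken from the attached PDF.

```python
def split_line_by_gap(words, min_gap):
    """Group the words of ONE text line into cells.

    words: tuples whose first five items are (x0, y0, x1, y1, text).
    min_gap: horizontal distance (in points) that starts a new cell.
    """
    ordered = sorted(words, key=lambda w: w[0])  # left to right
    cells = []
    for word in ordered:
        # Gap between the previous word's right edge and this word's left edge.
        if cells and word[0] - cells[-1][-1][2] < min_gap:
            cells[-1].append(word)   # close enough: same cell
        else:
            cells.append([word])     # first word, or gap too large: new cell
    return [" ".join(w[4] for w in cell) for cell in cells]


# Hypothetical coordinates: gap of 3 after "2", gap of 2 inside "Lump Sum".
sample = [
    (10.0, 0.0, 15.0, 10.0, "2"),
    (18.0, 0.0, 40.0, 10.0, "Lump"),
    (42.0, 0.0, 60.0, 10.0, "Sum"),
]
print(split_line_by_gap(sample, min_gap=2.5))
```

A fixed threshold only helps when the gaps between columns are measurably larger than the gaps between words inside a cell; deriving min_gap from the distribution of gaps on the page is one option to explore.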
Let's continue under the "Discussions" tab ...
Exploring Text and Bounding Box Extraction Anomalies in PDFs with PyMuPDF
Thank you for the excellent work. I'd like to mention an issue related to text and bounding box (bbox) extraction. I've been attempting to extract text values and their corresponding position vectors from tables, and the solution works wonderfully. However, I've encountered a problem with some PDFs (sample attached). When trying to extract information using PyMuPDF, despite there being significant whitespace between the words "GP" and "Unreserved", it groups them into one block. To understand the root cause, I conducted a word-level bbox extraction and discovered that the space between "GP" and "Unreserved" is only 2-3 points in the x-coordinate space, which visually does not seem accurate. For example, the space for the "Reserved" vector spans approximately 30 points (274.56 - 244.95). So, why is the gap between "GP" and "Unreserved" only around 3 points (289.20 - 286.53)?
(242.27, 422.79, 339.06, 429.30, ' Reserved GP Unreserved GP\n')
(244.95, 422.79, 274.56, 429.30, 'Reserved', 9, 0, 0)
(277.23, 422.79, 286.53, 429.30, 'GP', 9, 0, 1)
(289.20, 422.79, 327.09, 429.30, 'Unreserved', 9, 0, 2)
(329.76, 422.79, 339.06, 429.30, 'GP', 9, 0, 3)
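To make the comparison concrete, the gaps between neighbouring words can be computed directly from the word tuples above: each gap is the next word's x0 minus the current word's x1. The helper name below is made up for this illustration; the coordinates are copied from the output above.

```python
# Word tuples as reported above: (x0, y0, x1, y1, text, block, line, word).
words = [
    (244.95, 422.79, 274.56, 429.30, "Reserved", 9, 0, 0),
    (277.23, 422.79, 286.53, 429.30, "GP", 9, 0, 1),
    (289.20, 422.79, 327.09, 429.30, "Unreserved", 9, 0, 2),
    (329.76, 422.79, 339.06, 429.30, "GP", 9, 0, 3),
]


def horizontal_gaps(words):
    """Return (left_text, right_text, gap) for each pair of neighbouring words."""
    ordered = sorted(words, key=lambda w: w[0])  # left to right
    return [
        (a[4], b[4], round(b[0] - a[2], 2))  # next x0 minus current x1
        for a, b in zip(ordered, ordered[1:])
    ]


for left, right, gap in horizontal_gaps(words):
    print(f"{left!r} -> {right!r}: {gap} pt")
```

All three gaps come out at about 2.67 points, so the text layer spaces these words uniformly regardless of how far apart they appear in the image. Note also that 274.56 - 244.95 is the width of the word "Reserved", not a gap between words.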
7.pdf
How to reproduce the bug
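import fitz  # PyMuPDF

def extract_text_from_pdf(pdf_path):
    # Open the provided PDF file
    # NOTE: the function body is missing from the post; block-level extraction is assumed here
    with fitz.open(pdf_path) as doc:
        return [block for page in doc for block in page.get_text("blocks")]

# Specify the path to your PDF file
pdf_path = '/content/7.pdf'

# Extract text
extracted_text = extract_text_from_pdf(pdf_path)
print(extracted_text)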
PyMuPDF version
1.23.26
Operating system
MacOS
Python version
3.10