pymupdf / PyMuPDF

PyMuPDF is a high performance Python library for data extraction, analysis, conversion & manipulation of PDF (and other) documents.
https://pymupdf.readthedocs.io
GNU Affero General Public License v3.0

Pymupdf is unable to identify charts in the pdf #3690

Closed Sravan4465 closed 1 month ago

Sravan4465 commented 1 month ago

Description of the bug

I am using PyMuPDF to identify line spans and pictures, and found that it is unable to identify charts, either as pictures or under any separate label.

How to reproduce the bug

faster_rcnn.pdf

import pymupdf
from PIL import Image

def extract__positions(pdf_path, page_number):
    doc = pymupdf.open(pdf_path)
    page = doc.load_page(page_number)  # zero-based page indexing
    # textpage = page.get_textpage()
    # text = textpage.extractDICT()
    text = page.get_text("dict")
    pix = page.get_pixmap(dpi=200)  # render page to an image
    pix.save("page-%i.png" % page.number)
    img = Image.open(f"./page-{page.number}.png")
    text_positions = []
    block_positions = []
    width = text['width']
    height = text['height']
    pix_width = pix.width
    pix_height = pix.height

    print(width, height, pix_width, pix_height)
    scale = [pix_width/width, pix_height/height]

    blocks = text["blocks"]
    for block in blocks:
        # Record every block's bounding box (type 0 = text, type 1 = image)
        block_positions.append({"type": block["type"],
                                "bbox": block["bbox"]})
        if block["type"] == 0:
            # Text blocks also contribute one entry per span
            for line in block["lines"]:
                for span in line["spans"]:
                    text_positions.append({
                        "text": span["text"],
                        "bbox": span["bbox"]
                    })

    return text_positions, block_positions, width, height, img, scale
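
The `scale` value above maps PDF coordinates (points, 1/72 inch) to pixel coordinates in the rendered image. A minimal sketch of the same conversion, assuming the page is rendered at a known DPI; the helper name `scale_bbox` is hypothetical and not part of PyMuPDF, and the factor is only approximately equal to `pix.width / width` because pixmap dimensions are integers:

```python
def scale_bbox(bbox, dpi=200):
    """Convert a bbox from PDF points to pixels for a page rendered at `dpi`.

    Hypothetical helper: PDF user space uses 72 units per inch, so the
    scale factor is dpi / 72 in both directions.
    """
    factor = dpi / 72.0
    return tuple(coord * factor for coord in bbox)


# Usage: a span bbox from page.get_text("dict") rendered at 144 dpi
pixel_bbox = scale_bbox((72, 144, 216, 288), dpi=144)
```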

Visualization code

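import matplotlib.pyplot as plt
import matplotlib.patches as patches

def visualize_positions(text_positions, block_positions, img, xscale=1, yscale=1):
    # img is the rendered page image returned by extract__positions
    fig, ax = plt.subplots(figsize=(10, 15))

    # Draw each text span as a red outline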
    for item in text_positions:
        bbox = item["bbox"]

        # Calculate the width and height of the block
        x0, y0, x1, y1 = bbox
        width = x1 - x0
        height = y1 - y0

        # Create a rectangle patch
        rect = patches.Rectangle((x0*xscale, y0*yscale), width*xscale, height*yscale, linewidth=1, edgecolor='r', facecolor='none')
        ax.add_patch(rect)

    for item in block_positions:
        bbox = item["bbox"]

        x0, y0, x1, y1 = bbox
        width = x1 - x0
        height = y1 - y0
        if item["type"] == 0:
            # Text blocks: black outline
            rect = patches.Rectangle((x0*xscale, y0*yscale), width*xscale, height*yscale, linewidth=1, edgecolor='k', facecolor='none')
        else:
            # Non-text (image) blocks: translucent green fill
            rect = patches.Rectangle((x0*xscale, y0*yscale), width*xscale, height*yscale, linewidth=2, edgecolor='g', facecolor='green', alpha=0.3)
        ax.add_patch(rect)
    ax.imshow(img)
    plt.axis('off')
    plt.show()

output.pdf

PyMuPDF version

1.24.5

Operating system

macOS

Python version

3.11

JorjMcKie commented 1 month ago

It is unclear what you are reporting here: a bug? a missing feature?

Have you ever looked at cluster_drawings?
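
For readers following the pointer above: `page.get_text("dict")` returns only text blocks (type 0) and image blocks (type 1), so a chart drawn with vector graphics does not show up in either. A rough sketch of how `Page.cluster_drawings()` might be used to get bounding boxes of grouped vector graphics follows. The function names `chart_candidates` and `find_chart_boxes` and the size thresholds are assumptions for illustration, not part of PyMuPDF, and a cluster is not guaranteed to be a chart:

```python
def chart_candidates(rects, min_width=50.0, min_height=50.0):
    """Keep rectangles large enough to plausibly be a chart.

    `rects` is any iterable of (x0, y0, x1, y1) sequences. The thresholds
    are arbitrary assumptions meant to drop small decorations such as
    rules and underlines.
    """
    return [
        tuple(r)
        for r in rects
        if (r[2] - r[0]) >= min_width and (r[3] - r[1]) >= min_height
    ]


def find_chart_boxes(pdf_path, page_number, tolerance=3):
    """Return bounding boxes of clustered vector drawings on one page."""
    import pymupdf  # imported here so the filter above works without PyMuPDF

    doc = pymupdf.open(pdf_path)
    try:
        page = doc.load_page(page_number)  # zero-based page indexing
        # Joins neighboring vector graphics into rectangles
        clusters = page.cluster_drawings(
            x_tolerance=tolerance, y_tolerance=tolerance
        )
        return chart_candidates(clusters)
    finally:
        doc.close()
```

The returned boxes are in PDF coordinates, like the `bbox` values from `get_text("dict")`, so they can be scaled and drawn the same way.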

JorjMcKie commented 1 month ago

For the time being, we are going to move this post to Discussions. It seems obvious that no bug is being reported but instead the user's requirements can be addressed by using PyMuPDF features beyond those already employed.
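
As one possible way to combine such features with the code in the original post, drawing-cluster rectangles could be appended to `block_positions` under their own label. This is only a sketch: the type code `2` and the helper `add_chart_blocks` are hypothetical, since PyMuPDF itself only defines block types 0 (text) and 1 (image).

```python
CHART_TYPE = 2  # hypothetical label for vector-graphics clusters


def add_chart_blocks(block_positions, chart_rects):
    """Return a new list with chart rectangles appended as extra blocks.

    `chart_rects` is an iterable of (x0, y0, x1, y1) sequences, for example
    the rectangles returned by page.cluster_drawings().
    """
    merged = list(block_positions)  # leave the caller's list unchanged
    for rect in chart_rects:
        merged.append({"type": CHART_TYPE, "bbox": tuple(rect)})
    return merged


# Usage with made-up coordinates
blocks = add_chart_blocks(
    [{"type": 0, "bbox": (0, 0, 1, 1)}],
    [(10, 20, 30, 40)],
)
```

Because the visualization code fills every block whose type is not 0, blocks labeled this way would be drawn in green alongside image blocks.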