pymupdf / RAG

RAG (Retrieval-Augmented Generation) Chatbot Examples Using PyMuPDF
https://pymupdf.readthedocs.io/en/latest/pymupdf4llm
GNU Affero General Public License v3.0

Extraction of text stops in the middle while working fine with PyMuPDF #191

Open sebastiaanvduijn opened 2 days ago

sebastiaanvduijn commented 2 days ago

As an example, converting this PDF to Markdown https://cache.industry.siemens.com/dl/files/702/109768702/att_998757/v4/109768702_UserAdministration_WinCC_V7.5_en.pdf results in:

4.2.1 Configuration of Access Protection

The following section describes how to configure the access protection of a button.

  1. Open the WinCC "Graphics Designer" with a double click on the entry in the

project directory.

Figure 4-16

**2. The "Graphics Designer" opens with an empty image. Drag and drop a "button"

from the standard library into the image (1) and assign a suitable name (2).**

Note You can link different objects that have access protection with different permissions. You can only assign one permission per object.

User Administration WinCC V7 5


In step 2, text extraction stops in the middle of the block. The highlighted block should read like this instead:

  2. The "Graphics Designer" opens with an empty image. Drag and drop a "button" from the standard library into the image (1) and assign a suitable name (2). Click on the "Authorizations" button (3). The window with the available permissions of the project opens. Select the "User Administration" permission (4). Confirm the selection of permissions with the "OK" button (5). Confirm the configuration of the button with the "OK" button (6).

This happens with multiple PDFs: the extracted data is incomplete, whereas text extraction with plain PyMuPDF works fine.
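
To pin down exactly which text is lost, a small helper along these lines can compare the two outputs. This is only a sketch: `normalize` and `missing_sentences` are hypothetical names, the sentence splitting is deliberately naive, and the two input strings are assumed to come from `pymupdf4llm.to_markdown()` and PyMuPDF's `page.get_text()` for the same page.

```python
import re


def normalize(text: str) -> str:
    """Drop Markdown bold markers and collapse all whitespace runs to one space."""
    return re.sub(r"\s+", " ", text.replace("**", "")).strip()


def missing_sentences(plain_text: str, markdown_text: str) -> list[str]:
    """Return sentences of the plain text that do not occur in the Markdown output."""
    md = normalize(markdown_text)
    # Naive split: a sentence ends at '.', '!' or '?' followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", normalize(plain_text))
    return [s for s in sentences if s and s not in md]


# Shortened excerpts from the report above (assumed inputs, not real tool output).
plain = (
    'Drag and drop a "button" from the standard library into the image (1) '
    'and assign a suitable name (2). Click on the "Authorizations" button (3).'
)
markdown = (
    '**Drag and drop a "button"\n\n'
    "from the standard library into the image (1) and assign a suitable name (2).**"
)
lost = missing_sentences(plain, markdown)
print(lost)
```

Sentences that both extractions contain are filtered out, so the result lists only the text that the Markdown conversion dropped.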

sebastiaanvduijn commented 2 days ago

I have been trying to debug where this comes from. It seems the missing text is treated as being in the wrong position, because the check against the clip uses the middle point of the span. The code below works better, but now not all whitespace is stripped out. Hope this helps:

            for sno, s in enumerate(line["spans"]):  # the numbered spans
                sbbox = pymupdf.Rect(s["bbox"])  # span bbox as a Rect
                mpoint = (sbbox.tl + sbbox.br) / 2  # middle point

                margin = 0  # clip expansion in points; 0 leaves the clip unchanged
                if not sbbox.intersects(clip):
                    # expand clip if the span extends beyond any clip edge
                    if sbbox.x0 < clip.x0 or sbbox.x1 > clip.x1 or sbbox.y0 < clip.y0 or sbbox.y1 > clip.y1:
                        clip = pymupdf.Rect(clip.x0 - margin, clip.y0 - margin, clip.x1 + margin, clip.y1 + margin)
                    else:
                        print(f"Span {s['text']} skipped due to position")
                        continue
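
The difference between a middle-point test and an overlap test can be illustrated without PyMuPDF. The sketch below is not the library's code: `Box` is a hypothetical stand-in for `pymupdf.Rect`, the coordinates are made up, and the overlap rule is a simplification that may differ from `Rect.intersects` in edge cases such as empty rectangles.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    """Hypothetical stand-in for pymupdf.Rect: (x0, y0) top-left, (x1, y1) bottom-right."""
    x0: float
    y0: float
    x1: float
    y1: float

    def midpoint(self) -> tuple[float, float]:
        return ((self.x0 + self.x1) / 2, (self.y0 + self.y1) / 2)

    def contains_point(self, point: tuple[float, float]) -> bool:
        x, y = point
        return self.x0 <= x <= self.x1 and self.y0 <= y <= self.y1

    def intersects(self, other: "Box") -> bool:
        # True only if the two boxes share an area of non-zero size.
        return (self.x0 < other.x1 and other.x0 < self.x1
                and self.y0 < other.y1 and other.y0 < self.y1)


def keep_by_midpoint(span: Box, clip: Box) -> bool:
    """Keep a span only if its middle point lies inside the clip."""
    return clip.contains_point(span.midpoint())


def keep_by_overlap(span: Box, clip: Box) -> bool:
    """Keep a span if any part of it overlaps the clip."""
    return span.intersects(clip)


clip = Box(50, 100, 300, 120)       # made-up clip of one text line
long_span = Box(60, 102, 700, 118)  # starts inside the clip, runs far past its right edge
far_span = Box(60, 400, 200, 418)   # lies well below the clip

for name, span in (("long_span", long_span), ("far_span", far_span)):
    print(name, keep_by_midpoint(span, clip), keep_by_overlap(span, clip))
```

With these numbers the long span is dropped by the middle-point test, because its centre falls to the right of the clip, yet kept by the overlap test. The span far below the clip is rejected by both.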