Open sebastiaanvduijn opened 2 days ago
I have been trying to debug where this comes from. It seems the missing text is treated as being in the wrong position, because the clip test uses the middle point of the span's bbox. The code below works better, but now not all whitespace is stripped out. Hope this helps:
for sno, s in enumerate(line["spans"]):  # the numbered spans
    sbbox = pymupdf.Rect(s["bbox"])  # span bbox as a Rect
    mpoint = (sbbox.tl + sbbox.br) / 2  # middle point
    margin = 0
    if not sbbox.intersects(clip):
        # expand clip if the span is near the edge
        if sbbox.x0 < clip.x0 or sbbox.x1 > clip.x1 or sbbox.y0 < clip.y0 or sbbox.y1 > clip.y1:
            clip = pymupdf.Rect(clip.x0 - margin, clip.y0 - margin, clip.x1 + margin, clip.y1 + margin)
        else:
            print(f"Span {s['text']} skipped due to position")
            continue
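To illustrate why a middle-point test can drop a span that an overlap test keeps, here is a minimal sketch. It uses plain `(x0, y0, x1, y1)` tuples instead of `pymupdf.Rect`, and the helper names and coordinates are made up for illustration; it is not the library's actual code.

```python
# Hypothetical illustration: rectangles are plain (x0, y0, x1, y1) tuples,
# not pymupdf.Rect objects, and all coordinates are invented.

def midpoint_in_clip(bbox, clip):
    """Keep a span only if the center of its bbox lies inside the clip."""
    mx = (bbox[0] + bbox[2]) / 2
    my = (bbox[1] + bbox[3]) / 2
    return clip[0] <= mx <= clip[2] and clip[1] <= my <= clip[3]


def overlaps_clip(bbox, clip):
    """Keep a span if its bbox overlaps the clip at all."""
    return (bbox[0] < clip[2] and bbox[2] > clip[0]
            and bbox[1] < clip[3] and bbox[3] > clip[1])


clip = (50, 100, 300, 200)
span = (250, 120, 400, 135)     # starts inside the clip, runs past its right edge
outside = (310, 120, 400, 135)  # lies entirely to the right of the clip

print(midpoint_in_clip(span, clip))  # center x is 325, beyond the clip's right edge
print(overlaps_clip(span, clip))
print(overlaps_clip(outside, clip))
```

With these numbers the span begins inside the clip but its center falls outside, so the two tests disagree about whether to keep it.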
As an example, converting this PDF to Markdown https://cache.industry.siemens.com/dl/files/702/109768702/att_998757/v4/109768702_UserAdministration_WinCC_V7.5_en.pdf results in:
4.2.1 Configuration of Access Protection
The following section describes how to configure the access protection of a button.
project directory.
Figure 4-16
**2. The "Graphics Designer" opens with an empty image. Drag and drop a "button"
from the standard library into the image (1) and assign a suitable name (2).**
Note You can link different objects that have access protection with different permissions. You can only assign one permission per object.
User Administration WinCC V7 5
In section 2 it stops extracting text in the middle of the block. The highlighted block should be this instead:
It does this for multiple PDFs: the extracted data is incomplete, even though text extraction with PyMuPDF itself works.