Closed kingennio closed 4 weeks ago
I'm also trying to extract a document that has pages like this one, page_with_textbox.pdf, which is a text block with a colored background. It seems it is always interpreted as a table, no matter the table strategy; the only difference is that the detected table varies with the strategy. Since the table is wrong anyway, I'd like to at least extract the text, but the above parameter no_image_text=True prevents that because, I reckon, the table "covers" the text.
After further analysis, stepping through the code, the problem seems to be the code that checks for drawings: page.get_drawings() and then page.cluster_drawings() find a rectangular area, stored in vg_clusters0, that covers the text in the call to column_boxes.
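The effect described above can be pictured with a toy model. This is only a sketch under an assumption, not the actual pymupdf_rag.py logic: it assumes that any text box overlapping a rectangle in the avoid list is dropped. The helper names `intersects` and `keep_text_boxes` and all coordinates are hypothetical and purely illustrative.

```python
from typing import List, Tuple

Rect = Tuple[float, float, float, float]  # (x0, y0, x1, y1)


def intersects(a: Rect, b: Rect) -> bool:
    """True if the two rectangles share some interior area."""
    return a[0] < b[2] and b[0] < a[2] and a[1] < b[3] and b[1] < a[3]


def keep_text_boxes(text_boxes: List[Rect], avoid: List[Rect]) -> List[Rect]:
    """Hypothetical filter: drop every text box that overlaps an avoid rectangle."""
    return [t for t in text_boxes if not any(intersects(t, a) for a in avoid)]


# Illustrative values: one vector graphics cluster spanning the colored box,
# one text line inside it, and one footer line below it.
vg_cluster = (124.44, 18.48, 807.12, 491.04)
text_boxes = [
    (134.03, 28.84, 774.23, 53.20),  # text line inside the colored box
    (29.28, 496.92, 400.0, 510.0),   # footer line outside the box
]

kept_with_avoid = keep_text_boxes(text_boxes, [vg_cluster])
kept_without_avoid = keep_text_boxes(text_boxes, [])
print(len(kept_with_avoid), len(kept_without_avoid))
```

Under that assumption, a single large cluster is enough to hide every text line drawn on top of it, while text outside the cluster survives.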
This thread does not represent a bug report or even any issue. The example file simply is a page that contains no text at all (apart from the footer stuff), but instead a bunch of images, 19 in total.
?? Why do you say so? Did you open the page? I have the original pptx and it's just a single text box that happens to have a yellow background. I'm not trying to disparage your work, which I highly appreciate. I've already provided constructive observations and bug findings in the past, and I was just trying to continue the trend. No worries....
Don't be mad with me, but look at this:
page.get_text()
'5\nGruppo TIM - Uso Interno - Tutti i diritti riservati.\n'
That is all the text that exists on the page! Nothing in the yellow box is text. For a simple proof, open the PDF in some PDF viewer and try to select anything within that yellow area: it won't work, there is no text in there.
But do not assume that the yellow area is just one image: it isn't that either. Look at this:
for img in page.get_image_info():  # iterate over all images and draw their borders
    page.draw_rect(img["bbox"], color=(0,0,1))
Point(29.280000686645508, 496.91998291015625)
Point(111.60000610351562, 107.63999938964844)
Point(111.60000610351562, 136.4399871826172)
Point(193.20001220703125, 136.67999267578125)
Point(232.79998779296875, 136.55999755859375)
Point(335.760009765625, 135.60000610351562)
Point(372.9599914550781, 135.24000549316406)
Point(111.60000610351562, 162.9600067138672)
Point(111.36000061035156, 189.95999145507812)
Point(111.36000061035156, 218.16000366210938)
Point(370.55999755859375, 219.24000549316406)
Point(443.4000244140625, 218.51998901367188)
Point(110.27999877929688, 245.1599884033203)
Point(111.12000274658203, 273.5999755859375)
Point(111.24000549316406, 300.9599914550781)
Point(111.12000274658203, 328.9200134277344)
Point(111.0, 411.8399963378906)
Point(151.44000244140625, 410.03997802734375)
Point(110.52000427246094, 434.6399841308594)
doc.ez_save("x.pdf")
This is what the output looks like: So your original PPT text has been converted to zillions of little images.
Yes, you are absolutely right! I confirm I cannot select the text in the PDF. Yet the pptx is indeed a text block and I simply printed the page to PDF. page.pptx
But oddly enough, stepping through the code, in my case page.get_image_info() only returns the small logo at the bottom left, whereas page.get_drawings() reports a lot of paths:
for i, d in enumerate(page.get_drawings()):
    print(f'{i}: {d["rect"]}')
0: Rect(0.0, -6.103515625e-05, 960.0, 539.9999389648438)
1: Rect(895.3200073242188, 500.1600036621094, 895.3200073242188, 520.6599731445312)
2: Rect(124.44000244140625, 18.47998046875, 807.1199951171875, 491.0400085449219)
3: Rect(134.02999877929688, 28.839996337890625, 774.22998046875, 53.20001220703125)
4: Rect(134.02999877929688, 63.8900146484375, 224.11000061035156, 86.79000854492188)
5: Rect(233.38999938964844, 64.14999389648438, 272.80999755859375, 82.07998657226562)
6: Rect(281.4700012207031, 64.07000732421875, 397.510009765625, 86.79000854492188)
7: Rect(406.7900085449219, 62.79998779296875, 448.30999755859375, 82.04998779296875)
8: Rect(451.95001220703125, 62.44000244140625, 765.469970703125, 86.79998779296875)
9: Rect(134.02999877929688, 96.04000854492188, 758.3900146484375, 120.3900146484375)
10: Rect(133.6999969482422, 128.95001220703125, 606.22998046875, 153.989990234375)
11: Rect(133.77000427246094, 163.239990234375, 438.45001220703125, 187.60000610351562)
12: Rect(449.0299987792969, 164.510009765625, 528.8400268554688, 187.58999633789062)
13: Rect(537.6699829101562, 163.64999389648438, 754.77001953125, 182.8800048828125)
14: Rect(132.41000366210938, 196.14999389648438, 798.260009765625, 221.20001220703125)
15: Rect(133.5, 230.70001220703125, 365.5899963378906, 254.79000854492188)
16: Rect(133.61000061035156, 264.25, 755.780029296875, 288.3699951171875)
17: Rect(133.39999389648438, 298.010009765625, 333.739990234375, 317.33001708984375)
18: Rect(133.33999633789062, 398.8900146484375, 178.91000366210938, 414.3699951171875)
19: Rect(182.60000610351562, 396.8699951171875, 794.7999877929688, 418.3699951171875)
20: Rect(132.74000549316406, 426.57000732421875, 352.2300109863281, 447.17401123046875)
These are then aggregated by page.cluster_drawings(drawings=paths) into one single large rectangle.
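To illustrate why many small paths can collapse into one rectangle, here is a minimal pure-Python sketch. It is not PyMuPDF's implementation: the merge rule (join rectangles that overlap or lie within a small tolerance, then repeat until nothing changes) and the helper names `touches`, `union` and `cluster` are assumptions made for illustration. The coordinates are rounded values of drawings 2, 3, 4 and 20 from the listing above.

```python
from typing import List, Tuple

Rect = Tuple[float, float, float, float]  # (x0, y0, x1, y1)


def touches(a: Rect, b: Rect, tol: float = 3.0) -> bool:
    """True if the rectangles overlap or lie within `tol` points of each other."""
    return (a[0] - tol <= b[2] and b[0] - tol <= a[2]
            and a[1] - tol <= b[3] and b[1] - tol <= a[3])


def union(a: Rect, b: Rect) -> Rect:
    """Smallest rectangle containing both a and b."""
    return (min(a[0], b[0]), min(a[1], b[1]), max(a[2], b[2]), max(a[3], b[3]))


def cluster(rects: List[Rect], tol: float = 3.0) -> List[Rect]:
    """Repeatedly merge touching rectangles until no pair touches any more."""
    clusters = list(rects)
    merged = True
    while merged:
        merged = False
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                if touches(clusters[i], clusters[j], tol):
                    clusters[i] = union(clusters[i], clusters[j])
                    del clusters[j]
                    merged = True
                    break
            if merged:
                break
    return clusters


paths = [
    (124.44, 18.48, 807.12, 491.04),   # drawing 2: the large colored box
    (134.03, 28.84, 774.23, 53.20),    # drawing 3
    (134.03, 63.89, 224.11, 86.79),    # drawing 4
    (132.74, 426.57, 352.23, 447.17),  # drawing 20
]
clusters = cluster(paths)
print(clusters)
```

Because every small rectangle lies inside the large box, the sketch ends with a single cluster equal to that box, which matches the "one single large rectangle" observation.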
Finally, in my case calling page.get_text() does return all the text in the box:
print(page.get_text())
Dopo aver effettuato i collegamenti(alimentazione, porta DSL per Fttc o Wan per Ftth) il modem si collegherà con HDM(Ns. ACS), che da remoto provvederà ad inserire la username/password corretta abilitando la navigazione(solo per clienti E@syIp senza un sistema di gestione proprietario) e provvederà a verificare/aggiornare il firmware presente, In tal modo l’operatività sul modem per il tecnico è da ritenersi conclusa Nota: in alternativa/se non è possibile con l’automatismo proseguire come da slide seguenti
Well, you definitely did not attach that PDF here! Outputting a document via a virtual PDF printer yields unpredictable results with respect to the structure of the created PDF. Every such printer does this in a different way. Either use the PPTX directly, or use the MS Office or LibreOffice export to PDF, to get better results.
Yes indeed, you're right, my bad. I used "print" from PowerPoint, while I should have used "save as pdf" for the page that I uploaded before. Here's the correct page, and the text is selectable. page.pdf Given this new state of the page, is the behavior I reported correct? That is, should page.cluster_drawings(drawings=paths) conceal all the text even when force_text=True? In addition, some weird tables are detected.
Whether you convert the PPT to PDF via MS Office or LibreOffice: the result never is a page with ordinary text on one yellow background. Instead, the text always is surrounded by vector graphics, mostly invisible because of the tone-in-tone colors. But the extraction algorithm cannot differentiate between intention and sloppiness. You effectively have two options at the moment:
1. Use the PPTX directly: pymupdf4llm.to_markdown("page.pptx", ...).
2. Remove all vector graphics from each page before text extraction, as in the script below. We are investigating whether a corresponding option (ignore vector graphics) would make sense here.
import pymupdf
import pymupdf4llm
import sys, pathlib

filename = sys.argv[1]
doc = pymupdf.open(filename)
md = ""

# do header identification only once, instead of for each page
hdr_info = pymupdf4llm.IdentifyHeaders(doc)

# loop through pages and remove all vector graphics before text extraction
for page in doc:
    page.add_redact_annot(page.rect)
    page.apply_redactions(
        images=pymupdf.PDF_REDACT_IMAGE_NONE,
        graphics=pymupdf.PDF_REDACT_LINE_ART_REMOVE_IF_TOUCHED,
        text=pymupdf.PDF_REDACT_TEXT_NONE,
    )
    md += pymupdf4llm.to_markdown(
        doc,
        pages=[page.number],
        margins=0,
        hdr_info=hdr_info,
        show_progress=False,
    )

pathlib.Path(doc.name + ".md").write_bytes(md.encode())
At line 863 of pymupdf_rag.py, the parameter 'no_image_text' is always set to True. Shouldn't it be the opposite of the force_text value given as input? Like:
text_rects = column_boxes(
    parms.page,
    paths=parms.actual_paths,
    no_image_text=(not force_text),
    textpage=parms.textpage,
    avoid=parms.tab_rects0 + parms.vg_clusters0,
    footer_margin=margins[3],
    header_margin=margins[1],
)