Closed kingennio closed 2 months ago
Thanks for the useful enhancement suggestions! I think we will adopt them all.
I am implementing your suggestions now. Here is what I intend to do; your comments are appreciated!
- image_path (your suggestion output_dir) to specify a folder where images should be stored when write_images=True. Alternatively, the existing parameter write_images could be re-used for this: if True, the old behavior; if a string, use it as the path specification.
- image_format (your suggestion image_extension) to specify the desired image format. All PyMuPDF-supported image output formats are accepted, given as the extension: "png" (default), "jpg", etc.
- force_text=None (your suggestion always_extract_text), which exhibits the current behavior. If True, all text will be written, even if it appears on a background covered by images or graphics.

As per the last point above: force_text=True actually contradicts write_images=True, doesn't it?
IAW if write_images=True, the page area containing the image/graphic is written to an image which also shows any text written on top of it.
In any case, I think if write_images is not true, then all text should always be written - because otherwise we would in fact lose information: namely all the text on images / graphics: write_images=False implies force_text=True.
The only case to consider therefore is write_images=True and force_text=True.
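The alternative mentioned above, re-using write_images as either a flag or a folder path, could be sketched roughly as follows. This is only an illustration: the helper name resolve_image_settings and its return shape are hypothetical and not part of pymupdf4llm, and the sketch does not assume where the "old behavior" stores images (it returns None for the folder in that case).

```python
from pathlib import Path
from typing import Optional, Tuple, Union


def resolve_image_settings(
    write_images: Union[bool, str] = False,
    image_format: str = "png",
) -> Tuple[bool, Optional[Path], str]:
    """Hypothetical helper: read write_images as a flag or as a folder path.

    Returns (images_enabled, target_folder_or_None, extension).
    """
    # Normalize the extension so that "PNG" and ".jpg" are both accepted.
    extension = image_format.lower().lstrip(".")

    if isinstance(write_images, str):
        # A string both enables image output and names the target folder.
        return True, Path(write_images), extension

    # A plain bool keeps the old behavior; None means "no explicit folder".
    return bool(write_images), None, extension


print(resolve_image_settings(True))
print(resolve_image_settings("images", "JPG"))
```

A separate image_path parameter keeps write_images a pure boolean; overloading write_images saves a parameter at the cost of a mixed bool-or-string type.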
Thank you for these new features! "IAW if write_images=True, the page area containing the image/graphic is written to an image which also shows any text written on top of it." Perhaps that is the expected behavior, but in my case I had background images with text on them (it was a PDF exported from a pptx presentation), and when the images are saved, only the background images are saved, without the text on top of them. That's the reason I modified the code to allow extracting the text even if it intersects the image rectangle. Thank you again for your utility function. I'm in the process of migrating all my previous RAG extractions from unstructured.io to pymupdf4llm.
If I could take the radical path "if all text is needed, then ignore all images/graphics", the best and simplest solution is to remove these things via redaction annotations and retain only the text. The only problem here is graphics that constitute table gridlines, but other than that (you may have no tables to consider), you can easily do this today.
I am planning to do this inside the code: if write_images=False and force_text=True, I "soft-remove" all images and graphics ... after I have extracted tables.
Shouldn't that be write_images=True and force_text=True? Or am I missing something?
Sorry, may have confused things:
- write_images==False and force_text==True: write all text (even if other objects underneath)
- write_images==False and force_text==False: current behavior (probably nonsense)
- write_images==True and force_text==False: current behavior (text with object underneath not repeated as text)
- write_images==True and force_text==True: images are written, any text on them also appears as text separately.

So: We should probably write all text always except explicitly excluded (3rd line above).
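The conclusion above (write all text except in the third combination) can be expressed as a small truth table. This is a sketch of the stated rule only, not the library's implementation; the function name overlay_text_is_written is hypothetical.

```python
from itertools import product


def overlay_text_is_written(write_images: bool, force_text: bool) -> bool:
    """Return True if text lying on top of images/graphics is emitted as text."""
    # Only the third combination suppresses such text: images are written
    # and the caller did not ask for text to be forced.
    return not (write_images and not force_text)


# Print all four combinations in the same order as the list above would sort.
for write_images, force_text in product([False, True], repeat=2):
    result = overlay_text_is_written(write_images, force_text)
    print(f"write_images={write_images!s:5} force_text={force_text!s:5} -> {result}")
```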
Fixed with v0.0.10.
Hi there, I'd like to suggest adding some parameters to the to_markdown function to increase flexibility. They require only very minor modifications to the code. I've found that many presentations (PDF conversions from pptx) use images as a background theme with text on top of them. When I extract the images from the PDF, such text is not recovered because it lies in the same rectangle as the background image. It would be useful to have a flag in the function that controls this behavior. What I did was modify, at line 592, the call:

    text_rects = column_boxes(
        page,
        paths=actual_paths,
        no_image_text=write_images,
        textpage=textpage,
        avoid=tab_rects0 + vg_clusters0,
        footer_margin=margins[3],
        header_margin=margins[1],
    )

into:

    text_rects = column_boxes(
        page,
        paths=actual_paths,
        no_image_text=write_images & ~always_extract_text,
        textpage=textpage,
        avoid=tab_rects0 + vg_clusters0,
        footer_margin=margins[3],
        header_margin=margins[1],
    )

The flag always_extract_text=True allows extracting text even if it lies within an image rectangle. The default, to_markdown(......., always_extract_text=False), preserves the current behavior.
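A note on the expression write_images & ~always_extract_text: because bool is a subclass of int, it is evaluated as integer arithmetic and yields 0 or 1 rather than True or False. Python 3.12 and later also emit a DeprecationWarning for ~ applied to a bool. The sketch below compares the integer form with the more idiomatic "and not" form; both function names are hypothetical, and the explicit int() calls only spell out what the bitwise form computes.

```python
from itertools import product


def no_image_text_bitwise(write_images: bool, always_extract_text: bool) -> int:
    # Equivalent to `write_images & ~always_extract_text` on bools:
    # ~1 == -2 and ~0 == -1, so the result is 1 only for (True, False).
    return int(write_images) & ~int(always_extract_text)


def no_image_text_logical(write_images: bool, always_extract_text: bool) -> bool:
    # Idiomatic boolean form; returns a real bool.
    return write_images and not always_extract_text


# Check that both forms agree in truthiness for every combination.
for write_images, always_extract_text in product([False, True], repeat=2):
    bitwise = no_image_text_bitwise(write_images, always_extract_text)
    logical = no_image_text_logical(write_images, always_extract_text)
    assert bool(bitwise) == logical
    print(write_images, always_extract_text, bitwise, logical)
```

The two forms agree for plain booleans, so the choice is about readability and avoiding the deprecation warning rather than behavior.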
Two other useful parameters are output_dir and image_extension.
Thank you for your useful package!