suggestion on useful api parameters

kingennio commented 2 months ago

Hi there, I'd like to suggest adding some parameters to the to_markdown function to increase its flexibility. They require only very minor modifications to the code. I've found that many presentations (PDFs converted from pptx) use images as a background theme with text on top of them. When I extract the images from the PDF, that text is not recovered because it lies in the same rectangle as the background image. It would be useful to have a flag in the function that controls this behavior. What I did was change the call at line 592 from:

    text_rects = column_boxes(
        page,
        paths=actual_paths,
        no_image_text=write_images,
        textpage=textpage,
        avoid=tab_rects0 + vg_clusters0,
        footer_margin=margins[3],
        header_margin=margins[1],
    )

into:

    text_rects = column_boxes(
        page,
        paths=actual_paths,
        no_image_text=write_images & ~always_extract_text,
        textpage=textpage,
        avoid=tab_rects0 + vg_clusters0,
        footer_margin=margins[3],
        header_margin=margins[1],
    )

The flag always_extract_text=True allows extracting text even if it lies in the same rectangle as an image. The default, to_markdown(......., always_extract_text=False), retains the current behavior.
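A side note on the expression used in that change, as a small self-contained sketch (the two helper names are hypothetical and not part of pymupdf4llm). On Python bools, `~` is integer bitwise inversion, so `~True` is `-2` and `~False` is `-1`; combined with `&` the result is an int that happens to have the intended truthiness. Python 3.12 and later also emit a DeprecationWarning for `~` on a bool. The logical form `and not` expresses the same intent directly:

```python
import warnings


def no_image_text_bitwise(write_images: bool, always_extract_text: bool) -> bool:
    """Expression as written in the comment, wrapped in bool() for comparison."""
    with warnings.catch_warnings():
        # Python 3.12+ warns that '~' on a bool is deprecated.
        warnings.simplefilter("ignore", DeprecationWarning)
        return bool(write_images & ~always_extract_text)


def no_image_text_logical(write_images: bool, always_extract_text: bool) -> bool:
    """Same intent using logical operators."""
    return write_images and not always_extract_text


# Both forms agree for all four combinations of plain bools.
for write_images in (False, True):
    for always_extract_text in (False, True):
        assert no_image_text_bitwise(write_images, always_extract_text) == \
            no_image_text_logical(write_images, always_extract_text)
```

The bitwise form only works while both values are real bools; if write_images were ever a string, `&` would raise a TypeError, whereas `and not` would still evaluate.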

Two other useful parameters would be:

An "output_dir" parameter (default None) that specifies the folder in which to save the extracted images. This allows a cleaner folder structure and avoids having to move the images by hand or with a separate piece of code.
An "image_extension" parameter (png, jpg, ....) for the format of the extracted images. For example, in my case I'd prefer jpg images because they take fewer bytes, which is important since I send the images to gpt-4o for analysis and extraction.

Thank you for your useful package!

JorjMcKie commented 2 months ago

Thanks for the useful enhancement suggestions! I think we will adopt them all.

JorjMcKie commented 2 months ago

I am implementing your suggestions now. Here is what I intend to do; your comments are appreciated!

New parameter image_path (your suggestion output_dir) to specify a folder where images should be stored when write_images=True. Alternatively, the existing parameter write_images could be re-used for this: if True, keep the old behavior; if a string, use it as the path specification.
New parameter image_format (your suggestion image_extension) to specify the desired image format. Any PyMuPDF-supported image output format can be given by its extension: "png" (default), "jpg", etc.
New parameter force_text (your suggestion always_extract_text), whose default None keeps the current behavior. If True, all text will be written, even if it appears on a background covered by images or graphics.
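As a rough illustration of how the three parameters listed above might be used together, here is a minimal sketch. It assumes the names and defaults given in this comment (image_path, image_format, force_text); build_options and convert are hypothetical helpers, and the folder name and file name are placeholders.

```python
from pathlib import Path


def build_options(image_dir=None, image_format="png", force_text=None):
    """Collect keyword arguments for to_markdown (hypothetical helper)."""
    opts = {
        "write_images": image_dir is not None,  # only write images if a folder is given
        "image_format": image_format,
    }
    if image_dir is not None:
        opts["image_path"] = str(Path(image_dir))
    if force_text is not None:
        opts["force_text"] = force_text
    return opts


def convert(pdf_path, **opts):
    """Call pymupdf4llm.to_markdown with the collected options."""
    import pymupdf4llm  # imported lazily so build_options works without the package

    return pymupdf4llm.to_markdown(pdf_path, **opts)


options = build_options(image_dir="extracted_images", image_format="jpg", force_text=True)
print(options)
# md_text = convert("slides.pdf", **options)  # placeholder file name
```

Keeping the option building separate from the call makes it easy to check which flags are sent before running a conversion.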

Regarding the last point above: force_text=True actually contradicts write_images=True, doesn't it? In other words, if write_images=True, the page area containing the image/graphic is written to an image which also shows any text written on top of it. In any case, I think that if write_images is not true, all text should always be written, because otherwise we would in fact lose information, namely all the text on images / graphics: write_images=False implies force_text=True.

The only case to consider therefore is write_images=True and force_text=True.

kingennio commented 2 months ago

Thank you for these new features! "In other words, if write_images=True, the page area containing the image/graphic is written to an image which also shows any text written on top of it." Perhaps that is the expected behavior, but in my case I had background images with text on them (it was a PDF from a pptx presentation), and when the images are saved, only the background images are saved, without the text on top of them. That's the reason I modified the code to allow extracting the text even if it intersects the image rectangle. Thank you again for your utility function. I'm in the process of migrating all my previous RAG extractions from unstructured.io to pymupdf4llm.

JorjMcKie commented 2 months ago

If I could take the radical path "if all text is needed, then ignore all images/graphics", the best and simplest solution would be to remove these things via redaction annotations and retain only the text. The only problem here is graphics that constitute table gridlines, but other than that (you may have no tables to consider), you can easily do this today.

JorjMcKie commented 2 months ago

I am planning to do this inside the code: If write_images=False and force_text=True I "soft-remove" all images and graphics ... after I have extracted tables.

kingennio commented 2 months ago

Shouldn't that be write_images=True and force_text=True? Or am I missing something?

JorjMcKie commented 2 months ago

Sorry, I may have confused things:

write_images==False and force_text==True: write all text (even if other objects underneath)
write_images==False and force_text==False: current behavior (probably nonsense)
write_images==True and force_text==False: current behavior (text with object underneath not repeated as text)
write_images==True and force_text==True: images are written, any text on them also appears as text separately.

So: we should probably always write all text, except when it is explicitly excluded (3rd line above).
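The rule proposed in this comment can be written as a small predicate. This is only a sketch of the stated conclusion (text on images is skipped solely when write_images is true and force_text is false), not the code that was released; the function name is hypothetical.

```python
def text_on_images_is_written(write_images: bool, force_text: bool) -> bool:
    """Return True when text overlapping an image or graphic is emitted as text.

    Encodes the conclusion above: text is always written unless images are
    written and force_text is off (the third case in the list).
    """
    return not (write_images and not force_text)


# Enumerate the four combinations discussed in the thread.
for write_images in (False, True):
    for force_text in (False, True):
        print(write_images, force_text, text_on_images_is_written(write_images, force_text))
```

Under this rule, write_images=False with force_text=False also writes all text, which differs from the "current behavior" described for that case in the list.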

JorjMcKie commented 2 months ago

Fixed with v0.0.10.

pymupdf / RAG

suggestion on useful api parameters #76