pymupdf / RAG

RAG (Retrieval-Augmented Generation) Chatbot Examples Using PyMuPDF
https://pymupdf.readthedocs.io/en/latest/pymupdf4llm
GNU Affero General Public License v3.0

suggestion on useful api parameters #76

Closed: kingennio closed this issue 2 months ago

kingennio commented 2 months ago

Hi there, I'd like to suggest adding a few parameters to the to_markdown function to increase its flexibility. They require only very minor modifications to the code.

I've found that many presentations (PDF conversions from pptx) use images as a background theme with text on top of them. When I extract the images from the PDF, that text is not recovered because it lies in the same rectangle as the background image. It would be useful to have a flag in the function that controls this behavior. What I did was modify the call at line 592 from:

    text_rects = column_boxes(
        page,
        paths=actual_paths,
        no_image_text=write_images,
        textpage=textpage,
        avoid=tab_rects0 + vg_clusters0,
        footer_margin=margins[3],
        header_margin=margins[1],
    )

into:

    text_rects = column_boxes(
        page,
        paths=actual_paths,
        no_image_text=write_images & ~always_extract_text,
        textpage=textpage,
        avoid=tab_rects0 + vg_clusters0,
        footer_margin=margins[3],
        header_margin=margins[1],
    )

The flag always_extract_text=True allows text to be extracted even when it lies inside an image rectangle. The default, to_markdown(......., always_extract_text=False), keeps the current behavior.
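As a side note on the modified call, here is a minimal sketch of the flag logic in isolation. The helper names are hypothetical and are not part of pymupdf4llm. It illustrates that `write_images & ~always_extract_text` has the same truthiness as `write_images and not always_extract_text` for plain booleans, although the bitwise form yields an int and newer Python versions warn about `~` on bool.

```python
import warnings


def no_image_text_flag(write_images: bool, always_extract_text: bool) -> bool:
    """Hypothetical helper: True means text inside image rectangles is skipped."""
    return write_images and not always_extract_text


def bitwise_variant(write_images: bool, always_extract_text: bool) -> int:
    """The bitwise expression from the modified call; the result is an int."""
    with warnings.catch_warnings():
        # Newer Python versions emit a DeprecationWarning for ~ applied to bool.
        warnings.simplefilter("ignore", DeprecationWarning)
        return write_images & ~always_extract_text


# Both forms agree in truthiness for all four combinations of the two flags.
for w in (False, True):
    for a in (False, True):
        assert bool(bitwise_variant(w, a)) == no_image_text_flag(w, a)
```

The `and not` form is arguably easier to read and always returns a bool, which may matter if the value is later compared with `is True`.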

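Two other useful parameters would be output_dir (a folder in which to store the extracted images) and image_extension (the image format to use).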

Thank you for your useful package!

JorjMcKie commented 2 months ago

Thanks for the useful enhancement suggestions! I think we will adopt them all.

JorjMcKie commented 2 months ago

I am implementing your suggestions now. Here is what I intend to do; your comments are appreciated!

  1. New parameter image_path (your suggestion output_dir) to specify a folder where images should be stored when write_images=True. Alternatively, the existing parameter write_images could be re-used for this: if True, keep the old behavior; if a string, use it as the path specification.
  2. New parameter image_format (your suggestion image_extension) to specify the desired image format. Any PyMuPDF-supported image output format can be given as its extension: "png" (default), "jpg", etc.
  3. New parameter force_text (your suggestion always_extract_text) with default None, which keeps the current behavior. If True, all text will be written, even if it appears on a background covered by images or graphics.
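The alternative mentioned in point 1 (re-using write_images as either a bool or a path string) could be handled roughly as follows. This is only a sketch under stated assumptions: `resolve_image_options` is a hypothetical helper, not a pymupdf4llm function, and falling back to the current directory for `write_images=True` is an assumption about what "the old behavior" means.

```python
from pathlib import Path
from typing import Tuple, Union


def resolve_image_options(
    write_images: Union[bool, str], image_format: str = "png"
) -> Tuple[bool, Path, str]:
    """Hypothetical helper: normalize the image-related options.

    Returns (enabled, folder, extension).
    """
    enabled = bool(write_images)
    # A string doubles as the target folder; a plain bool keeps the default
    # folder (assumed here to be the current directory).
    folder = Path(write_images) if isinstance(write_images, str) else Path(".")
    # Accept "png", "PNG" or ".png" and normalize to a bare lowercase extension.
    extension = image_format.lower().lstrip(".")
    return enabled, folder, extension


enabled, folder, extension = resolve_image_options("slides_images", "JPG")
print(enabled, folder / f"page-0-image-0.{extension}")
```

A separate image_path parameter, as in the first variant of point 1, avoids overloading one argument with two meanings, at the cost of one more keyword.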

As per the last point above: force_text=True actually contradicts write_images=True, doesn't it? IAW if write_images=True, the page area containing the image/graphic is written to an image which also shows any text written on top of it. In any case, I think if write_images is not true, then all text should always be written - because otherwise we would in fact lose information: namely all the text on images / graphics: write_images=False implies force_text=True.

The only case to consider therefore is write_images=True and force_text=True.
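The reasoning in this comment can be summarized as a small decision rule. The sketch below is an illustration only: `effective_force_text` is a hypothetical name, and it encodes just the rule stated here (write_images=False implies force_text=True, and force_text=None keeps the current behavior), not the code that was eventually released.

```python
from typing import Optional


def effective_force_text(write_images: bool, force_text: Optional[bool] = None) -> bool:
    """Hypothetical helper: should text overlapping images/graphics be written?"""
    if not write_images:
        # No images are written, so skipping the text would lose information.
        return True
    # Images are written: None (the default) keeps the current behavior,
    # i.e. text inside image rectangles is left out.
    return bool(force_text)


# Print the full table of combinations for inspection.
for write_images in (False, True):
    for force_text in (None, False, True):
        result = effective_force_text(write_images, force_text)
        print(f"write_images={write_images!s:5} force_text={force_text!s:5} -> {result}")
```

Under this rule, the user's choice only changes the outcome when write_images=True, which matches the statement that this is the only case left to consider.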

kingennio commented 2 months ago

Thank you for these new features! "IAW if write_images=True, the page area containing the image/graphic is written to an image which also shows any text written on top of it." Perhaps that is the expected behavior, but in my case I had background images with text on them (it was a PDF from a pptx presentation), and when saving the images, only the background images are saved, without the text on top of them. That's the reason I modified the code to allow extracting the text even if it intersects the image rectangle. Thank you again for your utility function. I'm in the process of migrating all my previous RAG extractions from unstructured.io to pymupdf4llm.

JorjMcKie commented 2 months ago

If I could take the radical path "if all text is needed, then ignore all images/graphics", the best and simplest solution would be to remove these things via redaction annotations and retain only the text. The only problem here is graphics that constitute table gridlines. Other than that (you may have no tables to consider), you can easily do this today.

JorjMcKie commented 2 months ago

I am planning to do this inside the code: If write_images=False and force_text=True I "soft-remove" all images and graphics ... after I have extracted tables.

kingennio commented 2 months ago

Shouldn't it be write_images=True and force_text=True? Or am I missing something?

JorjMcKie commented 2 months ago

Sorry, I may have confused things:

So: We should probably write all text always except explicitly excluded (3rd line above).

JorjMcKie commented 2 months ago

Fixed with v0.0.10.