pymupdf / RAG

RAG (Retrieval-Augmented Generation) Chatbot Examples Using PyMuPDF
https://pymupdf.readthedocs.io/en/latest/pymupdf4llm
GNU Affero General Public License v3.0

Text Overlooked Due to Watermark Detection in PDFs #105

Closed Buckler89 closed 2 months ago

Buckler89 commented 3 months ago

When a PDF contains a watermark in the background (e.g., a diagonal watermark behind text and images across all pages), the watermark is detected as a large image, causing any text that may be present to be ignored. Currently, I haven't been able to remove these watermarks from the PDF using a script.

I’m wondering if pymupdf4llm.to_markdown can handle this situation in some way to ensure the text is still recognized even with the watermark present.

Here’s how I'm currently using it:

import os

import pymupdf4llm

# `pdf` (the input document) and `dest_path` (the output folder) are defined elsewhere.
md_pages = pymupdf4llm.to_markdown(
    doc=pdf,
    write_images=True,  # Write images to disk with names like "<doc_name>-<page_number>-<image_number>.png"
    # dpi=150,
    image_path=os.path.join(dest_path, "images"),
    image_format="png",
    force_text=False,
    # margins=(0, 0, 0, 0),
    # table_strategy="lines",
    graphics_limit=5000,
    page_chunks=True,
)

Is there a way to adjust the settings or approach so that the text isn’t ignored due to the watermark?

JorjMcKie commented 3 months ago

  1. We cannot handle issues without a reproducer.
  2. Watermark != Watermark - there is a plethora of fundamentally different ways to "watermark" pages. It can be text or images or vector graphics - all of these either in PDF's official way for specifying a watermark or using some arbitrary handcrafted way.
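
Since the watermark may be text, an image, or vector graphics, a quick per-page count of each content type can help narrow down which kind a given file uses before building a reproducer. The following is only a rough sketch: the helper names `summarize_page` and `summarize_pdf` are hypothetical, and it assumes a recent PyMuPDF that can be imported as `pymupdf` (older releases use `fitz`). The counts are a hint, not a watermark detector.

```python
def summarize_page(page) -> dict:
    """Count images, vector drawings, and text blocks on one PyMuPDF page."""
    # Each block is a tuple (x0, y0, x1, y1, text, block_no, block_type).
    blocks = page.get_text("blocks")
    return {
        "images": len(page.get_images(full=True)),
        "drawings": len(page.get_drawings()),
        "text_blocks": sum(1 for b in blocks if b[6] == 0),  # type 0 = text
    }


def summarize_pdf(pdf_path: str) -> list:
    """Return one summary dict per page of the PDF at pdf_path."""
    import pymupdf  # imported here so summarize_page works on any page-like object

    with pymupdf.open(pdf_path) as doc:
        return [summarize_page(page) for page in doc]
```

If every page reports the same single image, the watermark is probably an image; a consistently high drawing count points toward vector graphics instead.
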
JorjMcKie commented 3 months ago

You can specify force_text=True so that text which happens to lie inside an image / graphics area also appears in the output.
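
As a minimal sketch of that suggestion, the call from the original post can be repeated with `force_text` switched to `True` and everything else unchanged. The helper names `build_markdown_options` and `pdf_to_markdown_chunks` are hypothetical, and whether this recovers the missing text will depend on how the watermark is embedded in the particular file.

```python
import os


def build_markdown_options(dest_path: str) -> dict:
    """Keyword arguments from the original call, with force_text enabled."""
    return {
        "write_images": True,
        "image_path": os.path.join(dest_path, "images"),
        "image_format": "png",
        "force_text": True,  # also emit text that overlaps image / graphics areas
        "graphics_limit": 5000,
        "page_chunks": True,
    }


def pdf_to_markdown_chunks(pdf_path: str, dest_path: str):
    """Convert the PDF at pdf_path to per-page Markdown chunks."""
    import pymupdf4llm  # imported here so the options helper has no dependency

    return pymupdf4llm.to_markdown(doc=pdf_path, **build_markdown_options(dest_path))
```

With `page_chunks=True` the result is a list with one dictionary per page, so something like `pdf_to_markdown_chunks("input.pdf", "out")[0]["text"]` should give the Markdown of the first page (the file name and output folder here are placeholders).
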

JorjMcKie commented 2 months ago

Closing this because of lack of reaction over an extended period of time.