When a PDF contains a watermark in the background (e.g., a diagonal watermark behind text and images across all pages), the watermark is detected as a large image, causing any text that may be present to be ignored. Currently, I haven't been able to remove these watermarks from the PDF using a script.
I’m wondering if pymupdf4llm.to_markdown can handle this situation in some way to ensure the text is still recognized even with the watermark present.
Here’s how I'm currently using it:
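import os
import pymupdf4llm

# `pdf` is the input document and `dest_path` the output directory (both defined elsewhere).
md_pages = pymupdf4llm.to_markdown(
    doc=pdf,
    write_images=True,  # write images to disk with names like "<doc_name>-<page_number>-<image_number>.png"
    # dpi=150,
    image_path=os.path.join(dest_path, "images"),
    image_format="png",
    force_text=False,  # text overlapping an image is not written to the Markdown
    # margins=(0, 0, 0, 0),
    # table_strategy="lines",
    graphics_limit=5000,
    page_chunks=True,  # return one dict per page instead of a single string
)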
Is there a way to adjust the settings or approach so that the text isn’t ignored due to the watermark?
Watermark != Watermark: there is a plethora of fundamentally different ways to "watermark" pages. A watermark can be text, an image, or vector graphics, and any of these can be created either through PDF's official mechanism for specifying a watermark or in some arbitrary handcrafted way.
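Since the right approach depends on which kind of watermark a file uses, here is a rough sketch for checking whether a page carries a large background image and, if so, converting with force_text=True. As far as the pymupdf4llm documentation describes it, that setting emits text even where it overlaps an image, whereas force_text=False (as in the call above) suppresses it. The helper names and the 0.5 area threshold are my own assumptions for illustration, not part of PyMuPDF or pymupdf4llm, and the check only covers image watermarks, not text or vector ones.

```python
def image_coverage(page):
    """Return (xref, fraction_of_page_area) for every image placement on a page.

    `page` is expected to be a pymupdf.Page; only `rect`, `get_images()` and
    `get_image_rects()` are used. Rectangles are not clipped to the page, so
    a fraction can exceed 1.0 for images that extend past the page border.
    """
    page_area = page.rect.width * page.rect.height
    coverage = []
    for item in page.get_images(full=True):
        xref = item[0]  # first entry of each item is the image xref
        for rect in page.get_image_rects(xref):
            coverage.append((xref, rect.width * rect.height / page_area))
    return coverage


def looks_like_image_watermark(page, min_fraction=0.5):
    """Heuristic (an assumption, not a library feature): an image covering at
    least `min_fraction` of the page may be a background watermark."""
    return any(frac >= min_fraction for _, frac in image_coverage(page))


def to_markdown_keep_text(doc, **kwargs):
    """Call pymupdf4llm.to_markdown with force_text=True, overriding any
    force_text value passed by the caller."""
    import pymupdf4llm  # imported lazily; requires `pip install pymupdf4llm`

    options = {**kwargs, "force_text": True}
    return pymupdf4llm.to_markdown(doc, **options)


if __name__ == "__main__":
    import sys
    import pymupdf  # requires `pip install pymupdf`

    doc = pymupdf.open(sys.argv[1])
    flagged = [p.number for p in doc if looks_like_image_watermark(p)]
    print("pages with a large image:", flagged)
    md_pages = to_markdown_keep_text(doc, write_images=True, page_chunks=True)
    print("pages converted:", len(md_pages))
```

If the flagged pages still lose their text with force_text=True, the watermark is probably not a plain image, and the page's drawings and text spans would be the next thing to inspect.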