pymupdf / RAG

RAG (Retrieval-Augmented Generation) Chatbot Examples Using PyMuPDF
https://pymupdf.readthedocs.io/en/latest/pymupdf4llm
GNU Affero General Public License v3.0

Remove headers & Footers #26

Closed G-Slient closed 4 months ago

G-Slient commented 4 months ago
JorjMcKie commented 4 months ago

Thank you for your contribution!

However, I am afraid we cannot do it using this approach:

  1. Using redaction annotations is unnecessary. If we know the header / footer block coordinates because of some magic, we can simply skip them in text extraction - without modifying the page.
  2. Taking the first / last text blocks is an incomplete check:
    • text may exist in any sequence on a page - at a minimum we would have to sort the blocks
    • what if the page only has 1 or 2 text blocks - which may not be intended as headers / footers?
    • we also need to check that candidate text is near enough to the top / bottom page borders
    • looking at blocks may be totally inadequate, because the page may consist of a single text block and still contain header / footer lines.
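As a rough illustration of points 1 and 2, here is a minimal sketch that skips candidate blocks during extraction instead of redacting them. It assumes blocks arrive as tuples starting with `(x0, y0, x1, y1, text)`, which is the layout PyMuPDF's `page.get_text("blocks")` uses. The function name `body_blocks` and the `band` parameter are hypothetical, chosen only for this example.

```python
from typing import List, Sequence, Tuple

# A block as (x0, y0, x1, y1, text, ...). Extra trailing fields are ignored.
Block = Tuple[float, float, float, float, str]


def body_blocks(
    blocks: Sequence[Block], page_height: float, band: float = 50.0
) -> List[Block]:
    """Return blocks in top-to-bottom order, skipping those that lie
    wholly within `band` points of the top or bottom page border."""
    # Blocks may be stored in any sequence, so sort by top edge, then left edge.
    ordered = sorted(blocks, key=lambda b: (b[1], b[0]))
    kept = []
    for b in ordered:
        y0, y1 = b[1], b[3]
        in_header_band = y1 <= band                # ends above the top band limit
        in_footer_band = y0 >= page_height - band  # starts below the bottom band limit
        if in_header_band or in_footer_band:
            continue  # skipped in extraction; the page itself is not modified
        kept.append(b)
    return kept
```

Because the decision depends on position rather than on being the first or last block, a page with only one or two body blocks keeps them. It still cannot handle the last case above, where header / footer lines sit inside a single large block.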

A more complete check would need to scan through all document pages and try to understand whether there are headers / footers at all, e.g. by looking for content similarities, equal / similar positions on the page, etc.
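To make the idea concrete, the following is a deliberately naive sketch of such a document-wide scan. It is not what this project does. It assumes each page has already been reduced to a list of `(y0, text)` lines, and it treats a line as a header / footer candidate when a similar text appears at a similar height on most pages. The names `repeated_lines`, `min_share` and `y_tolerance` are made up for this example.

```python
import re
from collections import Counter
from typing import List, Set, Tuple

# One text line as (y0, text); y0 is the distance from the top of the page.
Line = Tuple[float, str]
Signature = Tuple[int, str]


def _signature(y0: float, text: str, y_tolerance: float) -> Signature:
    # Replace digit runs so that "Page 3" and "Page 12" compare equal,
    # and bucket the vertical position so small shifts usually still match.
    normalized = re.sub(r"\d+", "#", text.strip().lower())
    return (round(y0 / y_tolerance), normalized)


def repeated_lines(
    pages: List[List[Line]], min_share: float = 0.6, y_tolerance: float = 5.0
) -> Set[Signature]:
    """Return signatures of lines recurring on at least `min_share` of pages."""
    counts: Counter = Counter()
    for lines in pages:
        # Use a set so each signature is counted once per page.
        counts.update({_signature(y, t, y_tolerance) for y, t in lines if t.strip()})
    threshold = min_share * len(pages)
    return {sig for sig, n in counts.items() if n >= threshold}
```

Even this toy version has obvious gaps: positions close to a bucket boundary can land in different buckets, alternating left / right page headers fall below the threshold, and chapter titles in running headers change throughout the document. That is part of why a robust solution is a complex undertaking.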

This is quite a complex undertaking - actually belonging in the hands of some upstream AI ...

I think the best approach is to introduce page margins as a parameter, e.g. margins=(left, top, right, bottom), with the option to specify just margins=(top, bottom) or margins=50 (applied to all 4 borders).

From these values, we would compute a clip rectangle and ignore everything outside it.
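The margin proposal could look roughly like the sketch below. It only illustrates the three accepted forms and the clip computation under the assumption that rectangles are plain `(x0, y0, x1, y1)` tuples in page coordinates. The helper names `normalize_margins`, `clip_rect` and `inside` are hypothetical and not part of any released API.

```python
from typing import Sequence, Tuple, Union

Rect = Tuple[float, float, float, float]  # (x0, y0, x1, y1)
Margins = Union[float, Sequence[float]]


def normalize_margins(margins: Margins) -> Rect:
    """Expand the shorthand forms to (left, top, right, bottom)."""
    if isinstance(margins, (int, float)):
        # margins=50: the same value for all 4 borders
        return (float(margins),) * 4
    values = tuple(float(m) for m in margins)
    if len(values) == 2:
        # margins=(top, bottom): no left / right margin
        top, bottom = values
        return (0.0, top, 0.0, bottom)
    if len(values) == 4:
        return values
    raise ValueError("margins must be a number or a sequence of 2 or 4 numbers")


def clip_rect(page_rect: Rect, margins: Margins) -> Rect:
    """Shrink the page rectangle by the margins on each side."""
    left, top, right, bottom = normalize_margins(margins)
    x0, y0, x1, y1 = page_rect
    return (x0 + left, y0 + top, x1 - right, y1 - bottom)


def inside(block_rect: Rect, clip: Rect) -> bool:
    """True if the block lies entirely within the clip rectangle."""
    return (
        block_rect[0] >= clip[0]
        and block_rect[1] >= clip[1]
        and block_rect[2] <= clip[2]
        and block_rect[3] <= clip[3]
    )
```

For an A4 page of 595 x 842 points, `clip_rect((0, 0, 595, 842), 50)` gives `(50.0, 50.0, 545.0, 792.0)`. With PyMuPDF, a rectangle like this could presumably be passed as the `clip` argument of `page.get_text()`, so that text outside it is never extracted.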