Remove headers & Footers

Thank you for your contribution!

However, I am afraid we cannot do it using this approach:

1. Using redaction annotations is unnecessary. If we know the header / footer block coordinates because of some magic, we can simply skip them in text extraction - without modifying the page.
2. Taking the first / last text blocks is an incomplete check:
   - text may exist in any sequence on a page - at a minimum we would have to sort the blocks
   - what if the page only has 1 or 2 text blocks - which may not be intended as headers / footers?
   - we also need to check that candidate text is near enough to the top / bottom page borders
   - looking at blocks may be totally inadequate because the page may be 1 text block but still have header / footer lines.
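
To illustrate the first point, here is a minimal sketch of skipping known regions during extraction while leaving the page untouched. It assumes blocks are tuples starting with (x0, y0, x1, y1, text), which matches the leading fields returned by PyMuPDF's `page.get_text("blocks")`, and that the header / footer rectangles are already known. `skip_regions` and `intersects` are hypothetical helper names, and the sample coordinates are made up.

```python
def intersects(a, b):
    """True if rectangles a and b (x0, y0, x1, y1) overlap with non-zero area."""
    return a[0] < b[2] and b[0] < a[2] and a[1] < b[3] and b[1] < a[3]


def skip_regions(blocks, excluded):
    """Return block texts in reading order, omitting blocks that overlap an excluded area.

    blocks:   iterable of tuples starting with (x0, y0, x1, y1, text)
    excluded: list of (x0, y0, x1, y1) rectangles, e.g. known header / footer areas
    """
    kept = [b for b in blocks if not any(intersects(b[:4], r) for r in excluded)]
    # Blocks may be stored in any sequence, so sort top-to-bottom, then left-to-right.
    kept.sort(key=lambda b: (b[1], b[0]))
    return [b[4] for b in kept]


# Hypothetical 612 x 792 page with one header and one footer block.
blocks = [
    (72, 750, 540, 765, "Page 3 of 10"),
    (72, 100, 540, 300, "Body paragraph one."),
    (72, 20, 540, 40, "ACME Annual Report"),
    (72, 320, 540, 500, "Body paragraph two."),
]
excluded = [(0, 0, 612, 50), (0, 742, 612, 792)]
print(skip_regions(blocks, excluded))
```

Nothing is redacted or rewritten here: the page stays as it is, and only the extraction result changes.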

A more complete check would need to scan through all document pages and try to understand whether there are headers / footers at all, e.g. by looking for content similarities, equal / similar positions on the page, etc.
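
As a rough idea of what such a scan could look like, the sketch below counts lines that recur across pages with similar text at a similar vertical position. It is only an illustration under strong assumptions: each page is given as a list of (y0, text) entries, digits are masked so that page numbers compare as equal, and the names `repeated_lines`, `min_share` and `y_tolerance` are hypothetical. Bucketing positions this way can split values that fall close to a bucket boundary, so a real implementation would need something more robust.

```python
import re
from collections import Counter


def repeated_lines(pages, min_share=0.6, y_tolerance=5):
    """Return (normalized_text, y_bucket) keys that recur on many pages.

    pages:       list of pages, each a list of (y0, text) line entries
    min_share:   fraction of pages a key must appear on to count as repeated
    y_tolerance: bucket size used to treat vertical positions as equal
    """
    counts = Counter()
    for lines in pages:
        keys = set()
        for y0, text in lines:
            # Mask digit runs so "Page 1" and "Page 2" compare as equal.
            norm = re.sub(r"\d+", "#", text.strip().lower())
            if norm:
                keys.add((norm, round(y0 / y_tolerance)))
        counts.update(keys)  # each key is counted at most once per page
    threshold = min_share * len(pages)
    return {key for key, n in counts.items() if n >= threshold}


# Made-up three-page document: same title line on top, page number at the bottom.
pages = [
    [(30, "ACME Report"), (120, "Intro text"), (760, "Page 1")],
    [(30, "ACME Report"), (120, "More text"), (760, "Page 2")],
    [(31, "ACME Report"), (120, "Final text"), (760, "Page 3")],
]
print(repeated_lines(pages))
```

Even this simple version needs every page of the document before it can decide anything about a single page.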

This is quite a complex undertaking - actually belonging in the hands of some upstream AI ...

I think the best option is to introduce page margins as a parameter, e.g. margins=(left, top, right, bottom), with the option to also specify just margins=(top, bottom) or margins=50 (applied to all 4 borders).

From these values, we would compute a clip rectangle and ignore everything outside it.
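
A minimal sketch of how the three forms of the margins value could be turned into a clip rectangle. The function names are hypothetical, and treating the 2-tuple form as leaving the left / right margins at 0 is an assumption on my side.

```python
def normalize_margins(margins):
    """Expand margins to (left, top, right, bottom).

    Accepts a single number (all four borders), a (top, bottom) pair,
    or a full (left, top, right, bottom) 4-tuple.
    """
    if isinstance(margins, (int, float)):
        return (margins,) * 4
    if len(margins) == 2:
        top, bottom = margins
        return (0, top, 0, bottom)  # assumption: no left / right margin
    if len(margins) == 4:
        return tuple(margins)
    raise ValueError("margins must be a number, a 2-tuple or a 4-tuple")


def clip_from_margins(page_rect, margins):
    """Shrink page_rect (x0, y0, x1, y1) by the margins and return the clip."""
    left, top, right, bottom = normalize_margins(margins)
    x0, y0, x1, y1 = page_rect
    clip = (x0 + left, y0 + top, x1 - right, y1 - bottom)
    if clip[0] >= clip[2] or clip[1] >= clip[3]:
        raise ValueError("margins leave no usable page area")
    return clip


# Example for a 612 x 792 page: ignore 40 points at the top and 60 at the bottom.
print(clip_from_margins((0, 0, 612, 792), (40, 60)))
```

The resulting tuple could then be handed to the extraction call, for instance as the `clip` argument of `page.get_text(...)`, so the page itself is never modified.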

pymupdf / RAG

Remove headers & Footers #26