py-pdf / pdfly

CLI tool to extract (meta)data from PDF and manipulate PDF files
BSD 3-Clause "New" or "Revised" License

ENH: Add a remove duplicate pages functionality #54

Closed: ebotiab closed this issue 3 months ago

ebotiab commented 3 months ago

It could be useful to remove duplicate or nearly duplicate pages inside one or more PDFs. For example:

pdfly rm-dupl with_dupl_rm.pdf with_dupl_pages

One possible approach would be to convert each PDF page to an image and then remove the pages whose image hashes are similar.
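To make the idea concrete, here is a rough sketch of the comparison step only. It assumes each page has already been rendered and reduced to an integer perceptual hash by some other tool. The names `hamming_distance`, `pages_to_keep`, and `max_distance` are hypothetical and not part of pdfly or pypdf.

```python
from typing import List, Sequence


def hamming_distance(a: int, b: int) -> int:
    """Count the bits that differ between two integer hashes."""
    return bin(a ^ b).count("1")


def pages_to_keep(page_hashes: Sequence[int], max_distance: int = 0) -> List[int]:
    """Return the indices of pages to keep, dropping later (near-)duplicates.

    page_hashes: one hash per page, ASSUMED to be computed elsewhere from a
        rendered image of the page (rendering is not shown here).
    max_distance: two pages count as duplicates when their hashes differ in
        at most this many bits; 0 means exact hash matches only.
    """
    kept: List[int] = []
    for index, page_hash in enumerate(page_hashes):
        # Compare against every page already kept; the first occurrence wins.
        is_duplicate = any(
            hamming_distance(page_hash, page_hashes[k]) <= max_distance
            for k in kept
        )
        if not is_duplicate:
            kept.append(index)
    return kept
```

For example, `pages_to_keep([0b1010, 0b1010, 0b1011, 0b0101])` returns `[0, 2, 3]`, and with `max_distance=1` it returns `[0, 3]` because the third hash is one bit away from the first. Each page is compared against all kept pages, so the cost grows quadratically with the page count, which should be acceptable for typical documents.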

pubpub-zz commented 3 months ago

pypdf is not a "viewer" and cannot generate images from pages. This feature cannot be achieved.

ebotiab commented 3 months ago

Makes sense