ENH: Add a remove duplicate pages functionality

py-pdf / pdfly

CLI tool to extract (meta)data from PDF and manipulate PDF files

BSD 3-Clause "New" or "Revised" License


Closed by ebotiab 5 months ago

ebotiab commented 5 months ago

It could be useful to remove duplicate or nearly duplicate pages inside one or more PDFs. For example:

pdfly rm-dupl with_dupl_rm.pdf with_dupl_pages

One possible approach would be to render each PDF page to an image and then remove the pages whose image hashes are similar.
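To make the idea concrete, here is a minimal sketch of the hash-and-compare step only. It assumes each page has already been rendered to a small grayscale pixel grid by some external renderer, which is not shown. The names `average_hash`, `hamming_distance`, and `unique_page_indices` are hypothetical and are not part of pdfly or pypdf. The hash is a simple average hash, chosen for illustration; other perceptual hashes would work in the same way.

```python
from typing import List, Sequence

# Assumption: a page arrives as rows of grayscale values (0-255),
# already rendered and downscaled by a separate tool.
Grid = Sequence[Sequence[int]]


def average_hash(grid: Grid) -> int:
    """Build one bit per pixel: 1 if the pixel is brighter than the mean."""
    pixels = [value for row in grid for value in row]
    mean = sum(pixels) / len(pixels)
    bits = 0
    for value in pixels:
        bits = (bits << 1) | (1 if value > mean else 0)
    return bits


def hamming_distance(a: int, b: int) -> int:
    """Count the bit positions in which two hashes differ."""
    return bin(a ^ b).count("1")


def unique_page_indices(hashes: Sequence[int], max_distance: int = 0) -> List[int]:
    """Keep the first page of each group of (near-)identical hashes."""
    kept: List[int] = []
    for index, page_hash in enumerate(hashes):
        # A page is kept only if it is far enough from every page kept so far.
        if all(hamming_distance(page_hash, hashes[k]) > max_distance for k in kept):
            kept.append(index)
    return kept


if __name__ == "__main__":
    pages = [
        [[0, 255], [255, 0]],    # page 0
        [[10, 250], [240, 5]],   # page 1: slightly noisy copy of page 0
        [[255, 0], [0, 255]],    # page 2: different layout
    ]
    hashes = [average_hash(page) for page in pages]
    print(unique_page_indices(hashes))
```

Raising `max_distance` makes the comparison more tolerant of small rendering differences, at the cost of possibly dropping pages that are only superficially alike.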

pubpub-zz commented 5 months ago

pypdf is not a "viewer" and cannot generate images from pages, so this feature cannot be achieved.

ebotiab commented 5 months ago

Makes sense