jsvine / pdfplumber

Plumb a PDF for detailed information about each char, rectangle, line, et cetera — and easily extract text and tables.
MIT License
6.57k stars 659 forks

Checking for duplicated pages #324

Closed belisarenata closed 3 years ago

belisarenata commented 3 years ago

I'm working with files that contain multiple standard pages that are useless for my purpose, and I want to remove them. Is there a method or recommended way to check whether there are duplicated pages within a PDF?

jsvine commented 3 years ago

Hi @belisarenata, and thanks for your interest in this library. Depending on how much the non-duplicated pages differ from one another, the simplest approach might be to keep track of whether the result of page.extract_text() is unique (vis-a-vis previously seen pages). If that's not precise enough, you could check both the text and, perhaps, a hash of the x0 and top properties of every object on the page.
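A minimal sketch of that suggestion follows; it is an illustration, not an official pdfplumber feature. The helper names `page_fingerprint` and `find_duplicate_pages` are hypothetical. The code assumes each page exposes `extract_text()` and an `objects` dict mapping object type to a list of dicts with `x0` and `top` keys, as pdfplumber pages do. Rounding coordinates to one decimal place is also an assumption, meant to tolerate tiny floating-point differences.

```python
import hashlib


def page_fingerprint(page, use_layout=False):
    """Hash a page's text and, optionally, the positions of its objects."""
    # extract_text() may return None or "" for pages without text.
    text = page.extract_text() or ""
    digest = hashlib.sha256(text.encode("utf-8"))
    if use_layout:
        # Collect (type, x0, top) for every object, sorted so the hash
        # does not depend on the order in which objects are listed.
        positions = sorted(
            (kind, round(float(obj["x0"]), 1), round(float(obj["top"]), 1))
            for kind, objs in page.objects.items()
            for obj in objs
        )
        digest.update(repr(positions).encode("utf-8"))
    return digest.hexdigest()


def find_duplicate_pages(pages, use_layout=False):
    """Return (duplicate_index, first_seen_index) pairs, 0-based."""
    seen = {}
    duplicates = []
    for index, page in enumerate(pages):
        key = page_fingerprint(page, use_layout)
        if key in seen:
            duplicates.append((index, seen[key]))
        else:
            seen[key] = index
    return duplicates


# Possible usage with pdfplumber ("example.pdf" is a placeholder path):
#
#     import pdfplumber
#     with pdfplumber.open("example.pdf") as pdf:
#         print(find_duplicate_pages(pdf.pages, use_layout=True))
```

Comparing text alone is the looser check; passing `use_layout=True` also requires object positions to match, which helps when boilerplate pages share wording but differ in layout.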