Closed belisarenata closed 3 years ago
Hi @belisarenata, and thanks for your interest in this library. Depending on how different the non-duplicated pages are from one another, the simplest approach might be just to keep track of whether the result of `page.extract_text()` is unique (vis-a-vis previously-seen pages). If that's not precise enough, you could check both the text and, perhaps, a hash of all `x0` and `top` properties of each object.
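A rough sketch of that idea, in case it helps. The helper names (`page_fingerprint`, `duplicate_indices`, `find_duplicate_pages`) are made up for illustration and are not part of pdfplumber. The rounding to one decimal place is also an assumption, meant to absorb tiny coordinate differences. Only `pdfplumber.open`, `pdf.pages`, `page.extract_text()`, and `page.objects` come from the library itself.

```python
import hashlib


def page_fingerprint(text, positions=()):
    """Hash a page's text plus the (x0, top) positions of its objects.

    Hypothetical helper, not a pdfplumber function.
    """
    h = hashlib.sha256()
    # Treat a missing text result as an empty string.
    h.update((text or "").encode("utf-8"))
    for x0, top in positions:
        # Assumption: round so tiny coordinate differences still match.
        h.update(f"|{round(float(x0), 1)},{round(float(top), 1)}".encode("utf-8"))
    return h.hexdigest()


def duplicate_indices(fingerprints):
    """Return the indices of pages whose fingerprint was already seen."""
    seen = set()
    dupes = []
    for i, fp in enumerate(fingerprints):
        if fp in seen:
            dupes.append(i)
        else:
            seen.add(fp)
    return dupes


def find_duplicate_pages(path):
    """Return zero-based indices of pages that repeat an earlier page."""
    import pdfplumber  # imported here so the helpers above work without it

    fingerprints = []
    with pdfplumber.open(path) as pdf:
        for page in pdf.pages:
            # page.objects maps object type ("char", "rect", ...) to a list of dicts.
            positions = [
                (obj["x0"], obj["top"])
                for objs in page.objects.values()
                for obj in objs
                if "x0" in obj and "top" in obj
            ]
            fingerprints.append(page_fingerprint(page.extract_text(), positions))
    return duplicate_indices(fingerprints)
```

Calling `find_duplicate_pages("some.pdf")` (the path is a placeholder) would give you the indices of pages to drop. Note that this treats all blank pages as duplicates of each other, and that pages differing only in something like a page number will not match, so you may need to strip such text before hashing.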
I'm working with files that have multiple standard pages that are useless for my purpose, and I want to remove them. Is there any method or recommended way to check whether there are duplicated pages within a PDF?